| Author |
charset conversion CP1252 to UTF-16
|
swapnel surade
Ranch Hand
Joined: Mar 05, 2009
Posts: 124
|
|
Hello,
I'm using i-net PDF Content Comparer v1.10. After a comparison, I try to read the difference string.
That string is in CP-1252 format, but Java only recognizes UTF formats, so I'm losing characters in the process.
What is the correct way to convert from CP-1252 to UTF-16 or UTF-8 without losing characters?
Thanks
|
 |
Paul Clapham
Bartender
Joined: Oct 14, 2005
Posts: 13842
|
|
Well, no, a String doesn't have an encoding or a charset. An array of bytes (or something similar, like a file) will have a charset if it represents text, but when you convert that array to a String you interpret it according to some charset. If you don't specify one, then your system default will be used. Likewise, when you convert a String to bytes, you will again be using a charset.
So your question is not on the right track. Perhaps you could post some code, if you can't figure out where the incorrect encoding or decoding is taking place?
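As a minimal sketch of the point above: the charset is an argument to the conversion between bytes and a String, not a property of the String. The class and method names here (`CharsetBasics`, `decodeCp1252`, `encodeUtf8`) are made up for illustration, and the example assumes the JVM provides the `windows-1252` charset, which is Java's canonical name for CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetBasics {
    // Assumption: the runtime supports windows-1252 (Java's name for CP1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Bytes become text only when a charset is named for the interpretation.
    static String decodeCp1252(byte[] raw) {
        return new String(raw, CP1252);
    }

    // Text becomes bytes only when a charset is named for the encoding.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x96 };        // the en dash as stored in CP1252
        String text = decodeCp1252(raw);     // now just a sequence of Unicode chars
        System.out.println(Integer.toHexString(text.charAt(0))); // prints 2013
        System.out.println(encodeUtf8(text).length);             // prints 3
    }
}
```

Once the bytes have been decoded with the right charset, the String itself needs no further "conversion"; a charset only comes into play again when it is written back out as bytes.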
|
 |
swapnel surade
Ranch Hand
Joined: Mar 05, 2009
Posts: 124
|
|
Hi,
When I get the string, it looks like this:
1st string : Text "‐000001875‐0/000" was changed to "‐000001893‐0/000"
but when I print it or use it for comparison, it looks like this:
2nd string : Text "?000001875?0/000" was changed to "?000001893?0/000"
I checked the charset of the 1st string and it shows CP1252. The character is not the hyphen '-'; it is a different character.
When I convert this string to UTF-8 or UTF-16, the special character is converted to '?'.
I should get a hyphen in the second string.
Following is the code snippet
In the above code, when I get the value from the getDescription() method, I get the special character, but when I use getBytes("CP1252"),
the resulting byte array has that special character converted to '?'.
Am I using the wrong charset?
|
 |
Ireneusz Kordal
Ranch Hand
Joined: Jun 21, 2008
Posts: 423
|
|
swapnel surade wrote:
I checked the charset of the 1st string and it shows CP1252. The character is not the hyphen '-'; it is a different character.
Please post the character code of this 'hyphen'.
Isn't it a 'hyphen' copied from an MS Word document using copy-paste?
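One way to find the character code being asked for is to print each code point of the string in `U+XXXX` form. This is only a sketch; `CodePointDump` and `describe` are hypothetical names, and the sample input assumes the character in question is U+2010 (HYPHEN), which is what the pasted text appears to contain.

```java
class CodePointDump {
    /** Returns each code point of the text as U+XXXX, separated by spaces. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes keep the sample independent of the source file's encoding.
        System.out.println(describe("-"));       // ASCII hyphen-minus: U+002D
        System.out.println(describe("\u2010"));  // assumed culprit: U+2010
    }
}
```

In the real program the argument would be the value returned by getDescription(), so the output shows exactly which character the library produced.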
|
 |
Paul Clapham
Bartender
Joined: Oct 14, 2005
Posts: 13842
|
|
I would just throw both of those lines of code away.
The first line says: Convert this string to bytes using the CP-1252 charset.
The second line says: Convert these bytes to a string assuming that the UTF-8 charset was used to encode the bytes.
So clearly the second line is going to cause trouble, because it's using an assumption which is false. The way to fix that is to just leave the string alone and not do either of those lines of code.
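The original snippet is not shown in the thread, so the following is a reconstruction from the description above, not the poster's actual code. It assumes the odd character is U+2010 (HYPHEN), which has no mapping in CP1252; under that assumption the '?' already appears in the first step, because `getBytes` substitutes the charset's replacement byte for anything it cannot encode. `RoundTripDemo`, `roundTrip` and `fitsInCp1252` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Assumption: the runtime supports windows-1252 (Java's name for CP1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Reconstruction of the two lines described above (an assumption). */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(CP1252);              // unmappable chars become '?'
        return new String(bytes, StandardCharsets.UTF_8);  // decoded with a different charset
    }

    /** Reports whether every character of the text can be encoded in CP1252. */
    static boolean fitsInCp1252(String text) {
        return CP1252.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String description = "Text \"\u2010000001875\u20100/000\"";
        System.out.println(fitsInCp1252(description)); // prints false
        System.out.println(roundTrip(description));    // prints Text "?000001875?0/000"
    }
}
```

If a '?' still shows up when the untouched string is printed, the console's own encoding may be unable to display the character; that would be a separate issue from the string's contents.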
|
 |