
charset conversion CP1252 to UTF-16

swapnel surade
Ranch Hand

Joined: Mar 05, 2009
Posts: 124
Hello,

I'm using i-net PDF Content Comparer v1.10. After a comparison, I try to read the difference string.
That string appears to be in CP-1252 format, but Java recognizes only UTF formats, and in this process I'm losing characters.
What is the correct way to convert from CP-1252 to UTF-16 or UTF-8 without losing the characters?

Thanks
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 13842

Well, no, a String doesn't have an encoding or a charset. An array of bytes (or something similar, such as a file) will have a charset if it represents text, but when you convert that array to a String you interpret the bytes according to some charset. If you don't specify one, then your system default will be used. Likewise, when you convert a String to bytes, you will again be using a charset.
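
To make the bytes-versus-String distinction concrete, here is a minimal sketch. The class name DecodeDemo and the sample bytes are made up for illustration; "windows-1252" is the Java charset name for CP1252. The charset only matters at the two conversion points, decoding bytes into a String and encoding a String back into bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // "windows-1252" is the Java name for the CP1252 code page.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Bytes -> String: the charset says how to interpret the bytes.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    // String -> bytes: the charset says how to represent the characters.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // Illustrative input: in CP1252 the byte 0x96 is the en dash (U+2013).
        byte[] raw = { 'a', (byte) 0x96, 'b' };

        String text = decode(raw, CP1252);
        System.out.println((int) text.charAt(1)); // 8211, i.e. U+2013

        // The same String encoded as UTF-8: the en dash takes three bytes.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5
    }
}
```

Once the bytes have been decoded with the right charset, the String itself carries no charset; it can be written out again in UTF-8 or UTF-16 at the point where bytes are needed.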

So your question is not on the right track. Perhaps you could post some code, if you can't figure out where the incorrect encoding or decoding is taking place?
swapnel surade
Ranch Hand

Joined: Mar 05, 2009
Posts: 124
Hi,

When I get the string, it looks like this:
1st string : Text "‐000001875‐0/000" was changed to "‐000001893‐0/000"
but when I print this string or use it for comparison, it looks like this:
2nd string : Text "?000001875?0/000" was changed to "?000001893?0/000"

I checked the charset of the 1st string and it shows CP1252. The character I'm getting is not the plain hyphen '-'; it is a different character.

When I convert this string to UTF-8 or UTF-16, the special character is converted to '?'.

I should get a hyphen in the second string.

The following is the code snippet:


In the above code, when I get the value from the getDescription() method, I get the special character. But when I use getBytes("CP1252"),
the special character is converted to '?' in the resulting byte array.

Am I using the wrong charset?
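
One possible explanation for the '?' can be checked with a short sketch. It assumes the special character is U+2010 HYPHEN, which is how it appears in the quoted output; that is an assumption, not something confirmed in the thread. The class name UnmappableDemo and the helper fitsIn are made up for illustration. CP1252 has no byte for U+2010, and String.getBytes substitutes the charset's replacement byte for characters it cannot encode.

```java
import java.nio.charset.Charset;

public class UnmappableDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if every character of s has a byte representation in cs.
    static boolean fitsIn(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String hyphen = "\u2010"; // assumed: U+2010 HYPHEN, not in CP1252
        String enDash = "\u2013"; // U+2013 EN DASH, present in CP1252 as 0x96

        System.out.println(fitsIn(hyphen, CP1252)); // false
        System.out.println(fitsIn(enDash, CP1252)); // true

        // getBytes replaces unmappable characters with '?' (byte value 63).
        byte[] bytes = hyphen.getBytes(CP1252);
        System.out.println(bytes.length + " " + bytes[0]); // 1 63
    }
}
```

If that assumption holds, the character is lost at the getBytes("CP1252") step, before any UTF-8 or UTF-16 conversion takes place.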


Ireneusz Kordal
Ranch Hand

Joined: Jun 21, 2008
Posts: 423
swapnel surade wrote:

I checked the charset of the 1st string and it shows CP1252. The character I'm getting is not the plain hyphen '-'; it is a different character.


Please post the character code of this 'hyphen'.
Could it be a 'hyphen' that was copied and pasted from an MS Word document?
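
One way to get the character code is to print the code points of the string. This is a minimal sketch; the class name CodePoints and the helper describe are made up for illustration, and the sample input is only an example of three similar-looking characters.

```java
public class CodePoints {
    // Formats each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain hyphen-minus, HYPHEN, and EN DASH look alike but differ in code.
        System.out.println(describe("-\u2010\u2013"));
        // U+002D U+2010 U+2013
    }
}
```

Running describe on the difference string would show whether the 'hyphen' is U+002D or some other character.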
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 13842

I would just throw both of those lines of code away.

The first line says: Convert this string to bytes using the CP-1252 charset.

The second line says: Convert these bytes to a string assuming that the UTF-8 charset was used to encode the bytes.

So clearly the second line is going to cause trouble, because it's using an assumption which is false. The way to fix that is to just leave the string alone and not do either of those lines of code.
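
The original snippet is not shown in the thread, so the following is only a hypothetical reconstruction of the two lines as described above, together with the alternative of leaving the string alone. The class name RoundTripDemo, the method mismatched, and the sample text are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical reconstruction: encode as CP1252, then decode as UTF-8.
    static String mismatched(String s) {
        byte[] bytes = s.getBytes(CP1252);                // line 1: String -> CP1252 bytes
        return new String(bytes, StandardCharsets.UTF_8); // line 2: bytes read as UTF-8
    }

    public static void main(String[] args) {
        // Sample text assuming the special character is U+2010 HYPHEN.
        String description = "Text \"\u2010000001875\u20100/000\"";

        // The round trip does not give the original string back.
        System.out.println(mismatched(description).equals(description)); // false

        // Even a character CP1252 can encode is damaged: the single byte 0xE9
        // is not valid UTF-8, so it decodes to U+FFFD.
        System.out.println((int) mismatched("\u00E9").charAt(0)); // 65533

        // Leaving the String alone and encoding only when bytes are needed:
        byte[] out = description.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(out, StandardCharsets.UTF_8).equals(description)); // true
    }
}
```

Encoding and decoding with the same charset (UTF-8 here) preserves the text; mixing two charsets in one round trip does not.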
 