I'm trying to read an XML file that has Chinese characters in it and then write it out to another XML file. But my output displays ??? when I open it with WordPad/IE.
Using WordPad, I pasted in some Chinese characters that I got from a website and saved the file as a Unicode document (infile.xml). The input file displays the characters correctly when I open it with WordPad/IE.
I specified UTF-8 as the encoding for the output. Am I missing something here?
WordPad's "Unicode document" format is UTF-16 (16-bit code units, little-endian, with a byte order mark at the start of the file), not UTF-8. So, if I understand what's going on here correctly, your input file is UTF-16 but you're reading it as UTF-8, which means each 16-bit character is being treated as separate 8-bit bytes. Those bytes aren't valid UTF-8 sequences for the original characters, so the decoder can't recover them, and what you then write out is no longer the text you started with. Try reading the input as UTF-16 and keeping UTF-8 for the output only.
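Here's a minimal sketch of that idea, assuming the input really is UTF-16 with a byte order mark (which is what WordPad's "Unicode" option should produce). The class name `Transcode` and the method `transcode` are made up for illustration; they aren't from the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Transcode {
    /**
     * Copies text from source to target, decoding with sourceCharset
     * and re-encoding with targetCharset. (Hypothetical helper.)
     */
    static void transcode(Path source, Charset sourceCharset,
                          Path target, Charset targetCharset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter writer = Files.newBufferedWriter(target, targetCharset)) {
            char[] buffer = new char[4096];
            int count;
            // Copy decoded characters, not raw bytes, so the encoding actually changes.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Transcode <utf16-input> <utf8-output>");
            return;
        }
        // UTF_16 uses the byte order mark to pick the byte order and
        // falls back to big-endian when no mark is present.
        transcode(Paths.get(args[0]), StandardCharsets.UTF_16,
                  Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

One caveat: this copies the text as-is, so if the XML declaration inside the file names an encoding, it would still need to be changed by hand to match the new output encoding.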
Hope that helps. (Or even makes sense. I need another cup of coffee...)
Give a man a fish, he'll eat for one day. Teach a man to fish, he'll drink all your beer.
Cheers, Jeff (SCJP 1.4, SCJD in progress, if you can call that progress...)
I'm still having problems trying to read a Chinese file and output it to another Chinese file. This time round, I've used the GB2312 character encoding instead.
The code remains largely the same, but my input and output files are now .htm files.
Somehow, when I view outfile.htm in IE, it still doesn't display the characters properly.
My infile.htm simply consists of 2 characters: 密码
What am I doing wrong or missing here?
Any suggestions or help is greatly appreciated.
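One way to narrow this down is to look at the actual bytes rather than at what IE shows. A minimal sketch, assuming a JDK where the GB2312 charset is available; the class `EncodingCheck` and the method `hex` are hypothetical names, not part of the code in question:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    /** Formats bytes as space-separated upper-case hex pairs. (Hypothetical helper.) */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters from infile.htm, written as escapes so the
        // source file's own encoding cannot interfere.
        String text = "\u5bc6\u7801";
        System.out.println("UTF-8 : " + hex(text.getBytes(StandardCharsets.UTF_8)));
        // GB2312 is not one of the charsets every Java runtime must provide.
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312: " + hex(text.getBytes(Charset.forName("GB2312"))));
        }
    }
}
```

For these two characters the UTF-8 bytes are E5 AF 86 E7 A0 81 and the GB2312 bytes are C3 DC C2 EB. Comparing those with the bytes in outfile.htm (for example by passing the result of `Files.readAllBytes` to `hex`) shows which encoding was really written. If the file holds 3F 3F instead, the characters were already replaced by "?" during encoding, typically because a charset that cannot represent them was used for the write. If the bytes are correct, the browser may simply be guessing the wrong encoding, and a tag such as `<meta http-equiv="Content-Type" content="text/html; charset=gb2312">` in the page head tells it which one to use.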
subject: Reading a UTF-8 file & Writing it to a UTF-8 file