Reading a UTF-8 file & Writing it to a UTF-8 file

Chengwei Lee
Ranch Hand

Joined: Apr 02, 2004
Posts: 884
Hi guys,

I'm trying to read an XML file that contains Chinese characters and then write it to another XML file, but the output shows ??? when I open it in WordPad or IE.

Using WordPad, I pasted in some Chinese characters that I got from a website and saved the file as a Unicode document (infile.xml). The input file displays the characters correctly when I open it in WordPad or IE.

I've specified UTF-8 as the encoding for the output. Am I missing something here?

Any help is much appreciated. Thanks.



SCJP 1.4 * SCWCD 1.4 * SCBCD 1.3 * SCJA 1.0 * TOGAF 8
Jeff Bosch
Ranch Hand

Joined: Jul 30, 2003
Posts: 804
Hi, Cheng -

WordPad's "Unicode document" format stores text as 16-bit Unicode (UTF-16), not UTF-8. Most Chinese characters take two bytes in UTF-16 but three bytes in UTF-8, so the two encodings are not interchangeable. By opening your input file as UTF-8, you are in effect reading each 16-bit character as separate 8-bit units, which destroys the original characters. So, if I understand what's going on here correctly, the thing to try is reading the file with the encoding it was actually saved in (UTF-16, which also means getting the byte order, or endianness, right) and only then writing it out as UTF-8.

Hope that helps. (Or even makes sense. I need another cup of coffee...)
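
The mismatch described above (16-bit Unicode input decoded as UTF-8) can be avoided by decoding with the charset the file was really saved in and encoding with the charset you want on output. The following is only a minimal sketch under assumptions: it assumes the WordPad file is UTF-16 with a byte-order mark, and the class name `Transcoder`, the method `transcode`, and the sample `<pw>` element are made up for illustration. The Chinese characters are written as `\u` escapes so the example does not depend on the encoding of the source file itself.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Transcoder {

    /** Copies text from 'in' to 'out', decoding with inCharset and encoding with outCharset. */
    static void transcode(Path in, Charset inCharset, Path out, Charset outCharset)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, outCharset)) {
            char[] buffer = new char[4096];
            int n;
            // Copy characters, not bytes: the reader decodes, the writer re-encodes.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the WordPad file (assumption): BOM bytes FF FE, then UTF-16LE text.
        byte[] body = "<pw>\u5BC6\u7801</pw>".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf16 = new byte[body.length + 2];
        utf16[0] = (byte) 0xFF;
        utf16[1] = (byte) 0xFE;
        System.arraycopy(body, 0, utf16, 2, body.length);

        Path in = Files.createTempFile("infile", ".xml");
        Path out = Files.createTempFile("outfile", ".xml");
        Files.write(in, utf16);

        // StandardCharsets.UTF_16 uses the byte-order mark to pick the byte order when decoding.
        transcode(in, StandardCharsets.UTF_16, out, StandardCharsets.UTF_8);
        System.out.println("bytes written: " + Files.size(out));
    }
}
```

If the input turns out not to be UTF-16, the reader will usually fail with an exception rather than silently producing garbage, which is a useful hint that the assumed charset is wrong. For XML, any `encoding` attribute in the XML declaration should also match the charset actually used for the output file.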


Give a man a fish, he'll eat for one day. Teach a man to fish, he'll drink all your beer.
Cheers, Jeff (SCJP 1.4, SCJD in progress, if you can call that progress...)
Chengwei Lee
Ranch Hand

Joined: Apr 02, 2004
Posts: 884
Hi folks,

I'm still having problems reading a Chinese file and writing it to another Chinese file. This time round, I've used the GB2312 character encoding instead.

The code remains largely the same, but my input and output files are now .htm files.



Somehow, when I view outfile.htm in IE, it still cannot display the characters properly.

My infile.htm simply consists of 2 characters: 密码

What am I doing wrong or missing here?

Any suggestions or help are greatly appreciated.

Thanks!
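
One way to narrow down a problem like this is to look at the bytes rather than at what a viewer renders. The sketch below is illustrative only: the names `EncodingCheck`, `hex`, and `reinterpret` are hypothetical, and it assumes the JDK in use ships the GB2312 charset (standard JDK builds normally do). It encodes the two characters from infile.htm in several charsets, and shows what happens when bytes written in one charset are decoded as another. Note that `String.getBytes` substitutes `?` for characters the target charset cannot represent, which is one common source of question marks in output files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String text = "\u5BC6\u7801"; // the two characters from infile.htm
        Charset gb2312 = Charset.forName("GB2312");

        System.out.println("GB2312     : " + hex(text.getBytes(gb2312)));
        System.out.println("UTF-8      : " + hex(text.getBytes(StandardCharsets.UTF_8)));
        // ISO-8859-1 cannot represent these characters, so each becomes '?' (0x3F).
        System.out.println("ISO-8859-1 : " + hex(text.getBytes(StandardCharsets.ISO_8859_1)));

        // Same charset on both sides round-trips; a mismatch does not.
        System.out.println("GB2312 read as GB2312 ok: "
                + text.equals(reinterpret(text, gb2312, gb2312)));
        System.out.println("GB2312 read as UTF-8 ok : "
                + text.equals(reinterpret(text, gb2312, StandardCharsets.UTF_8)));
    }
}
```

If the bytes in outfile.htm match the GB2312 encoding of the input, the file itself is fine and the viewer is probably guessing the wrong encoding; in that case the page may need to declare its charset (for example in a `meta` tag) so that the browser decodes it as GB2312.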
 
 