
Reading a UTF-8 file & Writing it to a UTF-8 file

Chengwei Lee
Ranch Hand

Joined: Apr 02, 2004
Posts: 884
Hi guys,

I'm trying to read an XML file that contains Chinese characters and then write it out to another XML file. But the output displays ??? when I open it with WordPad or IE.

Using WordPad, I pasted in some Chinese characters that I got from a website and saved the file as a Unicode document (infile.xml). The input file displays the characters correctly when I open it with WordPad or IE.

I've specified UTF-8 as the encoding for the output. Am I missing something here?

Any help is much appreciated. Thanks.
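
For reference, a minimal sketch of a file copy that names the charset on both the reading and the writing side, so nothing falls back to the platform default encoding. The class name Utf8Copy and the copy helper are made up for illustration, and the demo uses temporary files rather than the real infile.xml. It only round-trips correctly if the input file really is encoded in the charset passed in.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8Copy {

    // Copies text from 'in' to 'out', decoding with inCharset and encoding with outCharset.
    // (Hypothetical helper, not taken from the original post.)
    static void copy(Path in, Charset inCharset, Path out, Charset outCharset) throws IOException {
        try (Reader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(in), inCharset));
             Writer writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(out), outCharset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with temporary files so the sketch runs on its own.
        Path in = Files.createTempFile("infile", ".xml");
        Path out = Files.createTempFile("outfile", ".xml");
        // \u5bc6\u7801 are the two Chinese characters used later in this thread.
        Files.write(in, "<word>\u5bc6\u7801</word>".getBytes(StandardCharsets.UTF_8));

        copy(in, StandardCharsets.UTF_8, out, StandardCharsets.UTF_8);

        boolean same = Arrays.equals(Files.readAllBytes(in), Files.readAllBytes(out));
        System.out.println("bytes identical after copy: " + same);
    }
}
```

If the charset given for the input does not match how the file was actually saved, the reader substitutes replacement characters and the damage is carried into the output no matter which encoding the writer uses.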



SCJP 1.4 * SCWCD 1.4 * SCBCD 1.3 * SCJA 1.0 * TOGAF 8
Jeff Bosch
Ranch Hand

Joined: Jul 30, 2003
Posts: 804
Hi, Cheng -

WordPad's "Unicode document" format stores text as 16-bit units (UTF-16), which is not the same encoding as UTF-8. By opening your input file as UTF-8, you are in effect asking the reader to interpret those 16-bit units one byte at a time, which destroys the original characters. So, if I understand what's going on here correctly, you're reading 16-bit Unicode characters as 8-bit sequences, most of which aren't valid UTF-8, and that is probably where the ??? comes from. Try reading the input as UTF-16 and keep UTF-8 for the output only.

Hope that helps. (Or even makes sense. I need another cup of coffee...)
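
To make the mismatch concrete, here is a rough sketch that looks at the byte order mark (BOM) at the start of the data and picks a charset from it. BomSniffer, detect, bomLength and decode are hypothetical names for this example. It assumes the file begins with a BOM (as files saved from WordPad as "Unicode" normally do) and falls back to UTF-8 when none is found, which is only a guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    // Guesses the charset from a leading byte order mark; falls back to UTF-8.
    // Note: FF FE is treated as UTF-16LE here; UTF-32 BOMs are not handled.
    static Charset detect(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return StandardCharsets.UTF_8;
    }

    // Number of BOM bytes to skip before decoding.
    static int bomLength(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return 3; // UTF-8 BOM
        }
        if (b.length >= 2 && (((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                || ((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF))) {
            return 2; // UTF-16 BOM, either byte order
        }
        return 0;
    }

    // Decodes the bytes using the detected charset, skipping the BOM itself.
    static String decode(byte[] b) {
        int skip = bomLength(b);
        return new String(b, skip, b.length - skip, detect(b));
    }

    public static void main(String[] args) {
        String original = "\u5bc6\u7801";
        byte[] body = original.getBytes(StandardCharsets.UTF_16LE);
        byte[] file = new byte[body.length + 2];
        file[0] = (byte) 0xFF; // little-endian UTF-16 BOM
        file[1] = (byte) 0xFE;
        System.arraycopy(body, 0, file, 2, body.length);

        String wrong = new String(file, StandardCharsets.UTF_8); // the mismatch described above
        String right = decode(file);

        System.out.println("decoded as UTF-8 matches original: " + wrong.equals(original));
        System.out.println("decoded via BOM matches original:  " + right.equals(original));
    }
}
```

Once the text has been decoded correctly into a String, writing it back out through an OutputStreamWriter set to UTF-8 should give a proper UTF-8 file.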


Give a man a fish, he'll eat for one day. Teach a man to fish, he'll drink all your beer.
Cheers, Jeff (SCJP 1.4, SCJD in progress, if you can call that progress...)
Chengwei Lee
Ranch Hand

Joined: Apr 02, 2004
Posts: 884
Hi folks,

I'm still having problems reading a Chinese file and writing it out to another Chinese file. This time round, I've used the GB2312 character encoding instead.

The code remains largely the same, but my input and output files are now .htm files.



Somehow, when I view outfile.htm in IE, it still cannot display the characters properly.

My infile.htm consists of just 2 characters: 密码

What am I doing wrong or missing here?

Any suggestions or help would be greatly appreciated.

Thanks!
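
One way to narrow this down is to check the encoding step on its own, away from files and the browser. The sketch below encodes the two characters as GB2312, decodes them again, and transcodes them to UTF-8. Gb2312RoundTrip and transcode are made-up names, and it assumes the JRE in use ships the GB2312 charset (the standard JDK builds normally do, but it is not one of the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Gb2312RoundTrip {

    // Re-encodes bytes from one charset to another by going through a String.
    // (Hypothetical helper for this example.)
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GB2312")) {
            System.out.println("GB2312 is not available in this JRE");
            return;
        }
        Charset gb2312 = Charset.forName("GB2312");

        // Unicode escapes avoid depending on the encoding of the .java source file.
        String text = "\u5bc6\u7801";

        byte[] gbBytes = text.getBytes(gb2312);
        byte[] utf8Bytes = transcode(gbBytes, gb2312, StandardCharsets.UTF_8);

        // Decoding with the matching charset should give the text back.
        System.out.println("GB2312 round trip ok: " + new String(gbBytes, gb2312).equals(text));
        System.out.println("UTF-8 after transcode ok: "
                + new String(utf8Bytes, StandardCharsets.UTF_8).equals(text));
        // Decoding GB2312 bytes with the wrong charset does not.
        System.out.println("GB2312 bytes read as UTF-8 ok: "
                + new String(gbBytes, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the round trip works here, the remaining suspects are how infile.htm was actually saved and how the browser decides which encoding to use for outfile.htm. A page containing nothing but the two characters gives the browser no charset declaration to go on, so it may be guessing.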
 