baffling UTF-8 problem

Joe Simone
Greenhorn

Joined: Feb 16, 2005
Posts: 25
I have an XML file that I created by serializing a Document object with Xerces, specifying an OutputFormat encoding of "UTF-8".

It contains French text, and when I open the file in the TextPad editor it looks right. In TextPad's binary view mode, I can see the accented French characters as the bytes E8, E9 and EA. The XML header, as serialized from the Document object, is "<?xml version="1.0" encoding="UTF-8"?>".

So my assumption is the XML file is correctly UTF-8. But is it? Should the accented chars be represented as 00 E8, 00 E9 and 00 EA as defined by the UTF-8 spec for all code points above 7F?
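
For reference, UTF-8 does not pad code points above 7F with a 00 byte. It encodes U+00E8, U+00E9 and U+00EA as two bytes each (C3 A8, C3 A9, C3 AA). Lone E8, E9 and EA bytes are what a single-byte encoding such as ISO-8859-1 or windows-1252 produces. A minimal sketch to check this; the class and method names are illustrative and not from the original post:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8ByteDump {
    /** Returns the bytes of text in the given charset as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E8, U+00E9, U+00EA: e with grave, acute and circumflex accents.
        String accented = "\u00E8\u00E9\u00EA";
        System.out.println("UTF-8:      " + hex(accented, StandardCharsets.UTF_8));      // C3 A8 C3 A9 C3 AA
        System.out.println("ISO-8859-1: " + hex(accented, StandardCharsets.ISO_8859_1)); // E8 E9 EA
    }
}
```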

The baffling part comes when I try to read this file back in. I use this code ...


or alternately in this way ...



Well, in either case, all I get in the Eclipse console is text with a bunch of question marks where the accented French chars should be.
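
One likely explanation for the question marks, assuming the file really contains the lone bytes E8, E9 and EA: those are not valid UTF-8 sequences, so a UTF-8 decoder substitutes the replacement character U+FFFD, which a console that cannot display it will typically print as "?". A small sketch of that effect (the names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    /** Decodes raw bytes as UTF-8; malformed input is replaced with U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The word "tres" with a grave accent, stored one byte per character,
        // so the accented letter is the lone byte E8.
        byte[] latin1 = "tr\u00E8s".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = decodeAsUtf8(latin1);
        // The lone E8 is not a complete UTF-8 sequence, so it does not survive decoding.
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
        System.out.println("contains U+00E8: " + (decoded.indexOf('\u00E8') >= 0));
    }
}
```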

I am baffled as to WHY I can't get UTF-8 to successfully round trip.

For completeness I give the XML serialization code below ...



I seem to have accounted for UTF-8 encoding all through the code, but for some reason ... it doesn't work.

Many thanks to anyone who can shed light on what I am doing wrong.

Kind regards,
Joe
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18987
    

Your InputStreamReader has the correct encoding specified. But I looked all through your code and I couldn't find where you wrote your XML to a file. Declaring the encoding in your serializer is all very well but if you follow up by writing its output in some other encoding then you have shot yourself in the foot.

When I'm processing XML I prefer to give the parser the input in bytes and let the serializer write bytes to the target. That way they can handle the encoding themselves, which they do properly.
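
A rough sketch of that bytes-in, bytes-out approach. It uses the standard JAXP DocumentBuilder and Transformer rather than the Xerces serializer from the original post, and the class and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class XmlBytesRoundTrip {
    /** Parses from raw bytes; the parser reads the encoding from the XML declaration. */
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }

    /** Serializes to raw bytes; the serializer does the character encoding itself. */
    static void write(Document doc, OutputStream out) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>tr\u00E8s</mot>"
                .getBytes(StandardCharsets.UTF_8);
        Document doc = parse(new ByteArrayInputStream(xml));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(doc, out);

        // Parse the serialized bytes again to confirm the accent survived.
        Document again = parse(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(again.getDocumentElement().getTextContent().equals("tr\u00E8s"));
    }
}
```

With real files, a FileInputStream and FileOutputStream would take the place of the in-memory streams; no Reader or Writer is involved at any point.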
Joe Simone
Greenhorn

Joined: Feb 16, 2005
Posts: 25
Wow! That was it. The XML string coming out of the parser was correct but when I was writing it to a file, the encoding got whacked.

I put this in ...


... to write the file and now have UTF-8 output that is correct and can be correctly read in.
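
The snippet from the original post is not shown here. A minimal sketch of one way to write a String to a file as UTF-8, assuming an explicit charset on the writer; the class, method, and file names are hypothetical:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWriter {
    /** Writes the text to the file, encoding it explicitly as UTF-8. */
    static void writeUtf8(Path file, String xml) throws IOException {
        // Naming the charset here avoids falling back to the platform default.
        try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(xml);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("french", ".xml");
        writeUtf8(file, "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>tr\u00E8s</mot>");
        System.out.println(Files.size(file) + " bytes written to " + file);
    }
}
```

By contrast, a FileWriter constructed without a charset uses the platform default encoding, which on Java versions before 18 is often not UTF-8 on Windows.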

Thank-you. Thank-you. Thank-you.
 