I have an XML file that I created by serializing a Document object with Xerces, specifying "UTF-8" as the encoding in the OutputFormat.
It contains French text, and when I open the file in the TextPad editor it looks right. In TextPad's binary view mode I can see the accented French chars as the bytes E8, E9 and EA. The XML declaration, as serialized from the Document object, is "<?xml version="1.0" encoding="UTF-8"?>".
So my assumption is the XML file is correctly UTF-8. But is it? Should the accented chars be represented as 00 E8, 00 E9 and 00 EA as defined by the UTF-8 spec for all code points above 7F?
The baffling part comes when I try to read this file back in. I use this code ...
or, alternatively, this way ...
In either case, all I get in the Eclipse console is text with a bunch of question marks where the accented French chars should be.
I am baffled as to WHY I can't get UTF-8 to successfully round trip.
For completeness I give the XML serialization code below ...
I seem to have considered UTF-8 encoding all through the code, but for some reason ... it doesn't work.
Many thanks to anyone who can shed light on what I am doing wrong.
Your InputStreamReader has the correct encoding specified. But I looked all through your code and I couldn't find where you wrote your XML to a file. Declaring the encoding in your serializer is all very well but if you follow up by writing its output in some other encoding then you have shot yourself in the foot.
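To make the mismatch concrete, here is a small sketch (the class name and the `hex` helper are made up for illustration). It prints the bytes for è, é and ê in UTF-8 and in ISO-8859-1. The single bytes E8, E9 and EA you saw in TextPad's binary view are what a Latin-1 style encoding produces; in UTF-8 each of these characters takes two bytes, so reading those single bytes back as UTF-8 cannot recover the text.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Format bytes as space-separated upper-case hex, e.g. "C3 A8".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E8\u00E9\u00EA"; // the characters è, é, ê

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8:      " + hex(utf8));   // C3 A8 C3 A9 C3 AA
        System.out.println("ISO-8859-1: " + hex(latin1)); // E8 E9 EA

        // Decoding the single-byte form as UTF-8 yields replacement
        // characters instead of the original text.
        String misread = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("Round trip ok? " + misread.equals(text)); // false
    }
}
```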
When I'm processing XML I prefer to give the parser the input in bytes and let the serializer write bytes to the target. That way they can handle the encoding themselves, which they do properly.
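A rough sketch of that bytes-in, bytes-out approach follows. It uses the standard JAXP classes rather than the Xerces serializer from the original post, and the class and method names are invented for the example. The parser picks up the encoding from the XML declaration in the bytes, and the Transformer encodes the output itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlByteRoundTrip {

    // Hand the parser raw bytes; it reads the encoding from the XML declaration.
    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    // Let the serializer write bytes itself, so no intermediate Writer
    // can re-encode the output with a platform default charset.
    static byte[] serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>\u00E9t\u00E9</mot>";
        Document doc = parse(xml.getBytes(StandardCharsets.UTF_8));

        byte[] written = serialize(doc);
        Document again = parse(written);

        String text = again.getDocumentElement().getTextContent();
        System.out.println("Round trip ok? " + text.equals("\u00E9t\u00E9"));
    }
}
```

For real files, the byte array streams would be swapped for a FileInputStream and a FileOutputStream; the point is that no Reader or Writer sits between the XML code and the bytes.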
Wow! That was it. The XML string coming out of the parser was correct but when I was writing it to a file, the encoding got whacked.
I put this in ...
... to write the file and now have UTF-8 output that is correct and can be correctly read in.
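For anyone landing here later, a minimal sketch of one way to write an XML string to a file with an explicit encoding. This is an illustration and not necessarily the code used above; the class and method names are hypothetical. The key point is naming the charset on the OutputStreamWriter instead of relying on the platform default.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileWrite {

    // Write the string as UTF-8 bytes, regardless of the platform default charset.
    static void writeUtf8(Path target, String xml) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
    }

    // Read the file back, decoding its bytes as UTF-8.
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>\u00E9t\u00E9</mot>";
        Path file = Files.createTempFile("utf8-demo", ".xml");
        try {
            writeUtf8(file, xml);
            System.out.println("Round trip ok? " + readUtf8(file).equals(xml));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```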