baffling UTF-8 problem

 
Greenhorn
Posts: 25
I have an XML file that I created by serializing a Document object with Xerces, specifying an OutputFormat encoding of "UTF-8".

It contains French text, and when I open the file in the TextPad editor it looks right. In TextPad's binary view I can see the accented French characters as the bytes E8, E9 and EA. The XML header, as serialized from the Document object, is "<?xml version="1.0" encoding="UTF-8"?>".

So my assumption is that the XML file is correctly UTF-8. But is it? Should the accented characters be represented as 00 E8, 00 E9 and 00 EA, as defined by the UTF-8 spec for all code points above 7F?
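
On the byte question: UTF-8 does not prefix these characters with a 00 byte. Code points from U+0080 to U+07FF are encoded as two bytes, so the three accented letters become C3 A8, C3 A9 and C3 AA. Single bytes E8, E9 and EA are what a one-byte encoding such as ISO-8859-1 or Windows-1252 would produce. A minimal sketch to check this (the class and method names are made up for illustration, not taken from the post):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AccentBytes {
    // Returns the bytes of 'text' in 'charset' as space-separated hex pairs.
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-grave, e-acute, e-circumflex, written as escapes so the
        // source file's own encoding cannot affect the result.
        String accents = "\u00E8\u00E9\u00EA";
        System.out.println("UTF-8:      " + hex(accents, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + hex(accents, StandardCharsets.ISO_8859_1));
    }
}
```

This prints `C3 A8 C3 A9 C3 AA` for UTF-8 and `E8 E9 EA` for ISO-8859-1, which suggests the file seen in the binary view was not actually written as UTF-8, whatever its header declares.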

The baffling part comes when I try to read the file back in. I use this code ...


or alternately in this way ...



In either case, all I get in the Eclipse console is text with question marks where the accented French characters should be.
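
One way this symptom can arise, sketched here under the assumption that the file really holds single-byte E8/E9/EA values: a lone byte in that range is not a valid UTF-8 sequence, so a UTF-8 decoder substitutes the replacement character U+FFFD, which many consoles then render as a question mark. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Decodes raw bytes as UTF-8. Java's String constructor replaces
    // malformed input with U+FFFD instead of throwing.
    public static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "tres" with an e-grave, encoded as ISO-8859-1: 74 72 E8 73
        byte[] latin1 = { 't', 'r', (byte) 0xE8, 's' };
        String decoded = decodeAsUtf8(latin1);
        // The accented letter is gone; a replacement character took its place.
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
        System.out.println("contains e-grave: " + (decoded.indexOf('\u00E8') >= 0));
    }
}
```

So a reader that correctly asks for UTF-8 can still show garbage if the bytes on disk were written in some other encoding.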

I am baffled as to WHY I can't get UTF-8 to successfully round trip.

For completeness I give the XML serialization code below ...



I seem to have accounted for UTF-8 encoding throughout the code, but for some reason it doesn't work.

Many thanks to anyone who can shed light on what I am doing wrong.

Kind regards,
Joe
 
Marshal
Posts: 28177
Your InputStreamReader has the correct encoding specified. But I looked all through your code and I couldn't find where you wrote your XML to a file. Declaring the encoding in your serializer is all very well but if you follow up by writing its output in some other encoding then you have shot yourself in the foot.

When I'm processing XML I prefer to give the parser the input in bytes and let the serializer write bytes to the target. That way they can handle the encoding themselves, which they do properly.
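
A minimal sketch of that bytes-in, bytes-out approach. It uses the standard JAXP classes (`DocumentBuilderFactory`, `Transformer`) rather than the Xerces serializer from the original post, and the class and method names are made up for illustration. In a real program the byte arrays would typically be a `FileInputStream` and a `FileOutputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class BytesInBytesOut {
    // Hands the parser raw bytes; it works out the encoding from the
    // XML declaration itself, so no Reader (and no charset guess) is involved.
    public static Document parse(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
    }

    // Lets the serializer write bytes directly; the declared encoding and
    // the bytes actually produced cannot drift apart.
    public static byte[] serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>\u00E9t\u00E9</mot>";
        Document doc = parse(source.getBytes(StandardCharsets.UTF_8));
        Document again = parse(serialize(doc));
        String text = again.getDocumentElement().getTextContent();
        System.out.println("round trip ok: " + text.equals("\u00E9t\u00E9"));
    }
}
```

The point of the design is that no `String`-to-file step sits between the serializer and the disk, so there is no place for a platform-default encoding to slip in.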
 
Joe Simone
Greenhorn
Posts: 25
Wow! That was it. The XML string coming out of the parser was correct but when I was writing it to a file, the encoding got whacked.

I put this in ...


... to write the file and now have UTF-8 output that is correct and can be correctly read in.
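
The snippet itself did not survive in this copy of the thread. As a rough sketch of one common way to do this, assuming the XML is already held in a `String`: wrap the output stream in an `OutputStreamWriter` with an explicit UTF-8 charset instead of relying on the platform default. The class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileWrite {
    // Writes text with an explicit charset, so the bytes on disk match
    // an encoding="UTF-8" declaration regardless of the platform default.
    public static void writeUtf8(Path target, String xml) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
    }

    // Reads the file back with the same explicit charset.
    public static String readUtf8(Path source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                Files.newInputStream(source), StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".xml");
        try {
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><mot>\u00E9t\u00E9</mot>";
            writeUtf8(file, xml);
            System.out.println("round trip ok: " + readUtf8(file).equals(xml));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```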

Thank you. Thank you. Thank you.
 