
NLS characters lost when storing XML from Java to filesystem

 
Anupam Bhatt
Ranch Hand
Hi,

I am storing an XML document that contains some non-ASCII characters from Java to the filesystem. I specify the XML document's encoding as UTF-8 and then save it to the filesystem.
But when I read the document back into Java, all the non-ASCII characters are lost and appear as [???].

What can be done in such a scenario? My guess is that the machine where the XML document is stored should have UTF-8 as its default encoding. Is that right? Please comment and guide.

If so, it would be a pain to configure this on every machine where the application runs.

I expect there is an easier solution. Any ideas or help would be appreciated.

Just to mention, this all applies to the case where the input to my APIs is a Reader, and I read the XML from that Reader. I do not have control over this.
[ June 14, 2007: Message edited by: Anupam Bhatt ]
 
Ernest Friedman-Hill
author and iconoclast
Marshal
The Reader needs to know the encoding if UTF-8 isn't the platform default; it won't learn this from the XML file itself. For example, you might use

Reader rdr = new InputStreamReader(new FileInputStream("filename"), "UTF-8");

There's no way to tell FileReader the encoding, alas.
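
To make the suggestion above concrete, here is a minimal sketch of a full round trip that names UTF-8 explicitly on both the write and the read side, so the platform default never comes into play. The class name, method names, and scratch file name are illustrative assumptions, not anything from the original code, and StandardCharsets assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
class Utf8RoundTrip {

    // Write the text as UTF-8 bytes, regardless of the platform default encoding.
    static void write(String path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding its bytes as UTF-8.
    static String read(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file independent of the editor's encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Gr\u00fc\u00dfe</name>";
        String path = "roundtrip-example.xml"; // assumed scratch file name
        write(path, xml);
        System.out.println(xml.equals(read(path))); // prints true
    }
}
```

Note that the write side matters as much as the read side: if the file was written with a Writer that used the platform default, the characters may already be lost before any Reader sees them.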
 
Jim Yingst
Wanderer
Sheriff
[EFH]: There's no way to tell FileReader the encoding, alas.

True. Though since JDK 5, it's been possible to use a Scanner instead, which allows you to specify an encoding quite easily.
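
As a rough sketch of the Scanner approach, the snippet below reads an entire stream with an explicitly named charset. The class and method names are assumptions made for illustration. It uses the Scanner(InputStream, String) constructor; Scanner(File, String) works the same way for a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class; names are illustrative only.
class ScannerEncodingExample {

    // Read everything from the stream, decoding with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner sc = new Scanner(in, charsetName)) {
            // \A matches only at the start of input, so next() returns the whole content.
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<name>caf\u00e9</name>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes), "UTF-8");
        System.out.println("<name>caf\u00e9</name>".equals(text)); // prints true
    }
}
```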

However, since the goal is to read an XML file, I think it would probably be more useful to use an existing parser, such as Xerces or JDOM. XML parsers are responsible for reading the encoding specified within the document and using it, as well as for handling many other tasks that are probably more trouble than they're worth to implement yourself. There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.
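
A small sketch of that idea, using the JAXP DocumentBuilder bundled with the JDK instead of Xerces or JDOM directly (that substitution is mine, chosen so the example needs no extra libraries). The document here is deliberately encoded as ISO-8859-1 and says so in its XML declaration; the parser is handed raw bytes and works out the decoding itself. The class and method names are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; names are illustrative only.
class ParserDetectsEncoding {

    // Parse raw bytes; the parser takes the encoding from the XML declaration.
    static String rootText(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        // Encode the bytes to match what the declaration claims.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("caf\u00e9".equals(rootText(new ByteArrayInputStream(bytes)))); // prints true
    }
}
```

The calling code never mentions a charset; the same rootText call would handle a UTF-8 document equally well.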
 
Paul Clapham
Sheriff
Originally posted by Jim Yingst:
"There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it."

That's the ideal way to do it; give the parser a stream of bytes and let it deal with it in the standard XML way. But Anupam said:
"Just to mention, this is all in case when the input to my api's is a 'reader' and i read the xml from the reader. I do not have control over this."

And the problem occurs when the reader has been created with an encoding that conflicts with the document's real encoding. So I would suggest that design is a problem. It should be fixed by changing the design to accept an InputStream.
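
To illustrate the conflict described here, the sketch below parses the same UTF-8 bytes twice: once as an InputStream, and once through a Reader that was created with the wrong charset. It uses the JDK's built-in JAXP parser, and the class and method names are hypothetical. Once a Reader is involved, the bytes have already been decoded, so the parser has no chance to apply the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class ReaderVersusStream {

    // Byte-based entry point: the parser decodes the bytes itself.
    static String parseFromStream(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(in).getDocumentElement().getTextContent();
    }

    // Character-based entry point: decoding already happened inside the Reader.
    static String parseFromReader(Reader in) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(in)).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        // Correct: the parser sees bytes and honours the declared UTF-8.
        String fromStream = parseFromStream(new ByteArrayInputStream(bytes));

        // Conflict: the caller decoded UTF-8 bytes as ISO-8859-1 before parsing.
        String fromReader = parseFromReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));

        System.out.println("stream ok: " + "caf\u00e9".equals(fromStream));
        System.out.println("reader ok: " + "caf\u00e9".equals(fromReader));
    }
}
```

The Reader-based call typically returns garbled text rather than failing, which is why this kind of bug tends to go unnoticed until non-ASCII data shows up.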

I realize that it's common for lower-level staff to be made to work with "fait accompli" designs like this. Often this results in their producing convoluted work-arounds that were not foreseen by the designers. But I would prefer well-written software to software that places more importance on the office power structure. So: get the design changed.
 