NLS characters lost when storing xml from java to filesystem

Anupam Bhatt
Ranch Hand

Joined: Mar 12, 2004
Posts: 81
Hi,

I am storing an XML document that contains some non-ASCII characters from Java to the filesystem. I specify the document's encoding as UTF-8 and then save it.
But when I read the document back into Java, all the non-ASCII characters are lost and appear as [???].

What can be done in this situation? My guess is that the machine hosting the XML document needs UTF-8 as its default encoding. Is that right? Please comment and advise.

If so, it would be a pain to configure this on every machine where the application runs.

I expect there is an easier solution. Any ideas or help would be appreciated.

One more detail: the input to my API is a Reader, and I read the XML from that Reader. I do not have control over this.
[ June 14, 2007: Message edited by: Anupam Bhatt ]
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183

The Reader needs to know the encoding if UTF-8 isn't the platform default; it won't learn this from the XML file itself. For example, you might use

Reader rdr = new InputStreamReader(new FileInputStream("filename"), "UTF-8");

There's no way to tell FileReader the encoding, alas.
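
For anyone who wants to try the suggestion above, here is a minimal round-trip sketch. It assumes Java 7 or later (for StandardCharsets and try-with-resources); the class name, method names, and temp file are made up for illustration. Both the write and the read name UTF-8 explicitly, so the platform default never comes into play. As a side note, the remark about FileReader reflects the JDK of the time: from Java 11 on, FileReader has constructors that take a Charset.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class ExplicitEncodingDemo {

    // Write the text as UTF-8 bytes, whatever the platform default is.
    public static void writeUtf8(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the file back, decoding the bytes as UTF-8.
    public static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("nls-demo", ".xml");
        tmp.deleteOnExit();
        // Unicode escapes keep this source file itself pure ASCII.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<name>Gr\u00fc\u00dfe</name>";
        writeUtf8(tmp, xml);
        System.out.println("round trip intact: " + xml.equals(readUtf8(tmp)));
    }
}
```

The same idea applies on the writing side: a FileWriter on these older JDKs uses the platform default, so the file may not really be UTF-8 even if its XML declaration says so.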


Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[EFH]: There's no way to tell FileReader the encoding, alas.

True. Though since JDK 5, it's been possible to use a Scanner instead, which allows you to specify an encoding quite easily.
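
A rough sketch of the Scanner route, assuming Java 7 or later for the helper code around it. Scanner(File, String) is the real constructor that takes a charset name; the class name, the readAll helper, and the temp file are hypothetical. The "\\A" delimiter is a common trick to make next() return the whole input in one token.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

// Hypothetical demo class; the names are illustrative only.
class ScannerEncodingDemo {

    // Read the whole file using the named charset.
    public static String readAll(File file, String charsetName) throws IOException {
        try (Scanner sc = new Scanner(file, charsetName)) {
            // \A matches only at the start of input, so the first token
            // is the entire content.
            sc.useDelimiter("\\A");
            String content = sc.hasNext() ? sc.next() : "";
            // Scanner swallows read errors; surface them if one occurred.
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
            return content;
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("scanner-demo", ".xml");
        tmp.deleteOnExit();
        String xml = "<name>Gr\u00fc\u00dfe</name>";
        Files.write(tmp.toPath(), xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("decoded correctly: " + xml.equals(readAll(tmp, "UTF-8")));
    }
}
```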

However, since the goal is to read an XML file, I think it would probably be more useful to use an existing parser, such as Xerces or JDOM. XML parsers are responsible for reading the encoding specified within the document and using it. They also handle many other tasks that are probably more trouble than they're worth to do yourself. There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.
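
To illustrate the point about letting the parser handle the encoding, here is a small sketch. It uses the JAXP DocumentBuilder that ships with the JDK rather than Xerces or JDOM directly, which is my substitution and not something named above; the class and method names are hypothetical. The same text is encoded once as ISO-8859-1 and once as UTF-8, each with a matching XML declaration, and the parser is handed raw bytes both times.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the names are illustrative only.
class ParserEncodingDemo {

    // Parse raw bytes; the parser picks the charset from the XML declaration.
    public static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "Gr\u00fc\u00dfe";
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<greeting>" + text + "</greeting>")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greeting>" + text + "</greeting>")
                .getBytes(StandardCharsets.UTF_8);
        // No charset is passed to the parser in either case.
        System.out.println("ISO-8859-1 ok: " + text.equals(rootText(latin1)));
        System.out.println("UTF-8 ok: " + text.equals(rootText(utf8)));
    }
}
```

Note that nothing in the calling code mentions a charset; the declaration inside each document is what the parser goes by.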


"I'm not back." - Bill Harding, Twister
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

[Jim Yingst]: There's no need to assume that the documents will all use UTF-8; let the document specify it, and let the parser parse it.

That's the ideal way to do it: give the parser a stream of bytes and let it deal with it in the standard XML way. But Anupam said:

[Anupam Bhatt]: Just to mention, this is all in case when the input to my api's is a 'reader' and i read the xml from the reader. I do not have control over this.

And the problem occurs when the Reader has been created with an encoding that conflicts with the document's real encoding. So I would suggest that the design is the problem. It should be fixed by changing the design to accept an InputStream.
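
Here is a sketch of the failure mode described above, again using the JDK's built-in JAXP parser as a stand-in; the class and the two parse methods are hypothetical and only model "an API that takes a Reader" versus "an API that takes an InputStream". The document's bytes are UTF-8, but the Reader is built with ISO-8859-1. Once the bytes have been decoded into characters, the parser can no longer apply the declared encoding, so the damage is already done before it sees the data.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
class ReaderVersusStreamDemo {

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Models an API that accepts a Reader: the caller has already chosen a charset.
    public static String parseFromReader(Reader reader) throws Exception {
        return newBuilder().parse(new InputSource(reader))
                .getDocumentElement().getTextContent();
    }

    // Models an API that accepts an InputStream: the parser chooses the charset.
    public static String parseFromStream(InputStream in) throws Exception {
        return newBuilder().parse(in).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = "caf\u00e9";
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<word>" + text + "</word>").getBytes(StandardCharsets.UTF_8);

        // Wrong charset on the Reader: each UTF-8 byte becomes its own character.
        Reader wrong = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);
        String garbled = parseFromReader(wrong);

        String intact = parseFromStream(new ByteArrayInputStream(utf8));

        System.out.println("reader result intact: " + text.equals(garbled));
        System.out.println("stream result intact: " + text.equals(intact));
    }
}
```

With a Reader in the signature, the only real fix is for whoever creates the Reader to use the right charset, which is exactly the control Anupam says he doesn't have.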

I realize that it's common for lower-level staff to be made to work with "fait accompli" designs like this. Often this results in their producing convoluted work-arounds that were not foreseen by the designers. But I would prefer well-written software to software that places more importance on the office power structure. So: get the design changed.
 
 