
UTF8 character soup

Thomas Goorden
Ranch Hand

Joined: Aug 15, 2001
Posts: 39
Little outline:
I've got an XML file, created via some Microsoft ADO function (not my choice!). I would like to parse this file with the Apache SAX parser. However, the parser breaks on certain characters. I've been able to narrow it down to VT (vertical tab, 0x0B) characters that are somehow present in the XML. The SAX parser simply breaks on those:
I tried to get those characters out of the file, but now I get:
Would I need to replace those characters by something else (I already tried a space, didn't work)?
I know it's not strictly a Java problem, but since there's no one else who can help, maybe I can hire...
Update: Curiously enough, if I read the original file, Java regards it as a (Windows) Cp1252 encoded file, while the Microsoft specs promise UTF-8 encoding...
[ April 18, 2002: Message edited by: Thomas Goorden ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18652
Those characters are illegal in XML documents, and a proper XML parser is required to throw an error on them. You'll have to read the file through another mechanism first, and replace all the offending characters. You can get a complete list of illegal chars from the XML specification.
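As a rough sketch of the "replace all the offending characters" step: the filter below keeps only code points that fall inside the Char production of the XML 1.0 specification (tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF) and drops everything else, including the vertical tab. The class and method names are made up for illustration, and it assumes the file content has already been decoded into a String.

```java
class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    // (Hypothetical helper; substitute a replacement character instead
    // of dropping if the data needs a visible marker.)
    static String stripIllegalXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A vertical tab (0x0B) embedded in otherwise valid XML text.
        String dirty = "<a>one" + (char) 0x0B + "two</a>";
        System.out.println(stripIllegalXmlChars(dirty)); // prints <a>onetwo</a>
    }
}
```

The cleaned string can then be handed to the parser, for example wrapped in a StringReader inside an InputSource.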
Update: Curiously enough, if I read the original file, Java regards it as a (Windows) Cp1252 encoded file, while the Microsoft specs promise UTF-8 encoding...
If you read any file in Java with a FileReader or other common mechanisms, and you do not specify the encoding used (typically using an InputStreamReader) the JVM will assume the system default encoding, which is Cp1252 in your case.
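To illustrate the point about specifying the encoding, here is a minimal sketch that wraps an InputStream in an InputStreamReader with an explicit charset, so the platform default (Cp1252 in this case) never comes into play. The class name and the readAll helper are invented for this example, and it uses StandardCharsets, which is only available in Java 7 and later; on an older JDK the charset name "UTF-8" can be passed to the InputStreamReader constructor instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingRead {

    // Decodes the whole stream using the given charset, not the JVM default.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Read the file named on the command line as UTF-8.
            System.out.println(readAll(new FileInputStream(args[0]), StandardCharsets.UTF_8));
        } else {
            // No file given: decode a two-byte UTF-8 sequence (U+00E9) from memory.
            byte[] sample = {(byte) 0xC3, (byte) 0xA9};
            String text = readAll(new ByteArrayInputStream(sample), StandardCharsets.UTF_8);
            System.out.println(text.length()); // prints 1: two bytes, one character
        }
    }
}
```

A plain FileReader constructed from just a file name gives no way to choose the charset, which is why the file appeared to be Cp1252.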


"I'm not back." - Bill Harding, Twister
James Swan
Ranch Hand

Joined: Jun 26, 2001
Posts: 403
There was a thread posted in the XML forum that was similar to your issue.
You could check out some code there.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18652
Heh. That thread was also originally posted by Thomas Goorden, and replied to by me, and then by you. Just like this one. I'm closing this thread - follow up here.
 
 