Little outline: I've got an XML file, created via some Microsoft ADO function (not my choice!). I would like to parse this file with the Apache SAX parser. However, the parser breaks on certain characters. I've been able to narrow it down to VT (vertical tab, 0x0B) characters that are somehow present in the XML. The SAX parser simply breaks on those:

I tried to get those characters out of the file, but now I get:

Would I need to replace those characters with something else (I already tried a space, which didn't work)? I know it's not strictly a Java problem, but since there's no one else who can help, maybe I can hire...

Update: Curiously enough, if I read the original file, Java regards it as a (Windows) Cp1252 encoded file, while the Microsoft specs promise UTF-8 encoding...

[ April 18, 2002: Message edited by: Thomas Goorden ]
Jim Yingst
Wanderer
Sheriff
Joined: Jan 30, 2000
Posts: 18652
Those characters are illegal in XML documents, and a conforming XML parser is required to report an error on them. You'll have to read the file through another mechanism first, and remove or replace all the offending characters. You can work out the complete list of illegal characters from the XML specification.

"Update: Curiously enough, if I read the original file, Java regards it as a (Windows) Cp1252 encoded file, while the Microsoft specs promise UTF-8 encoding..."

If you read a file in Java with a FileReader or another common mechanism, and you do not specify the encoding (typically by using an InputStreamReader), the JVM will assume the system default encoding, which is Cp1252 in your case.
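As a rough sketch of the "read it through another mechanism first" idea, something like the following might work. It assumes the file really is UTF-8 with no byte order mark and is small enough to hold in memory. The class and method names (XmlCharFilter, stripIllegalXmlChars) are made up for illustration, and it uses java.nio APIs that are newer than this thread. The legal ranges are taken from the Char production of XML 1.0.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
public class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    public static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every illegal code point dropped.
    public static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java XmlCharFilter <file.xml>");
            return;
        }
        // Decode explicitly as UTF-8 (an assumption about the file)
        // instead of relying on the platform default such as Cp1252.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String raw = new String(bytes, StandardCharsets.UTF_8);
        String cleaned = stripIllegalXmlChars(raw);

        // Hand the cleaned text to the SAX parser; DefaultHandler ignores
        // all events, so replace it with a real handler.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(cleaned)), new DefaultHandler());
    }
}
```

Note that this only removes raw characters. A numeric character reference such as &#11; is just as illegal in XML 1.0 and would still make the parser fail, so it would need separate handling.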
James Swan
Ranch Hand
Joined: Jun 26, 2001
Posts: 403
There was a thread posted in the XML forum that was similar to your issue. You could check out some code there.
Jim Yingst
Heh. That thread was also originally posted by Thomas Goorden, and replied to by me, and then by you. Just like this one. I'm closing this thread - follow up here.