Exception in XML Parsing

dipti khullar

Joined: Oct 23, 2008
Posts: 6
I am using a DOM parser to parse about 50,000 XML files.

Call to the parse function:

The problem is with special characters. Since our site supports the Arabic language, characters like , cause problems in the parser.

I cannot use InputStream instead of new File() because of performance issues.
Is there some other way to set the encoding for the file without opening it? Or how can I remove invalid characters during XML parsing?
Rob Spoor

Joined: Oct 27, 2005
Posts: 20269

dipti khullar wrote: I cannot use InputStream instead of new File() because of performance issues.

Nonsense. Either way the file is still read completely, and I just can't believe that a File is read that much more efficiently than an InputStream. In the end, all bytes still have to be read.

That said, DocumentBuilder has a parse method that takes an InputSource. InputSource can be constructed using a Reader, like a FileReader or InputStreamReader. With the latter you can provide the encoding:
In the end, all you need to do is change the encoding on line 2.
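The snippet from this reply did not survive in the thread. What follows is only a minimal sketch of the approach described, not the original code: the class name ExplicitEncodingParse and the method parse(File, Charset) are hypothetical, and it assumes you already know the file's real encoding.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class ExplicitEncodingParse {

    // Parses the file through a Reader, so the caller's charset is used
    // to decode the bytes instead of whatever the document declares.
    static Document parse(File file, Charset charset) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader =
                new InputStreamReader(new FileInputStream(file), charset)) {
            return builder.parse(new InputSource(reader));
        }
    }
}
```

A call would look like ExplicitEncodingParse.parse(new File("doc.xml"), StandardCharsets.UTF_8), where doc.xml is a placeholder file name.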

Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

The part about InputStream versus File for performance is probably bunk, I agree. But if you pass either of those two things to an XML parser, it will determine the document's encoding from the document itself in the way specified in the XML Recommendation. There's no need to make your own Reader to specify the encoding; in fact this will backfire if you specify the wrong encoding.
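To illustrate the point about encodings, here is a small sketch under stated assumptions: the class DeclaredEncodingDemo and its methods are made up for this example, and the document is a tiny in-memory one that declares ISO-8859-1. Given the raw bytes, the parser reads the declaration and decodes correctly; given a Reader built with the wrong charset, the text comes out damaged.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class DeclaredEncodingDemo {

    // The word "café" stored as ISO-8859-1 bytes, with a matching declaration.
    static final byte[] XML =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><w>caf\u00e9</w>"
                    .getBytes(StandardCharsets.ISO_8859_1);

    // Byte stream: the parser picks the encoding from the XML declaration.
    static String viaStream() throws Exception {
        DocumentBuilder b =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return b.parse(new ByteArrayInputStream(XML))
                .getDocumentElement().getTextContent();
    }

    // Reader with a deliberately wrong charset: the bytes are already
    // decoded (incorrectly) before the parser ever sees them.
    static String viaWrongReader() throws Exception {
        DocumentBuilder b =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource src = new InputSource(new InputStreamReader(
                new ByteArrayInputStream(XML), StandardCharsets.UTF_8));
        return b.parse(src).getDocumentElement().getTextContent();
    }
}
```

With the byte stream the element text should come back as "café"; with the mismatched Reader it should not, which is the backfire described above.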

The problem is this: the actual error message says

Character reference "&#12" is an invalid XML character.

This is true, and it has nothing to do with Arabic because that character is nowhere near the Arabic ranges of Unicode. It's the ASCII "form feed" character and it just isn't allowed in an XML document. It's better not to try to delete those characters before you parse the document; the correct approach would be to not put them into the document in the first place. So contact whoever produced the document and explain the problem.
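If it helps when reporting the problem, a rough diagnostic along these lines could list the offending character references so you can point the producer at them. This is a sketch, not a fix: the class XmlCharCheck and its methods are hypothetical, the range check follows the Char production in section 2.2 of XML 1.0, and the scan is a plain regex over the text, so it would also flag look-alike references inside comments or CDATA sections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, for illustration only.
class XmlCharCheck {

    // Char production from XML 1.0, section 2.2.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Matches decimal (&#12;) and hexadecimal (&#xC;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // Returns one entry per reference whose code point XML 1.0 does not allow.
    static List<String> findInvalidCharRefs(String xmlText) {
        List<String> bad = new ArrayList<>();
        Matcher m = CHAR_REF.matcher(xmlText);
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // value too large to be any code point
            }
            if (!isXml10Char(cp)) {
                bad.add(m.group() + " at offset " + m.start());
            }
        }
        return bad;
    }
}
```

The form feed reference is reported while a reference to an Arabic letter such as &#1587; is not, which matches the point that the error has nothing to do with Arabic.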

The XML Recommendation is here and you might want to read the relevant sections (chapter 2.2 for example) so you understand them before you try telling somebody else their XML document is malformed (which it is).