I am using a DOM parser to parse about 50000 XML files.
Call to the parse function:
The problem is with special characters. Since our site supports Arabic, characters like , cause problems in the parser.
I cannot use InputStream instead of new File() because of performance issues.
Is there some other way to set an encoding for the file without opening it? Or how can I remove invalid characters while parsing the XML?
dipti khullar wrote: I cannot use InputStream instead of new File() because of performance issues.
Nonsense. In the end, the file is still completely read, and I just can't believe that the file is read that much more efficiently. In the end, all bytes still have to be read.
That said, DocumentBuilder has a parse method that takes an InputSource. InputSource can be constructed using a Reader, like a FileReader or InputStreamReader. With the latter you can provide the encoding:
In the end, all you need to do is change the encoding on line 2.
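A minimal sketch of that approach, assuming the file's real encoding is known in advance. The class name ExplicitEncodingParse and the encoding parameter are illustrative, not taken from the original post. Note that when the parser is handed a Reader, it trusts the Reader and ignores the encoding declared in the document.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses a file while forcing a caller-supplied encoding.
final class ExplicitEncodingParse {

    static Document parse(File file, String encoding)
            throws IOException, SAXException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The Reader decodes the bytes, so the parser only ever sees characters.
        try (Reader reader = new InputStreamReader(new FileInputStream(file), encoding)) {
            return builder.parse(new InputSource(reader));
        }
    }
}
```

Usage would look like ExplicitEncodingParse.parse(new File("some.xml"), "UTF-8"), where the file name is a placeholder.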
The part about InputStream versus File for performance is probably bunk, I agree. But if you pass either of those two things to an XML parser, it will determine the document's encoding from the document itself in the way specified in the XML Recommendation. There's no need to make your own Reader to specify the encoding; in fact this will backfire if you specify the wrong encoding.
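To illustrate the point about letting the parser work out the encoding itself, here is a small sketch that passes the File straight to DocumentBuilder.parse. The class name AutoDetectParse is made up for the example; the parser is expected to pick the encoding from the byte order mark or the XML declaration.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: no Reader and no explicit charset anywhere.
final class AutoDetectParse {

    static Document parse(File file)
            throws IOException, SAXException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The parser reads raw bytes and detects the encoding from the document.
        return builder.parse(file);
    }
}
```

With this form there is no encoding argument to get wrong, which is the risk described above when wrapping the file in a Reader.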
The problem is this: the actual error message says
Character reference "
" is an invalid XML character.
This is true, and it has nothing to do with Arabic because that character is nowhere near the Arabic ranges of Unicode. It's the ASCII "form feed" character and it just isn't allowed in an XML document. It's better not to try to delete those characters before you parse the document; the correct approach would be to not put them into the document in the first place. So contact whoever produced the document and explain the problem.
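If it helps when reporting the problem to whoever produces the documents, the following is a rough diagnostic sketch that lists numeric character references falling outside the Char production of XML 1.0. It reports offenders rather than deleting them. The names XmlCharCheck, isXmlChar and findInvalidCharRefs are invented for this example, and the regular expression is a simplification: it does not understand CDATA sections or comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of any XML library.
final class XmlCharCheck {

    // Matches decimal (&#12;) and hexadecimal (&#xC;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // Char production from XML 1.0, section 2.2.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns every character reference whose code point is not a legal XML character.
    static List<String> findInvalidCharRefs(String xml) {
        List<String> bad = new ArrayList<>();
        Matcher m = CHAR_REF.matcher(xml);
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, so treat it as invalid
            }
            if (!isXmlChar(cp)) {
                bad.add(m.group());
            }
        }
        return bad;
    }
}
```

For instance, the form feed (code point 12) fails isXmlChar, while an Arabic letter such as U+0645 passes.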
The XML Recommendation is available from the W3C, and you might want to read the relevant sections (section 2.2, "Characters", for example) so you understand them before you try telling somebody else their XML document is malformed (which it is).