JavaRanch » Java Forums » Engineering » XML and Related Technologies
Exception in XML Parsing

dipti khullar
Greenhorn

Joined: Oct 23, 2008
Posts: 6
I am using a DOM parser to parse about 50,000 XML files.

Call to the parse function:


The problem is with special characters. Since our site supports the Arabic language, characters like , cause problems in the parser.
Exception:



I cannot use InputStream instead of new File() because of performance issues.
Is there some other way to set the encoding for the file without opening it? Or how can I remove invalid characters during XML parsing?
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19552
    

dipti khullar wrote: I cannot use InputStream instead of new File() because of performance issues.

Nonsense. Either way the file is read completely, and I just can't believe that passing a File makes that reading much more efficient. In the end, all the bytes still have to be read.

That said, DocumentBuilder has a parse method that takes an InputSource. InputSource can be constructed using a Reader, like a FileReader or InputStreamReader. With the latter you can provide the encoding:
In the end, all you need to do is change the encoding passed to the InputStreamReader.
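A minimal sketch of the InputSource approach described above. The class name EncodedParse and the helper parse are hypothetical, and the charset is whatever you believe the files are really stored in; this is an illustration under those assumptions, not the code from the original post.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, named for illustration only.
class EncodedParse {

    // Decodes the file's bytes with the given charset and hands the
    // resulting character stream to the DOM parser via an InputSource.
    static Document parse(File file, Charset charset) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader =
                new InputStreamReader(new FileInputStream(file), charset)) {
            return builder.parse(new InputSource(reader));
        }
    }
}
```

A call would look like EncodedParse.parse(new File("some.xml"), StandardCharsets.UTF_8), where the file name is a placeholder. Because the parser receives characters rather than bytes, the charset given here is the one that takes effect, so it has to match how the file was actually written.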


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18151
    

The part about InputStream versus File for performance is probably bunk, I agree. But if you pass either of those two things to an XML parser, it will determine the document's encoding from the document itself in the way specified in the XML Recommendation. There's no need to make your own Reader to specify the encoding; in fact this will backfire if you specify the wrong encoding.
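To illustrate the point about the parser working out the encoding by itself, here is a small sketch. The class AutoDetectParse and its helpers are hypothetical names; it writes the same document once as UTF-8 and once as UTF-16 (each with a matching encoding declaration) and parses both by passing the File directly, with no Reader and no encoding supplied by the caller.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, named for illustration only.
class AutoDetectParse {

    // Writes a tiny document to a temp file using the given charset and
    // declares that same encoding in the XML declaration.
    static File write(String declaredEncoding, Charset charset, String text)
            throws Exception {
        File file = File.createTempFile("sample", ".xml");
        file.deleteOnExit();
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding
                + "\"?><greeting>" + text + "</greeting>";
        Files.write(file.toPath(), xml.getBytes(charset));
        return file;
    }

    // Parses the file as raw bytes; the parser determines the encoding
    // from the byte order mark and/or the XML declaration.
    static String rootText(File file) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(file);
        return doc.getDocumentElement().getTextContent();
    }
}
```

Both files should yield the same Arabic text from rootText, which is why wrapping the file in a Reader with a hand-picked encoding is unnecessary when the document declares its encoding correctly.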

The problem is this: the actual error message says

Character reference "&#12 " is an invalid XML character.


This is true, and it has nothing to do with Arabic because that character is nowhere near the Arabic ranges of Unicode. It's the ASCII "form feed" character and it just isn't allowed in an XML document. It's better not to try to delete those characters before you parse the document; the correct approach would be to not put them into the document in the first place. So contact whoever produced the document and explain the problem.
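The failure is easy to reproduce without any Arabic text at all. The following sketch (FormFeedDemo is a hypothetical name) feeds the parser a document containing a character reference to code point 12 and one containing a reference to code point 10, the line feed, which is allowed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, named for illustration only.
class FormFeedDemo {

    // Returns true if the string parses as XML, false if the parser
    // rejects it with a SAXParseException.
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }
}
```

With the parser bundled in the JDK, parses("<a>&#12;</a>") is expected to return false and parses("<a>&#10;</a>") to return true, so the rejection comes from the form feed reference itself and not from the document's encoding.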

The XML Recommendation is published by the W3C, and you might want to read the relevant sections (section 2.2, "Characters", for example) so you understand them before you tell somebody else their XML document is malformed (which it is).
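If you need to point the producer at the exact spot, something like the following rough sketch may help. It encodes the character ranges that section 2.2 of the XML 1.0 Recommendation allows and scans raw text for numeric character references outside them. XmlCharCheck and its methods are hypothetical names, and the scan is deliberately naive: it does not parse the document, so it will also flag matches inside comments or CDATA sections. It only reports a position; it does not delete anything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class XmlCharCheck {

    // Matches decimal (&#12;) and hexadecimal (&#xC;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // True if the code point is a legal character in XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first numeric character reference that
    // names an illegal code point, or -1 if none is found.
    static int firstIllegalCharRef(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        while (m.find()) {
            String digits = m.group(1);
            int codePoint;
            try {
                codePoint = digits.startsWith("x")
                        ? Integer.parseInt(digits.substring(1), 16)
                        : Integer.parseInt(digits);
            } catch (NumberFormatException e) {
                // Too large to be any valid code point.
                return m.start();
            }
            if (!isXmlChar(codePoint)) {
                return m.start();
            }
        }
        return -1;
    }
}
```

For the reference from the error message, isXmlChar(12) is false, while an Arabic letter such as U+0627 falls inside the allowed range [#x20-#xD7FF].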
 