Exception in XML Parsing

 
Greenhorn
Posts: 6
I am using a DOM parser to parse about 50,000 XML files.

Call to the parse function:


The problem is with special characters. Since our site supports the Arabic language, characters like , cause problems in the parser.
Exception:



I cannot use an InputStream instead of new File() because of performance issues.
Is there some other way to set an encoding for the file without opening it? Or how can I remove invalid characters while parsing the XML?
 
Sheriff
Posts: 22773

dipti khullar wrote: I cannot use InputStream instead of new File() because of performance issues.


Nonsense. Either way the file is read completely, and I just can't believe that going through File reads it that much more efficiently than an InputStream. Every byte still has to be read.

That said, DocumentBuilder has a parse method that takes an InputSource. InputSource can be constructed using a Reader, like a FileReader or InputStreamReader. With the latter you can provide the encoding:
In the end, all you need to do is change the encoding on line 2.
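As a minimal sketch of the InputSource approach described above (not the original snippet): the class name ReaderParseSketch and the helper parseFile are made up for illustration, and the charset you pass in is an assumption about how the file was actually written. If that assumption is wrong, the text will be decoded incorrectly.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class ReaderParseSketch {

    // Parses the file through a Reader, so the caller decides the encoding.
    // When an InputSource carries a character stream, the parser uses it
    // as-is and does not decode the bytes itself.
    static Document parseFile(File file, Charset charset) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader =
                     new InputStreamReader(new FileInputStream(file), charset)) {
            return builder.parse(new InputSource(reader));
        }
    }
}
```

Calling ReaderParseSketch.parseFile(file, StandardCharsets.UTF_8) would be the UTF-8 case; switching to another encoding means changing only the charset argument.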
 
Marshal
Posts: 28005
The part about InputStream versus File for performance is probably bunk, I agree. But if you pass either of those two things to an XML parser, it will determine the document's encoding from the document itself in the way specified in the XML Recommendation. There's no need to make your own Reader to specify the encoding; in fact this will backfire if you specify the wrong encoding.
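To illustrate the point about the parser working out the encoding by itself, here is a small sketch that hands raw bytes to the parser and lets the XML declaration decide. The class name DeclaredEncodingSketch and the sample document are invented for the example; ISO-8859-1 is used only because every Java runtime is required to support it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, for illustration only.
final class DeclaredEncodingSketch {

    // Parses raw bytes; no Reader and no charset are supplied by the caller,
    // so the parser takes the encoding from the document itself.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    // Sample document: declared as ISO-8859-1 and really encoded that way.
    static byte[] sampleLatin1() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<word>caf\u00e9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The same applies to builder.parse(File): as long as the declaration matches the bytes actually written, there is nothing for the caller to configure.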

The problem is this: the actual error message says

Character reference "&#12" is an invalid XML character.



This is true, and it has nothing to do with Arabic because that character is nowhere near the Arabic ranges of Unicode. It's the ASCII "form feed" character and it just isn't allowed in an XML document. It's better not to try to delete those characters before you parse the document; the correct approach would be to not put them into the document in the first place. So contact whoever produced the document and explain the problem.

The XML Recommendation is here and you might want to read the relevant sections (chapter 2.2 for example) so you understand them before you try telling somebody else their XML document is malformed (which it is).
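For reference, the rule in section 2.2 of XML 1.0 (the Char production) is short enough to write down as a check. This is only a sketch for finding out which characters a document producer should not be emitting, not a filter to run before parsing; the class and method names are made up.

```java
// Hypothetical helper class, for illustration only.
final class XmlCharSketch {

    // True if the code point is allowed by the Char production of XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this check, isXml10Char(12) is false (form feed, the character from the error message), while isXml10Char(0x0627) is true (ARABIC LETTER ALEF), which is why Arabic text is not the culprit here.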
 