Exception in XML Parsing

 
Greenhorn
Posts: 6
I am using a DOM parser to parse about 50,000 XML files.

Call to the parse function:


The problem is with special characters. Since our site supports Arabic, characters like , cause problems in the parser.
Exception:



I cannot use InputStream instead of new File() because of performance issues.
Is there some other way to set an encoding for the file without opening it? Or how can I remove invalid characters during XML parsing?
 
Sheriff
Posts: 22001

dipti khullar wrote: I cannot use InputStream instead of new File() because of performance issues.


Nonsense. Either way the file is still read completely, byte for byte, so I just can't believe that passing a File is that much more efficient than passing an InputStream.

That said, DocumentBuilder has a parse method that takes an InputSource. InputSource can be constructed using a Reader, like a FileReader or InputStreamReader. With the latter you can provide the encoding:
In the end, all you need to do is change the encoding on line 2.
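The code originally posted here did not survive, so what follows is only a minimal sketch of the InputSource approach described above, not the original snippet. The class name ReaderParseSketch, the helper parseWithEncoding, and the in-memory sample document are all made up for illustration; for a real file you would wrap a FileInputStream instead of the ByteArrayInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReaderParseSketch {

    // Hypothetical helper: decode the bytes with an explicit encoding,
    // then hand the resulting character stream to the DOM parser.
    static Document parseWithEncoding(InputStream in, String encoding) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, encoding)) {
            return builder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption): five Arabic letters, stored as UTF-8 bytes.
        byte[] xml = "<greeting>\u0645\u0631\u062D\u0628\u0627</greeting>"
                .getBytes(StandardCharsets.UTF_8);
        Document doc = parseWithEncoding(new ByteArrayInputStream(xml), "UTF-8");
        // Prints 5: all five characters survived the round trip.
        System.out.println(doc.getDocumentElement().getTextContent().length());
    }
}
```

The encoding argument is the only thing that would need to change for a different charset. Note that this forces the encoding regardless of what the document itself declares.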
 
Marshal
Posts: 25829
The part about InputStream versus File for performance is probably bunk, I agree. But if you pass either of those two things to an XML parser, it will determine the document's encoding from the document itself in the way specified in the XML Recommendation. There's no need to make your own Reader to specify the encoding; in fact this will backfire if you specify the wrong encoding.
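As a rough illustration of both points, here is a small sketch using an in-memory document rather than a real file. The class DeclaredEncodingSketch, its method names, and the ISO-8859-1 sample are all assumptions made up for this example. Given the raw bytes, the parser picks up the encoding from the XML declaration; given a Reader built with the wrong charset, the text is already damaged before the parser sees it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DeclaredEncodingSketch {

    // Sample document (an assumption): declares ISO-8859-1 and is stored that way.
    static byte[] sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00E9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte stream: the parser reads the XML declaration and decodes accordingly.
    static String textViaStream(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    // Character stream: the caller's charset wins, right or wrong.
    static String textViaReader(byte[] xmlBytes, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset)) {
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sample();
        // Prints true: the declared encoding was honored.
        System.out.println(textViaStream(bytes).equals("caf\u00E9"));
        // Prints false: decoding ISO-8859-1 bytes as UTF-8 mangles the last letter.
        System.out.println(textViaReader(bytes, StandardCharsets.UTF_8).equals("caf\u00E9"));
    }
}
```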

The problem is this: the actual error message says

Character reference "&#12 " is an invalid XML character.



This is true, and it has nothing to do with Arabic because that character is nowhere near the Arabic ranges of Unicode. It's the ASCII "form feed" character and it just isn't allowed in an XML document. It's better not to try to delete those characters before you parse the document; the correct approach would be to not put them into the document in the first place. So contact whoever produced the document and explain the problem.
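To make that concrete, here is a small sketch; the class FormFeedSketch and its helper names are invented for illustration. isXml10Char follows the Char production in section 2.2 of the XML 1.0 Recommendation, and parses simply reports whether the default JDK DOM parser accepts a document. Character 12 is rejected, while an Arabic letter such as U+0645 (decimal 1605) is fine.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FormFeedSketch {

    // XML 1.0 section 2.2:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns true if the document is well-formed, false if the parser rejects it.
    // The parser may also log the fatal error to stderr.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(12));           // false: form feed
        System.out.println(isXml10Char(0x0645));       // true: Arabic letter meem
        System.out.println(parses("<a>&#12;</a>"));    // false
        System.out.println(parses("<a>&#1605;</a>"));  // true
    }
}
```

A check like isXml10Char belongs on the producing side, before the character ever gets written into the document.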

The XML Recommendation is here and you might want to read the relevant sections (section 2.2, for example) so you understand them before you try telling somebody else their XML document is malformed (which it is).
 