
Xerces Sax not parsing a Unicode char

 
karen obrien
Greenhorn
Posts: 8
My SAX parser (Xerces) fails while parsing and complains about the character Unicode: 0x1d.
I am reading from a file (InputSource) and have set the encoding to UTF-8.
Does anyone have any ideas?
Thanks a million!
 
Mapraputa Is
Leverager of our synergies
Sheriff
Posts: 10065
"This character cannot be used in XML documents"
Zvon.org
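For reference, XML 1.0 restricts document characters to its Char production: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. A minimal Java sketch of that check follows. The class and method names are hypothetical and are not part of Xerces.

```java
/** Sketch of the XML 1.0 "Char" production as a code-point check (hypothetical helper). */
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)           // most of the Basic Multilingual Plane
            || (cp >= 0xE000 && cp <= 0xFFFD)         // skips the surrogate block
            || (cp >= 0x10000 && cp <= 0x10FFFF);     // supplementary planes
    }
}
```

With this check, XmlChars.isLegalXml10(0x1D) is false, which matches the parser's complaint.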
 
karen obrien
Greenhorn
Posts: 8
Is there any way to keep the parser from parsing the data in specified XML elements, without explicitly escaping the illegal character?
Thanks.
 
Mapraputa Is
Leverager of our synergies
Sheriff
Posts: 10065
CDATA sections are not parsed by the parser.
<someTag>
<![CDATA[
some content goes here
]]>
</someTag>
I am not sure illegal symbols are allowed in CDATA sections, though.
What does this symbol represent in your data? If it's part of binary data, you should probably use Base64 encoding to include it in an XML document.
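A rough sketch of the Base64 idea, assuming Java 8 or later so that java.util.Base64 is available. The class and method names are made up for illustration, and the tag name is assumed to be a valid XML name.

```java
import java.util.Base64;

/** Hypothetical helper: carries raw bytes inside an element as Base64 text. */
final class Base64Payload {
    private Base64Payload() {}

    /** Wraps the Base64 form of the bytes in the given tag. */
    static String toElement(String tag, byte[] raw) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return "<" + tag + ">" + encoded + "</" + tag + ">";
    }

    /** Decodes element text back to the original bytes. */
    static byte[] decode(String elementText) {
        return Base64.getDecoder().decode(elementText.trim());
    }
}
```

Base64 output uses only letters, digits, '+', '/' and '=', so the encoded text is always legal XML content. The reader has to call decode on the element text after parsing.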
 
karen obrien
Greenhorn
Posts: 8
Thanks for your help, Mapraputa.
I had already tried enclosing the offending text in a CDATA section, but the parser still complains.
The character itself is: ∝, and I'm sure there must be some way for the parser to avoid parsing it?
Thanks again.
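The CDATA behaviour can be checked with a small reproduction. This is only a sketch: it assumes the JAXP SAX parser bundled with the JDK (which is derived from Xerces) rather than a standalone Xerces jar, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check: does the parser accept a given document? */
final class CdataCheck {
    private CdataCheck() {}

    /** Returns true if parsing succeeds, false if the parser reports a SAX error. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an illegal character
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e); // environment problem, not a document problem
        }
    }
}
```

Calling parses with 0x1d inside plain element content and again inside a CDATA section should return false both times, since CDATA only switches off markup recognition and the character restriction still applies.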
 
karen obrien
Greenhorn
Posts: 8
Having done some research, I discovered that this is a control character, and while it is a valid Unicode character that UTF-8 can encode, it is not a legal character in an XML 1.0 document.
Control characters are in the range U+0000 to U+001F, and most of them are written out as '?'. 0x1d (Group Separator), however, is not escaped, and therefore Xerces cannot parse it.
I have written a util class that escapes control chars in Unicode, and this resolved my problem.
Thanks for all your help.
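The util class itself was not posted, so the following is only a guess at the general approach, not the original code. It assumes the text can be cleaned before it reaches the parser, and it rewrites every character outside the XML 1.0 Char ranges as a visible text escape. All names are hypothetical.

```java
/** Hypothetical sanitizer: makes illegal XML 1.0 characters visible as plain text. */
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** XML 1.0 "Char" production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each illegal code point with a backslash, 'u' and four hex digits. */
    static String escapeIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Illegal code points are control characters, lone surrogates,
                // U+FFFE or U+FFFF, so four hex digits are always enough.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A numeric character reference such as &#x1D; is not an alternative here, because XML 1.0 requires referenced characters to be legal too. Whatever escape is chosen, the consumer of the parsed data has to reverse it.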
 
shekar rakju
Greenhorn
Posts: 15
Hi,
I am facing a similar problem. Could you please share the solution?
Shekar
my id is : pandresatya@indiatimes.com
 