
Xerces Sax not parsing a Unicode char

karen obrien
Greenhorn

Joined: Apr 01, 2002
Posts: 8
My SaxParser (xerces) is failing when parsing, complaining about Unicode: 0x1d.
I am reading from a file (InputSource), and have set the encoding to UTF-8.
Anyone have any ideas?
Thanks a million!
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
"This character cannot be used in XML documents"
Zvon.org
karen obrien
Greenhorn

Joined: Apr 01, 2002
Posts: 8
Is there any way to not parse data in specified xml elements? Without explicitly escaping the illegal character....
Thanks.
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
CDATA sections are not parsed by the parser.
<someTag>
<![CDATA[
some content goes here
]]>
</someTag>
I am not sure illegal symbols are allowed in CDATA sections, though.
What does this symbol represent in your data? If it's part of binary data, you should probably use Base64 encoding to include it in an XML document.
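A minimal sketch of the Base64 route, assuming Java 8 or later (for java.util.Base64). The class name Base64Payload and the sample payload are made up for illustration; they are not from the original data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Payload {
    // Encode arbitrary bytes so that only plain ASCII letters, digits,
    // '+', '/' and '=' end up in the XML text.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode the element text back into the original bytes.
    // trim() drops whitespace around the value, which the basic decoder rejects.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text.trim());
    }

    public static void main(String[] args) {
        // Hypothetical payload containing the Group Separator (0x1D).
        byte[] raw = ("field1" + (char) 0x1D + "field2").getBytes(StandardCharsets.UTF_8);
        String xml = "<someTag>" + encode(raw) + "</someTag>";
        System.out.println(xml);
    }
}
```

The reader of the document has to know that the element holds Base64 and call decode on its text, so this only fits when you control both ends.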
karen obrien
Greenhorn

Joined: Apr 01, 2002
Posts: 8
Thanks for your help, Mapraputa.
I had already tried enclosing the offending text in a CDATA section, but the parser still complains.
The character itself is: ∝, and surely there must be some way to keep the parser from parsing it?
Thanks again.
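For anyone who wants to reproduce this: the XML 1.0 character rules apply inside CDATA sections too, so wrapping the text does not help. Below is a small sketch using the JAXP SAX parser that ships with the JDK (assumed here to behave like the Xerces build in the question). The class name CdataControlCharDemo and the sample documents are invented for the demo.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class CdataControlCharDemo {
    // Returns true if the SAX parser accepts the document,
    // false if it reports a fatal parse error.
    static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so invalid characters land here.
            System.out.println("Rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Markup characters such as '<' are fine inside CDATA.
        String plain = "<someTag><![CDATA[a < b]]></someTag>";
        // The Group Separator (0x1D) is not, CDATA or no CDATA.
        String withGs = "<someTag><![CDATA[a" + (char) 0x1D + "b]]></someTag>";
        System.out.println(parses(plain));
        System.out.println(parses(withGs));
    }
}
```

CDATA only switches off markup recognition; it does not widen the set of characters a document may contain.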
karen obrien
Greenhorn

Joined: Apr 01, 2002
Posts: 8
After some research, I discovered that this is a control character. While it is a legitimate Unicode character (and encodes fine in UTF-8), it is not a valid character in an XML 1.0 document.
Control characters are in the range U+0000 to U+001F, and most of them are written out as '?'. 0x1D (Group Separator), however, is not escaped, and therefore Xerces cannot parse it.
I have written a util class that escapes control chars in Unicode, and this resolved my problem.
Thanks for all your help.
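The util class itself was not posted, so the following is only a rough sketch of what such a helper could look like, not the original code. It assumes the text is cleaned before it reaches the parser, and the class name XmlCharFilter and the [U+XXXX] marker format are arbitrary choices. Note that a numeric reference such as &#x1D; is not an option in XML 1.0, because character references must also point to legal characters.

```java
class XmlCharFilter {
    // True if the code point is allowed by the XML 1.0 Char production:
    // tab, LF, CR, and everything from U+0020 up except surrogates, U+FFFE and U+FFFF.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point XML 1.0 forbids with a visible marker such as [U+001D].
    // Markup characters like '<' and '&' are left untouched.
    static String escapeInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("[U+%04X]", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "field1" + (char) 0x1D + "field2";
        System.out.println(escapeInvalid(dirty)); // prints field1[U+001D]field2
    }
}
```

If the control characters carry no meaning in your data, appending nothing in the else branch drops them instead. Either way, whoever reads the parsed text has to agree on the marker convention.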
shekar rakju
Greenhorn

Joined: Nov 24, 2003
Posts: 15
Hi,
I am facing a similar problem. Could you please share the solution?
Shekar
my id is : pandresatya@indiatimes.com
 