Invalid Character inside CDATA

Donny Wi

Joined: Jan 24, 2002
Posts: 13
I'm parsing an XML file that contains some Japanese text (UTF-8 characters). During parsing, I received an error that says
"An invalid XML character (Unicode: 0xb4) was found in the CDATA section."
Can someone explain how it is possible to have an invalid XML character inside a CDATA section? I believed the only restriction inside a CDATA section was that the content cannot include "]]>".
Thank you

Donny Widjaja
Mapraputa Is
Leverager of our synergies

Joined: Aug 26, 2000
Posts: 10065
Everybody believes so, yet it is a mistake. I think the confusion stems from the many, many sources of XML wisdom that define a CDATA section as "data that is ignored by the parser". If CDATA is ignored, then surely we can put anything there, including binary data?
Nothing in the XML specification suggests this. "CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup." And if you look at how CDATA is defined, you'll see:
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
Where "Char" has the same range as in any other part of an XML document:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
This means that CDATA differs from parsed character data only in that markup inside it is not recognized as such, i.e. it is not parsed.
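To make the Char rule concrete, here is a minimal Java sketch that tests code points against the XML 1.0 Char ranges quoted above and finds the first offending position in a string. The class and method names (XmlChars, isXmlChar, firstInvalidIndex) are made up for illustration and are not part of any parser API.

```java
final class XmlChars {
    /** True if the code point falls inside the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that is not a legal XML Char, or -1 if all are legal. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // an unpaired surrogate comes back as itself
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);      // step over surrogate pairs as one code point
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x0B));                // vertical tab: false
        System.out.println(isXmlChar(0xB4));                // U+00B4: true
        System.out.println(firstInvalidIndex("ab\u0001c")); // 2
    }
}
```

Running such a check over text before writing it into a CDATA section shows whether the content itself is the problem. Note that U+00B4 is inside the legal range, so an error that names 0xb4 may point to the bytes being decoded with the wrong encoding rather than to a forbidden code point; that is a guess, not something this thread confirms.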
My understanding is that an XML document can "physically" consist of legal characters only; this layer has the highest priority, and higher-level constructs like CDATA have to obey its rules. One way around this rule, if you must include illegal characters, is to encode your data in base64, but this will increase the document's size, violate all good design rules, etc.
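A rough sketch of the base64 workaround, assuming Java 8 or later for java.util.Base64. The class name Base64Payload and the element name payload are hypothetical, chosen only for this example.

```java
import java.util.Arrays;
import java.util.Base64;

final class Base64Payload {
    /** Encodes arbitrary bytes as base64 and wraps them in a (hypothetical) <payload> element. */
    static String toElement(byte[] raw) {
        return "<payload>" + Base64.getEncoder().encodeToString(raw) + "</payload>";
    }

    /** Extracts and decodes the base64 text produced by toElement. */
    static byte[] fromElement(String element) {
        int start = "<payload>".length();
        int end = element.length() - "</payload>".length();
        return Base64.getDecoder().decode(element.substring(start, end));
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x0B, (byte) 0xFF};   // 0x00 and 0x0B are not legal XML Chars
        String xml = toElement(raw);
        System.out.println(xml);                  // <payload>AAv/</payload>
        System.out.println(Arrays.equals(raw, fromElement(xml))); // true
    }
}
```

The base64 alphabet contains only letters, digits, "+", "/" and "=", so the encoded text needs no CDATA section or escaping at all. The cost is the size increase mentioned above: every 3 input bytes become 4 output characters.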
[ May 09, 2002: Message edited by: Mapraputa Is ]

Uncontrolled vocabularies
"I try my best to make *all* my posts nice, even when I feel upset" -- Philippe Maquet