Invalid Character inside CDATA

Posts: 13
I'm parsing an XML file that contains some Japanese text (UTF-8 encoded). During parsing, I received this error:
"An invalid XML character (Unicode: 0xb4) was found in the CDATA section."
Can someone explain how it is possible to have an invalid XML character inside a CDATA section? I believe the only restriction inside a CDATA section is that the content must not include "]]>".
Thank you
Leverager of our synergies
Posts: 10065
Nearly everybody believes so, yet it is a mistake. I think the confusion stems from the many sources of XML wisdom that define a CDATA section as "data that is ignored by the parser". If CDATA is ignored, surely we can put anything there, including binary data?
Nothing in the XML specification suggests that. It says: "CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup." And if you look at how CDATA is defined, you'll see:
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
Where "Char" covers the same range as in any other part of an XML document:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
This means that a CDATA section differs from parsed character data only in that markup inside it is not recognized as such, i.e. not parsed.
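The Char production quoted above can be turned into a small check. What follows is only a sketch: the class and method names (XmlChars, isXmlChar, firstIllegalIndex) are made up for illustration and are not part of any XML library.

```java
// Hypothetical helper, not a library API: tests text against the XML 1.0 Char production.
final class XmlChars {
    private XmlChars() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Index (in UTF-16 units) of the first code point that is not a legal Char, or -1 if all are legal. */
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            // Step over one or two UTF-16 units, depending on the code point.
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes to keep the source ASCII-only.
        System.out.println(firstIllegalIndex("\u65E5\u672C\u8A9E"));       // prints -1
        // A vertical tab (0x0B) is not a legal Char, inside CDATA or anywhere else.
        System.out.println(firstIllegalIndex("bad" + (char) 0x0B + "char")); // prints 3
    }
}
```

Running the same check over the content destined for a CDATA section before writing the document shows where an offending character sits. Note that U+00B4 itself falls inside [#x20-#xD7FF], so this check accepts it; the thread does not establish why the parser rejected it.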
My understanding is that an XML document can "physically" consist of legal characters only; this layer has the highest priority, and higher-level constructs like CDATA have to obey its rules. One way around the rule, if you must include illegal characters, is to encode your data in base64, but this increases the document's size (by roughly a third) and is hardly good design.
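For completeness, here is a rough sketch of the base64 route using java.util.Base64 (Java 8 or later). The class name Base64Payload and the element name "payload" with its "encoding" attribute are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: carries arbitrary bytes inside XML as base64 text.
final class Base64Payload {
    private Base64Payload() {}

    /** Encodes raw bytes; the output uses only A-Z, a-z, 0-9, '+', '/' and '=', all legal XML characters. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Reverses encode(). */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    /** Wraps the encoded bytes in an illustrative element; no CDATA is needed because base64 contains no markup. */
    static String wrapInElement(byte[] raw) {
        return "<payload encoding=\"base64\">" + encode(raw) + "</payload>";
    }

    public static void main(String[] args) {
        // Bytes that would be illegal as raw characters: NUL and vertical tab.
        byte[] raw = {0x00, 0x0B, 'o', 'k'};
        String xml = wrapInElement(raw);
        System.out.println(xml);
        System.out.println(new String(decode(encode("round trip".getBytes(StandardCharsets.UTF_8))),
                StandardCharsets.UTF_8));
    }
}
```

Every 3 input bytes become 4 output characters, which is where the size penalty comes from, and the receiving side has to know to decode the element's content.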
[ May 09, 2002: Message edited by: Mapraputa Is ]
I hired a bunch of ninjas. The fridge is empty, but I can't find them to tell them the mission.