Clarification about CDATA

Sanjay Mishra
Ranch Hand

Joined: Jul 08, 2000
Posts: 84
Hello,
The following line is the only line of a DTD file.
<!ELEMENT test (#CDATA) >
When I try to check the validity or well-formedness of this file
using XML Spy I get the error "this file is not well formed % expected".
If I change CDATA to PCDATA the error goes away.
Does it mean that elements can contain only PCDATA or other elements (children), but not CDATA?
Please explain.
Sanjay
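The error can be reproduced outside XML Spy. The following is only a minimal sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree (which sits on the non-validating expat parser), it moves the declaration into an internal DTD subset so that the parser actually reads it, and the helper name declaration_is_accepted is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def declaration_is_accepted(content_model: str) -> bool:
    """Return True if a document whose internal DTD subset declares
    <test> with the given content model can be parsed.

    Hypothetical helper for illustration only. The parser is
    non-validating, so this checks the syntax of the declaration,
    not the validity of the document against it.
    """
    document = (
        "<!DOCTYPE test [\n"
        f"  <!ELEMENT test {content_model}>\n"
        "]>\n"
        "<test>some text</test>"
    )
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for model in ("(#PCDATA)", "(#CDATA)"):
        verdict = "accepted" if declaration_is_accepted(model) else "rejected"
        print(model, verdict)
```

#CDATA is simply not a keyword that the grammar for element declarations knows about, so the parser gives up on the declaration itself, before any document content is looked at.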

Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Sanjay, it's a point of confusion in DTDs, because DTD terminology is ambiguous (in my possibly wrong opinion).
We have two concepts of character data in XML:
PCDATA - Parsable Character Data
CDATA - data that are not parsed by the parser. To identify such data in document content we use a CDATA section:
<![CDATA[
< & and all other forbidden symbols can appear here
]]>
Now, in a DTD we have the #PCDATA and CDATA keywords, which in my understanding have little to do with the two previous concepts, since they both mean parsable data.
In an element declaration we can use only the #PCDATA keyword:
<!ELEMENT someElement (#PCDATA)>
In an attribute declaration we use only the CDATA keyword:
<!ATTLIST someElement someAttribute CDATA #IMPLIED >
We cannot use the < or & symbols in an attribute declared as CDATA, in spite of the fact that the attribute type gives a wrong hint that we can.
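The difference between a CDATA section and an attribute declared as CDATA can be tried out with a few lines of code. This is a rough sketch, assuming Python's standard xml.dom.minidom; the helper name parses and the sample values are invented for the illustration, while the two declarations are the ones given above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The two declarations from the post, placed in an internal DTD subset.
DTD = """<!DOCTYPE someElement [
  <!ELEMENT someElement (#PCDATA)>
  <!ATTLIST someElement someAttribute CDATA #IMPLIED>
]>
"""


def parses(document: str) -> bool:
    """Hypothetical helper: True if expat accepts the document as well-formed."""
    try:
        minidom.parseString(document)
        return True
    except ExpatError:
        return False


# Inside a CDATA section, < and & may appear literally.
cdata_section = DTD + "<someElement><![CDATA[a < b && c]]></someElement>"

# In an attribute declared as CDATA they still have to be escaped.
raw_attribute = DTD + '<someElement someAttribute="a < b"/>'
escaped_attribute = DTD + '<someElement someAttribute="a &lt; b &amp; c"/>'

if __name__ == "__main__":
    print("CDATA section with raw < and &:", parses(cdata_section))
    print("CDATA attribute with raw <:", parses(raw_attribute))
    print("CDATA attribute with escapes:", parses(escaped_attribute))
```

The escaped form is what the application finally sees as plain text: reading someAttribute back through the DOM gives the value with the references already resolved.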
Sorry for the delay in answering.

Uncontrolled vocabularies
"I try my best to make *all* my posts nice, even when I feel upset" -- Philippe Maquet
Meadowlark Bradsher
Ranch Hand

Joined: Jan 23, 2001
Posts: 109
Mapraputa, I believe PCDATA stands for "parsed character data" not "parseable character data". I'm a self-admitted novice with XML but what that slight difference in terminology suggests to me (because it is in the past tense) is that the parser need not parse it because it is already parsed, perhaps by a human. I know it seems strange to define it that way, but PCDATA actually isn't parsed after creation except to find the end tag (or the start tag of another element in the case of mixed content). On the other hand, you can say that it is actually parsed, so that it is separated from the elements that contain it and yet still defined by that surrounding context. But it is parsed only to that extent, and furthermore elements, attributes and general entities are not identified as parsed character data. Only the text is.
I'm just guessing about the meaning of it but the definition for PCDATA, I find, is in the past tense.
It is an annoyance that CDATA is a synonym for the most basic Attribute Type and also at the same time as a section of unparseable code. There must be some reason for that because the specification seems carefully designed. Perhaps the fact that neither have value to the parser (except as to establish uniqueness) may be the cause. To an application that uses a parser, attributes may have all kinds of meaning but in parsing itself they don't affect the DOM the same way that elements do. Attributes like all else are Nodes but as the DOM specification says ( http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-637646024 ) about attributes also known as "the Attr Interface", for the purpose of the DOM...
"Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. Thus, the Node attributes parentNode, previousSibling, and nextSibling have a null value for Attr objects. The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with; this should make it more efficient to implement such features as default attributes associated with all elements of a given type."
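What the quoted passage says about Attr objects can be seen directly in a DOM implementation. A small sketch, assuming Python's standard xml.dom.minidom as the DOM; the sample document and variable names are made up for the illustration.

```python
from xml.dom import minidom

doc = minidom.parseString('<name lang="en">Meadowlark</name>')
element = doc.documentElement
attr = element.getAttributeNode("lang")

# The attribute is reachable from the element it describes...
print(attr.name, "=", attr.value)
print("owner element:", attr.ownerElement.tagName)

# ...but it is not one of that element's child nodes, so the
# tree-navigation properties have no value for it.
print("number of child nodes:", len(element.childNodes))
print("parentNode:", attr.parentNode)
print("previousSibling:", attr.previousSibling)
print("nextSibling:", attr.nextSibling)
```

The only child of the element here is the text node; the attribute hangs off the element as a property, reachable through ownerElement rather than through the tree.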
Elements may be nested or side by side, appear multiple times or only once, and in all these forms they create the document structure. Perhaps this is the view that the W3C is attempting to express when establishing the semantics for this specification.
Perhaps they are expressing that parseable character data (elements) is meant to describe, by means of its structure, some other character data that is already processed and thus already meaningful (PCDATA), and that unparsed character data (CDATA) is used to describe the parseable data with name = "value" pairs (attributes). Those pairs aren't intended to reflect on the document's structure (even though they might) and are therefore meaningless to the parser creating the DOM tree when it defines the document structure, even though they may have meaning to the application that uses the DOM tree as well as to human readers. (It should be noted that this does, in a sense, still affect the way applications and humans think of the document structure.) Whew!
I guess essentially I am trying to draw the conclusion that parsing means, in this context, that the data is created directly into and/or from a DOM tree; otherwise it is not considered parseable. Thus CDATA merely characterizes unparseable data, whether it is being used to describe name = "value" pairs or just represents data hitchhiking on the back of the XML and DOM documents through the parsing process (as CDATA sections, the "<![CDATA[ ]]>" delimiters, do) on the way to and from applications and human viewing.
Sorry. I had no idea it was going to take this long to express my theory. By no means do I feel I am absolutely correct just because I used a lot of words. Feel free to point out the obvious flaw(s) in my logic (if there are any).
Sorry again for the long-windedness
-Meadowlark Bradsher
Sun Certified Programmer for the Java 2 Platform.
IBM Certified Developer - XML and Related Technologies, V1.


[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]
[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]


Meadowlark Bradsher
SCJ2P, IBM XML V1, Series 7/63
Meadowlark Bradsher
Ranch Hand

Joined: Jan 23, 2001
Posts: 109
Wow! I see from the number of times it indicates above that I edited that previous post that I should have just been really careful and edited it perfectly the first time! I didn't know the message board was going to give away my "filtering" style of editing.

Ah well. You live and learn.
Meadowlark Bradsher
Sun Certified Programmer for the Java 2 Platform.
IBM Certified Developer - XML and Related Technologies, V1.
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Hi Meadowlark!
Mapraputa, I believe PCDATA stands for "parsed character data" not "parseable character data".
You are right. I borrowed the PCDATA definition from Norman Walsh and he probably used it in a non-strict sense.
but what that slight difference in terminology suggests to me (because it is in the past tense) is that the parser need not parse it because it is already parsed, perhaps by a human.
Emmm... I viewed parsed data as opposed to unparsed. Unparsed data are sent to an application "as is", without any change. If data are processed by a parser and the output may differ from the input, then these data are... parsed? I would also use "parseable" here - it gives a clue that they are subject to parsing, regardless of whether there is actually anything to parse.
But after some more reading I came to the conclusion that there are at least three ways a parser can handle input text:
1) CDATA sections - are sent to an application as is. No character or general entities are resolved, no whitespace is stripped.
2) attribute value normalization - that's what the W3C calls it. Normalization, not parsing. Character and general entities are resolved. Any leading and trailing spaces are discarded and sequences of space chars are replaced by one space char unless the attribute is declared as CDATA - then whitespace is preserved.
3) element content parsing (warning: the wording is mine) - the same as in the case of non-CDATA attributes, plus tags are recognized.
As you said, the difference between 2 and 3 may be that only in the last case is the DOM tree affected.
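The three cases can be put side by side in one small document. This is only a sketch, assuming Python's standard xml.dom.minidom (expat underneath, which reads attribute types from the internal DTD subset); the element and attribute names item, sub, free and tokens are invented for the illustration.

```python
from xml.dom import minidom

DOCUMENT = """<!DOCTYPE item [
  <!ATTLIST item
      free   CDATA    #IMPLIED
      tokens NMTOKENS #IMPLIED>
]>
<item free="  a &amp;   b  " tokens="  a    b  "><![CDATA[x &amp; <y>]]><sub>x &amp; y</sub></item>"""

item = minidom.parseString(DOCUMENT).documentElement
cdata_node, sub = item.childNodes

# 1) CDATA section: handed over as is, nothing is resolved.
print("CDATA section:", repr(cdata_node.data))

# 2) Attribute value normalization: references are resolved in both
#    attributes; only the non-CDATA one is trimmed and collapsed.
print("CDATA attribute:", repr(item.getAttribute("free")))
print("NMTOKENS attribute:", repr(item.getAttribute("tokens")))

# 3) Element content: references are resolved and tags are recognized,
#    which is why <sub> shows up as an element node of its own.
print("element content:", repr(sub.firstChild.data))
```

Only case 3 adds nodes to the tree: the sub element becomes a child of item, while both attribute values stay properties of item itself.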
It is an annoyance that CDATA is a synonym for the most basic Attribute Type and also at the same time as a section of unparseable code. There must be some reason for that because the specification seems carefully designed.
I agree with you 100% here. I also suspected there should be good reasons for such "ambiguity" and wondered what they are.
I thought that it would be better if there were a special term for case 2 in DTDs, and the first that came to my mind was NDATA - normalized data, but this term is already in use. OK, it could be NORMDATA - ugly, but no confusion with genuine CDATA.
Now the good news, Meadowlark: you do not need to be really careful and edit your posts perfectly the first time! You can be as careless as Mapraputa, for example; just do not forget to erase all previous lines "[This message has been edited by Meadowlark Bradsher (edited May 04, 2001).]" in your post when you edit it again. This way your post will have at most one sign of editing.
Thanks to you both, Meadowlark and Sanjay - I hope we help each other to make things clear. Or to confuse each other.
Meadowlark Bradsher
Ranch Hand

Joined: Jan 23, 2001
Posts: 109
Mapraputa,
Thanks for taking the time to involve yourself in a response. To me this is living proof that expressing my opinions or thoughts is even better than asking questions when it comes to different methods of learning, when you are in the right company that is. Often times I find that many people refuse to challenge or question what I've said, and I am sure I am not alone in that experience. So I often feel starved for productive discourse, and that it happens that way, I feel, needlessly.
Well..

Emmm... I viewed parsed data as opposed to unparsed. Unparsed data are sent to an application "as is", without any change. If data are processed by a parser and the output may differ from the input, then these data are... parsed? I would also use "parseable" here - it gives a clue that they are subject to parsing, regardless of whether there is actually anything to parse.
Perhaps the tense of the adjective (parsed or parseable) is not very significant. It was probably a naive point, but at least it served as a starting point for me to look into this question. Certainly PCDATA needs to be parsed to see if it contains markup if for that reason alone. Also with two identical tags (i.e. <name>Meadowlark</name><name>Mapraputa</name> ) the PCDATA (I have to be careful not to say "text") identifies the uniqueness of these tags (because they have different values in them). That seems like a parseable event.
Here are a couple of definitions for "text", "markup" and especially "character data" in the XML specification (I actually took them from the annotated version at www.xml.com):
http://www.xml.com/axml/target.html#syntax
2.4 Character Data and Markup
Text consists of intermingled character data and markup.
[Definition:] Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions.
[Definition:] All text that is not markup constitutes the character data of the document.

That doesn't help concretely define the secret of CDATA's dual definitions, but it does assist in making sure that "character data" (unless God forbid it also has dual definitions!) does not refer to markup. Perhaps to define markup as parsed or non-parsed would be moot because it is fundamental to the purpose of XML documents that it is parsed.
2) attribute value normalization - that's what the W3C calls it. Normalization, not parsing. Character and general entities are resolved. Any leading and trailing spaces are discarded and sequences of space chars are replaced by one space char unless the attribute is declared as CDATA - then whitespace is preserved.
Certainly to normalize the attribute, the parser must technically perform some form of parsing but not in the XML specific definition.
Here's one of Tim Bray's annotations to the XML specification. It's an annotation (you have to click on the annotation symbol as a link. I'm sure you know that but for others..) found on the first line of the second paragraph of this anchor.
http://www.xml.com/axml/target.html#sec-intro
Parsed and Unparsed
The use of the terms "parsed" and "unparsed" for entities may seem a bit on the obscure side. SGML uses the terms "text" and "data" for the same purposes, but we found that misleading, because it's all, once you get right down to it, data, and furthermore, some of the "data" entities might contain text. The only real difference between the two kinds of entities is whether an XML processor has to try and parse them or not; hence the names.
Another benefit of using "parsed" and "unparsed" is that this frees up the useful word text, which really ought to mean something in the XML context.
Copyright © 1998, Tim Bray. All rights reserved

All this still brings me about an inch short (or 2.54 centimeters) of being able to conclude that this explains the convention of using CDATA in these two forms.
However, I believe it does show that this elaboration you made in your initial response to Sanjay's question is possibly inaccurate according to the XML Specification:
Now, in DTD we have #PCDATA and CDATA keywords, which in my understanding have little to do with two previous concepts, since they both means parsable data.
As you stated, CDATA in a DTD is normalized, thus probably not defined as parsed.
Forgive me if I state the obvious in some areas as though it wasn't obvious. I am really trying to unravel for myself some of the mystique of XML's design. I can just follow the rules that the design establishes without questioning them and get along just fine but the comfort would come from the routine of following instead of the understanding of following. When understanding seems to be available I would naturally choose that.
Thanks again for clarifying some things for me, Mapraputa, and questioning some things I wrote. I hope that you or anyone else will always feel comfortable and compelled to do so. As I said it seems like the #1 fastest way for me to learn anything new.

Meadowlark Bradsher
Sun Certified Programmer for the Java 2 Platform.
IBM Certified Developer - XML and Related Technologies, V1.
P.S. Thanks for the tip. Now you don't know how many times I have edited this thing. (HINT: this time it would probably have been double digits. Mum's the word, right? )
P.P.S. Do you think I should drop the titles? I am actually in the market to find work and I was hoping to use them as bait.

[This message has been edited by Meadowlark Bradsher (edited May 05, 2001).]
Meadowlark Bradsher
Ranch Hand

Joined: Jan 23, 2001
Posts: 109
I wrote this e-mail to Tim Bray. He's one of the three editors on the XML specification and also writer of XML.Com's annotated XML specification.
Mr. Bray,

Perhaps you are busy but I was wondering if you would have a moment to clarify a little confusion occurring in a public forum. The confusion is over the use of the term "CDATA" as an attribute type. What design reason influenced its synonymous use with the description of an unparsed CDATA section?
http://www.javaranch.com/ubb/Forum31/HTML/000671.html


Thank you for your attention,
Meadowlark Bradsher

This was his response.
The terminology (which is indeed confusing) was inherited from
XML's ancestor, SGML. -Tim

I feel kind of dumb for hypothesizing so.
Life goes on.
-Meadowlark Bradsher
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
Ha, Meadowlark, while Mapraputa was writing her response, you already cleared up the problem! If you are interested in XML standards, there is a very good conference, XML-DEV: http://lists.xml.org/archives/xml-dev/
And on xml.com they have a nice weekly survey of this conference, the XML-Deviant column: http://www.xml.com/pub/q/xmldeviant - a good place to start. Maybe you already know about them.
I agree that such discussion groups are a very effective way of learning, the most effective I am aware of. During my SCJP preparations, I found JavaRanch, where people like Ajith showed us what learning is about, and since then I am stuck here.
Thanks for mentioning the annotated specification! I found it accidentally and liked it very much - the comments are entertaining. Unfortunately I later forgot about it. I should have posted a message, so everybody can enjoy it.
The titles should be OK; after all, our sheriff Ajith also uses them although he is not in the job market. Or maybe he is; in some sense we all are constantly in the job market.
Ajith Kallambella
Sheriff

Joined: Mar 17, 2000
Posts: 5782
Great discussion! I like it when people think aloud and you have no idea how many people are silently reading your writings.
About the titles, it is perfectly okay to have them ( as long as they are genuine ). Whether you're in the job market or not, they are like authenticity stamps. We at Javaranch often boast about the quality of visitors and certification titles definitely make us feel proud!
Cheers!
Ajith


Open Group Certified Distinguished IT Architect. Open Group Certified Master IT Architect. Sun Certified Architect (SCEA).
Meadowlark Bradsher
Ranch Hand

Joined: Jan 23, 2001
Posts: 109
Wow, Mapraputa, these 2 links are a great resource! No, I wasn't aware of them. I saw XML Deviant on XML.com but I hadn't learned what it was until now. Thanks a lot!
In a discussion, information that is just in the corner of your peripheral vision falls in the direct gaze of someone else, and in the end you really get a 360 degree view of what you know and what you think you know.
Thanks to you too Ajith, for your thoughtful appreciation.
How did I become a ranch hand by the way? Just by the number of times I have posted?
Anyway, I'm sure we'll talk again.
Meadowlark Bradsher
Sun Certified Programmer for the Java 2 Platform.
IBM Certified Developer - XML and Related Technologies, V1.
Sanjay Mishra
Ranch Hand

Joined: Jul 08, 2000
Posts: 84
Wow, It is much more than I asked for.
Thanks a lot, Mapraputa and Meadowlark.

This link will be a good resource for someone who
needs a clarification about "CDATA".
It also contains a neat way to keep your posting tidy,
no matter how many times you edit it.
How can I edit the subject? Many times I post a message with a misspelled subject, which is very annoying.

Thanks again
Sanjay
R K Singh
Ranch Hand

Joined: Oct 15, 2001
Posts: 5371
Let me try to understand what CDATA and PCDATA are.
Please correct me if I am wrong.
CDATA = Character Data [which will not be parsed]
PCDATA = Parsed Character Data.
If we consider following element :

Then I think <chapter> should be PCDATA
and <para> should be CDATA.
The contents inside <chapter> need to be parsed by the parser, while the contents inside <para> are normal text that does not need to be parsed.

The following line is the only line of a dtd file.
<!ELEMENT test (#CDATA) >

Please correct me if I am wrong.
'test' seems to be the root element.
The root element, I think, should be PCDATA; otherwise what is the use of XML if the root element itself contains non-parsed data?
Use of CDATA in case of attribute: [as Map has already said]
attribute value normalization – that’s how W3C calls it. Normalization, not parsing. Character and general entities are resolved. Any leading and trailing space are discarded and sequences of space chars are replaced by one space char unless attribute is declared as CDATA – then whitespace is preserved
I hope I understood it correctly.


"Thanks to Indian media who has over the period of time swiped out intellectual taste from mass Indian population." - Chetan Parekh
 