Removing Character References from Attributes

Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
When using Xerces, is there an efficient way to remove character references from attribute values? Below is an example:

<narrative text="There are some sentences here and these characters we want out&#xD;&#xA;Some more stuff here.">

I'm looking to parse the XML document and replace the "&#xD;&#xA;" with spaces.


Thanks,

James
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Seems to me that Xerces should do that without being asked. That's Attribute-Value Normalization. Are you saying it doesn't do that?

Umm... reading that section of the XML recommendation again, it appears that character references (as opposed to characters) are immune from the normalization rules. In which case you would indeed have to replace them yourself. Something like this?
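A minimal sketch of such a replacement, assuming the parser has already expanded `&#xD;&#xA;` into literal carriage-return and line-feed characters by the time the attribute value reaches your code. The class name `AttributeCleaner` and the method `clean` are hypothetical, and collapsing each CR/LF pair into a single space is an assumption about the desired output. It goes through the standard JAXP SAX API, so it should behave the same with Xerces as the configured parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: replaces line breaks that arrived in an attribute
// value via character references with spaces.
class AttributeCleaner {

    // Replaces each CR, LF, or CR+LF pair with one space.
    // Returns the original String untouched when there is nothing to replace.
    static String clean(String value) {
        if (value.indexOf('\r') < 0 && value.indexOf('\n') < 0) {
            return value; // fast path: no copy for the common case
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\r') {
                // Treat CR followed by LF as a single line break.
                if (i + 1 < value.length() && value.charAt(i + 1) == '\n') {
                    i++;
                }
                sb.append(' ');
            } else if (c == '\n') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<narrative text=\"first line&#xD;&#xA;second line\"/>";
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    System.out.println(atts.getQName(i) + "=" + clean(atts.getValue(i)));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```

The `indexOf` check up front means attribute values without line breaks are returned as-is, so no new String is allocated for them.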
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Thanks, Paul. That is what I found too. I was just hoping that I had missed something or that there was a setting somewhere. We are working with very large, multi-megabyte files, and I am trying to avoid running String manipulation/comparison routines on every attribute during processing.

I think this is a good example of poor XML design. Attributes should not have text (sentences) as values. In these cases, the text should be element content, not an attribute value. Unfortunately, the XML design is out of our control.

I'm thinking that we might use a UNIX sed/awk script to read through the file and replace these references in the XML document before passing it to the Xerces parser. I'm not sure how big an issue this is right now.
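For comparison, the same pre-processing idea can be sketched in Java instead of sed/awk. This is only an illustration under some assumptions: the class name `ReferenceStripper` is hypothetical, the references always appear as the exact text `&#xD;&#xA;` (other spellings such as `&#13;&#10;` or lowercase hex would not match), and replacing them everywhere in the file, not only inside attribute values, is acceptable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical pre-processor: rewrites the raw XML text before it is
// handed to the parser, one line at a time.
class ReferenceStripper {

    // Copies the input to the output, replacing each literal
    // "&#xD;&#xA;" sequence with a single space.
    static void strip(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line.replace("&#xD;&#xA;", " "));
            out.write('\n'); // line endings are normalized to LF
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<narrative text=\"first line&#xD;&#xA;second line\"/>";
        StringWriter result = new StringWriter();
        strip(new StringReader(xml), result);
        System.out.print(result);
    }
}
```

Because it works on a `Reader` and a `Writer`, the same method could be pointed at a file on disk and its cleaned copy, so the whole document never has to be held in memory at once.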
 