Removing Character References from Attributes

Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
When using Xerces, is there an efficient way to remove character references from attribute values? Below is an example:

<narrative text="There are some sentences here and these characters we want out&#xD;&#xA;Some more stuff here.">

I'm looking to parse the XML document and replace the "&#xD;&#xA;" with spaces.


Paul Clapham

Joined: Oct 14, 2005
Posts: 18541

Seems to me that Xerces should do that without being asked. That's Attribute-Value Normalization. Are you saying it doesn't do that?

Umm... reading that section of the XML recommendation again, it appears that character references (as opposed to characters) are immune from the normalization rules. In which case you would indeed have to replace them yourself. Something like this?
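A minimal sketch of that kind of manual replacement, not necessarily what was originally posted. It assumes a SAX handler and the JAXP parser bundled with the JDK (itself derived from Xerces). After parsing, the references arrive as literal CR and LF characters in the attribute value, so the replacement works on those characters rather than on the "&#xD;&#xA;" text. The class name, the helper names, and the choice to collapse each CR LF pair into one space are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are not from the thread.
class AttributeLineBreakCleaner {

    /** Replaces each CR LF pair, or a lone CR or LF, with a single space. */
    static String replaceLineBreaks(String value) {
        // Fast path: most attribute values contain no line breaks,
        // so return the original String without allocating a copy.
        if (value.indexOf('\r') < 0 && value.indexOf('\n') < 0) {
            return value;
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\r') {
                // Treat CR LF as one line break and skip the LF.
                if (i + 1 < value.length() && value.charAt(i + 1) == '\n') {
                    i++;
                }
                sb.append(' ');
            } else if (c == '\n') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Parses the XML and collects the cleaned value of every "text" attribute. */
    static List<String> cleanedTextAttributes(String xml) throws Exception {
        List<String> cleaned = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String text = atts.getValue("text");
                if (text != null) {
                    cleaned.add(replaceLineBreaks(text));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return cleaned;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<narrative text=\"first line&#xD;&#xA;second line\"/>";
        System.out.println(cleanedTextAttributes(xml)); // prints [first line second line]
    }
}
```

The fast path means values without line breaks cost only two indexOf scans, which may matter on multi-MB inputs.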
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Thanks Paul. That is what I found too. I was just hoping that I had missed something or that there was a setting somewhere. We are working with very large multi-MB files, and I am trying to avoid adding String manipulation/comparison routines when processing attributes.

I think this is a good example of poor XML design. Attributes should not have text (sentences) as values. In these cases, the text should be element content, not an attribute value. Unfortunately, the XML design is out of our control.

I'm thinking that we might use a UNIX sed/awk program to read through the file and replace these references in the XML document before sending it to the Xerces parse routine. I'm not sure how big an issue this is right now.
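For comparison, the same pre-parse substitution can be sketched in Java instead of sed/awk. This is only an illustration under assumptions: the class and method names are hypothetical, the input is processed line by line so the whole file is never held in memory, and the literal text "&#xD;&#xA;" is replaced everywhere it occurs, including element content and CDATA sections, not just attribute values.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical pre-filter; the names are not from the thread.
class CharRefPreFilter {

    // The literal reference pair as it appears in the raw, unparsed file.
    static final String TARGET = "&#xD;&#xA;";

    /** Copies in to out, replacing each literal TARGET with one space. */
    static void filter(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line.replace(TARGET, " "));
            // readLine() strips the terminator, so physical line endings
            // are rewritten as LF.
            writer.write('\n');
        }
        writer.flush();
    }

    /** Convenience wrapper for small inputs held in memory. */
    static String filterString(String xml) {
        StringWriter result = new StringWriter();
        try {
            filter(new StringReader(xml), result);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.print(filterString("<narrative text=\"one&#xD;&#xA;two\"/>"));
    }
}
```

For real files, filter could be called with a FileReader and FileWriter, or its output piped straight into the parser, so the cleanup stays a separate pass ahead of Xerces.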