Removing Character References from Attributes

Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
When using Xerces, is there an efficient way to remove character references from attribute values? Below is an example:

<narrative text="There are some sentences here and these characters we want out&#xD;&#xA;Some more stuff here.">

I'm looking to parse the XML document and replace the "&#xD;&#xA;" with spaces.


Thanks,

James
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18909
    

Seems to me that Xerces should do that without being asked. That's Attribute-Value Normalization. Are you saying it doesn't do that?

Umm... reading that section of the XML recommendation again, it appears that character references (as opposed to characters) are immune from the normalization rules. In which case you would indeed have to replace them yourself. Something like this?
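The following is not the code from the reply above, only a hedged sketch of one way to do the replacement during parsing. It assumes a SAX pipeline built on the standard JAXP/SAX classes (which Xerces implements) and relies on the point just made: by the time the parser reports an attribute, literal line breaks have already been normalized to spaces, so any CR or LF still present in the value came from a character reference. The class name `CrLfAttributeFilter` and the helper `firstAttributeValue` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter; not part of Xerces or JAXP.
class CrLfAttributeFilter extends XMLFilterImpl {

    CrLfAttributeFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        AttributesImpl copy = null; // only allocated if some value needs changing
        for (int i = 0; i < atts.getLength(); i++) {
            String value = atts.getValue(i);
            if (value.indexOf('\r') >= 0 || value.indexOf('\n') >= 0) {
                if (copy == null) {
                    copy = new AttributesImpl(atts);
                }
                // Each CR and each LF becomes one space.
                copy.setValue(i, value.replace('\r', ' ').replace('\n', ' '));
            }
        }
        super.startElement(uri, localName, qName, copy != null ? copy : atts);
    }

    // Demo helper (hypothetical): parses a string and returns the first attribute value seen.
    static String firstAttributeValue(String xml) {
        final String[] seen = new String[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CrLfAttributeFilter filter = new CrLfAttributeFilter(reader);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    if (seen[0] == null && a.getLength() > 0) {
                        seen[0] = a.getValue(0);
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Because each character is replaced separately, a `&#xD;&#xA;` pair ends up as two spaces. Values without CR or LF are passed through untouched, so the only per-attribute cost in the common case is the two `indexOf` scans.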
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Thanks, Paul. That is what I found too. I was just hoping that I had missed something or that there was a setting somewhere. We are working with very large, multi-megabyte files, and I am trying to avoid running String manipulation/comparison routines on every attribute during processing.

I think this is a good example of poor XML design. Attributes should not have text (sentences) as values. In these cases, the text should be element content, not an attribute value. Unfortunately, the XML design is out of our control.

I'm thinking that we might use a UNIX sed/awk program to read through the file and replace these references in the XML document before sending it to the Xerces parse routine. I'm not sure how big an issue this is right now.
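The same pre-parse pass could also be done in Java instead of sed/awk. The sketch below is only an illustration under assumptions: the class name `CrLfReferenceStripper` and its methods are made up, and it matches only the exact spellings `&#xD;` and `&#xA;` from the example, not other valid forms such as `&#13;`, `&#10;` or `&#x0D;`. It also works on the raw text, so it replaces those references anywhere in the document, not just inside attribute values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class; the name and methods are illustrative only.
class CrLfReferenceStripper {

    // Copies in to out one character at a time, writing a space in place of
    // each literal "&#xD;" or "&#xA;" and everything else unchanged.
    static void strip(Reader in, Writer out) throws IOException {
        StringBuilder pending = new StringBuilder(); // holds a possible reference
        int c;
        while ((c = in.read()) != -1) {
            if (c == '&') {
                // A new reference starts: flush whatever was pending before it.
                out.append(pending);
                pending.setLength(0);
                pending.append('&');
            } else if (pending.length() > 0) {
                pending.append((char) c);
                // Both targets are exactly 5 characters long and end in ';'.
                if (c == ';' || pending.length() >= 5) {
                    String token = pending.toString();
                    boolean match = token.equals("&#xD;") || token.equals("&#xA;");
                    out.write(match ? " " : token);
                    pending.setLength(0);
                }
            } else {
                out.write(c);
            }
        }
        out.append(pending); // flush an unterminated tail, if any
    }

    // Convenience overload for small inputs and tests.
    static String strip(String xml) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(xml), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For real files, the `Reader` and `Writer` would normally be buffered and opened with the same character encoding the XML document declares. Whether this is faster than fixing the values in the parser callback would have to be measured on the actual data.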
 