Entities in attribute values issue in Sax parser

John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Hi,
I am using a SAX parser, and to deal with entities I override the startEntity() and endEntity() methods. The issue is that if attribute values contain entities such as &mdash; and &copy;, those entities are not reported through these two methods.
For example, given <person name="John&mdash;Jai" />, the long dash between John and Jai is not reported, which results in weird symbols in the output XML when I try to write it. Any help on how to handle entities present in attribute values would be appreciated.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

Which SAX parser are you using? I ask because numerous bugs have been reported in the parsers built into some recent versions of Java, particularly in the area of attribute handling.

And when you said you overrode those methods, presumably you didn't override methods of the parser? Did you override methods of something else, like maybe ContentHandler?

And finally, you do have a DTD which declares those entities, do you not? If you didn't, then your XML would not be well-formed.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
I use DefaultHandler2 with the SAX parser provided by xercesImpl.jar. My startEntity() and endEntity() methods override those of DefaultHandler2. My JDK is 1.6_25. Yes, I have a DTD, and I use it in the resolveEntity() method. Below is the code I use to output the attribute values into the child XML (it has some custom logic).
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
I am also posting the code that I use to read the input XML with an XMLReader.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Hi,
Any suggestions on handling entities that occur in attribute values? Even a suggestion on how I can approach the issue would be very helpful. I tried to search for and replace the &ensp; in the attribute value, but it does not show up as that string, so I am unable to compare and replace it.
Thanks,
John
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

I missed the explanation of why you decided it was necessary for you to "handle" entities in attribute values. If it were up to me I would let the parser do what it normally does, which is to expand them into whatever the DTD says it should.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul, the parser puts a ? in place of an &ensp; in the attribute value. Since I use the attribute value for another purpose (highlighting text in a GUI), I am very much stuck here. If I could at least read it as a string like &ensp;, I could find and replace it. It is also not being read as a whole (which I doubt it should be, because of the UTF-8 encoded XMLReader?). Anyway, the DTD has all the entities, and the XML was UTF-8 encoded when it was created.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

John Jai wrote:Paul, the parser puts a ? in place of a &ensp; in the attribute value.


Are you sure it's not just you screwing up the encoding later in your code? I'm willing to bet you are testing that statement by writing that character somewhere and looking at it in that other place.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
This is an integration testing issue. The XML is a production XML, and the issue occurs only with attribute values. I am just curious whether the SAX parser does not handle or report attribute values. Or might my handling of attributes be wrong? Should I not handle them in the startElement() method?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

The startElement() method is the right place to handle attributes. Of course you do have to handle them correctly... for example, if an attribute value includes an en-dash character or anything else which can't be expressed in (say) ISO-8859-1, and then later you write that value out encoded in ISO-8859-1, then you are going to see a question mark.

Which is why I suggested you might be at fault and not the parser. The question mark definitely suggests an encoding failure, and the parser doesn't do any character encoding.
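
The question-mark symptom described above can be reproduced without any parser at all. The following is a minimal sketch, assuming a modern JDK (StandardCharsets is not available on the 1.6 JDK mentioned in this thread); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class UnmappableCharDemo {

    /** Encodes text as ISO-8859-1, then decodes the bytes with the same charset. */
    static String roundTripLatin1(String text) {
        // getBytes() substitutes the charset's replacement byte ('?')
        // for every character that ISO-8859-1 cannot represent.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String value = "John\u2014Jai"; // U+2014 is the em-dash
        System.out.println(roundTripLatin1(value)); // the dash is lost during encoding
    }
}
```

The damage happens at encoding time, so the original character cannot be recovered from the bytes afterwards.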
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
I think the above reply makes sense. I may be writing &mdash; in UTF-8 format, which might not support it. I am going to investigate whether the encoding supports the respective entities. Thanks for your continued support.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

UTF-8 is an encoding which can handle any Unicode character, including the m-dash character. But it's quite possible that you think you are using UTF-8 but you are actually using some other encoding somewhere.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
You mean to say I use a different encoding while writing? I use an OutputStreamWriter with UTF-8 encoding, as below;
the buffer I use is a CharArrayWriter.

At least after getting the attribute value using the "UTF-8" format as advised, I was able to get rid of the ? and got some junk value in place of the entities instead. So I believe the character encodings now match when reading and writing the XML, and only the writer does not know how to output the entity value to the output XML file.
The change I made to get rid of the ? is below:


Now these are my input and output XML snippets:


Input XML attribute
=============
guideword-prefix="Sec. & ensp ;" // (read the "& ensp ;" without spaces. Coderanch editor replaces it with no value)

Output XML attribute
==============
guideword-prefix="Sec. "

Previously output xml was like
=====================
guideword-prefix="Sec.?"


Thanks,
john
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

Yes, it's what I thought. When you look at the data which was output, you aren't using UTF-8 to decode it. So your output is correct but you aren't looking at it right. The hack which you described in that last post shows that clearly: you encode the data using UTF-8 and then you decode it using your system's default encoding. That produces the "â€" garbage which is a clear symptom of UTF-8 data being misinterpreted (you'll see that all over the Web where people have pasted in data from e.g. Microsoft Word but haven't been careful about their encodings).
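
The garbage described above is easy to reproduce. Here is a small sketch, assuming the machine's default encoding is windows-1252 (the Cp1252 mentioned later in the thread); the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    /** Encodes text as UTF-8, then wrongly decodes the bytes as windows-1252. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The em-dash (U+2014) is three bytes in UTF-8: E2 80 94.
        // Read as windows-1252, each byte turns into a separate character.
        String garbled = misread("\u2014");
        garbled.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The three characters produced start with U+00E2 and U+20AC, which is the "â€" prefix that shows up whenever UTF-8 data is read with the wrong charset.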
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
I am not familiar with encodings. They are new to me.
Paul Clapham wrote:When you look at the data which was output, you aren't using UTF-8 to decode it.

1. Since the XML is encoded with UTF-8, do I need to decode it again using UTF-8? If so, I am using an XMLReader with a UTF-8 encoded input stream, and I also call
String.getBytes("UTF-8") when getting the attribute values. Shouldn't that decode them back from UTF-8?
2. I need to write the XML as UTF-8 encoded. I hope I am at least doing this right by using an OutputStreamWriter with UTF-8 encoding (shown in my previous post).
Paul Clapham wrote: you encode the data using UTF-8 and then you decode it using your system's default encoding. That produces the "â€" garbage which is a clear symptom of UTF-8 data being misinterpreted.

I hope "decode" here means reading from the XML, and that the issue is not in writing it to the new XML. (Although the characters appear in the new XML, I also get this junk value when I print with System.out.println in the startElement() method, so I am sure I am reading the XML wrongly somewhere.) I will check on this issue.

Thanks, Paul, for enlightening me this far.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

"Encoding" means converting Unicode characters to bytes. "Decoding" is the reverse process. All Strings in Java are composed of Unicode characters, but data in files is stored in bytes, so you encode your Java data when writing it to a file and decode it when reading it from a file.

Your OutputStreamWriter is correct, yes.
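
To make the encode/decode distinction concrete, here is a minimal round-trip sketch. It writes to an in-memory byte array rather than a file, and the class and method names are assumptions for illustration, not the code from the earlier post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {

    /** Encoding: Unicode characters in, UTF-8 bytes out. */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decoding: UTF-8 bytes in, Unicode characters out. */
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("John\u2014Jai");
        System.out.println(bytes.length + " bytes");
        System.out.println(decode(bytes).equals("John\u2014Jai"));
    }
}
```

As long as the same charset is named on both sides, the String that comes back is identical to the one that went in, and no extra getBytes() step is needed in between.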

Your code which attempts to "fix" the encoding of attribute values is not correct. You should get the attribute values from the parser as Unicode characters, and there should be no need to encode them with one encoding and then decode them with a different encoding.

As for reading XML, it's the parser's responsibility to decode the data. Normally if the parser is given a stream of bytes (like some kind of an InputStream) it will look at the prolog of the XML document and figure out what encoding to use. Of course if the document's declared encoding doesn't match the encoding which was actually used to produce the document, then problems may ensue. The technical term for people who do that sort of thing is "bozos".

However if you pass the parser a stream of chars (like some kind of a Reader) then the parser won't be able to do the decoding, since the Reader will already have done that. So if you choose a Reader with the wrong encoding, then problems may ensue.
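
The difference between handing the parser bytes and handing it chars can be sketched as follows. This assumes the JDK's built-in SAX parser; the document, class, and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamVersusReaderDemo {

    // A document that is declared as, and really encoded in, ISO-8859-1.
    static final byte[] DOC =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><person name=\"Ren\u00E9\"/>"
            .getBytes(StandardCharsets.ISO_8859_1);

    /** Parses the source and returns the root element's name attribute. */
    static String nameAttribute(InputSource source) {
        final String[] result = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    result[0] = atts.getValue("name");
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    /** Bytes in: the parser reads the XML declaration and decodes correctly. */
    static String fromByteStream() {
        return nameAttribute(new InputSource(new ByteArrayInputStream(DOC)));
    }

    /** Chars in: the Reader has already decoded, here with the wrong charset. */
    static String fromMismatchedReader() {
        return nameAttribute(new InputSource(
            new InputStreamReader(new ByteArrayInputStream(DOC), StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(fromByteStream().equals("Ren\u00E9"));       // intact
        System.out.println(fromMismatchedReader().equals("Ren\u00E9")); // damaged
    }
}
```

In the second case the parser never sees the original byte; the Reader has already replaced it with U+FFFD, and the encoding declaration in the prolog can no longer help.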

I should warn you that by writing your own XML data from a StringBuffer (my guess) to a file, you are in danger of being a bozo yourself. I don't know what's in the XML_DECLARATION variable but it should match the actual encoding (UTF-8) which you are using to write the data. Whenever possible you should try to use standard XML-processing code to write XML.

Right now you are encountering problems because you are writing out data and then reading it in and observing things which appear wrong. So there are three possible things that could cause that wrongness: (1) Incorrect data being written out (2) Damage caused by the writing-out process (3) Damage caused by the reading-in process. And it's difficult to determine which of the three it is. Solution: Inspect the potentially problematic data before writing it out. Examine the actual Unicode value of the char in question, in your Java code, to see if it matches the Unicode code point for an em-dash or whatever the character is.
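
One way to do that inspection is to print code points instead of the characters themselves, so the console's own encoding cannot mislead you. A small sketch with a hypothetical helper name:

```java
class CodePointInspector {

    /** Renders each character of the value as U+XXXX, separated by spaces. */
    static String describe(String value) {
        StringBuilder sb = new StringBuilder();
        value.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // In startElement() this would be called with atts.getValue(i).
        System.out.println(describe("Sec.\u2002"));
    }
}
```

If the attribute value shows U+2002 for the en-space or U+2014 for the em-dash at this point, the parser delivered the right character and any later damage comes from the writing or viewing step.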
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
Since I am at home now, I wrote a custom SAX parser and XML files to reproduce the issue. When I used the same encoding (my PC's Cp1252 encoding) to create the input XML, to read the input XML, and to write the output XML, I no longer saw the weird symbol. I could also make that weird symbol appear in the output simply by encoding with "UTF-8" instead. Below are the code and XML files I used to reproduce the issue. Please delete any of the below if you feel it is irrelevant to the thread.

Sample XML
========
<?xml version="1.0" encoding="Cp1252"?>
<!DOCTYPE persons SYSTEM "G:/WorkDirectory/sampledtd.dtd">
<persons>
<person name="John" age="&ensp;26">
<attribute>Good &ensp; Guy</attribute>
</person>
</persons>

The DTD
======
<!ELEMENT persons (person+)>
<!ELEMENT person (attribute)>
<!ELEMENT attribute (#PCDATA)>
<!ATTLIST person name CDATA #REQUIRED>
<!ATTLIST person age CDATA #REQUIRED>
<!ENTITY ensp " ">

The Output XML
===========
<?xml version="1.0" encoding="Cp1252"?>
<persons>
<person name="John" age=" 26">
<attribute>Good   Guy</attribute>
</person>
</persons>

The SAX parser code below

1. I can now confirm that the weird symbol was due to different encodings in my application code (the real one at the office).
2. Notice that the issue of printing entities still persists. The two &ensp; references I placed in the input XML go missing in the output XML. I can restore the second one, in the attribute element (I should not have used "attribute" as an element name amid all this confusion), by using the startEntity() and endEntity() methods. I did that successfully on my local machine. But the &ensp; in the age attribute's value never shows up. I posted this code just to show that I have no way of printing entities that occur in attribute values, as opposed to those that occur in element content. See below: using the endEntity() method I printed the entity between Good and Guy, but the one before the age value does not appear. Also, if you look closely, you can see a white space where the actual entity is hiding.
<edit> Removed the quote; in any case the CodeRanch editor removes the &ensp; entity shown in the Good Guy value. </edit>
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

Looks like you are making progress.

As far as displaying things, the forum is probably interpreting those entities as if they were the actual HTML entity instead of as text. If you want to type them accurately in your posts and have them show up as entities rather than being interpreted as HTML, then you have to escape them.

For example if you want the string "X&ensp;Y" to appear then you have to escape the ampersand at the beginning like this: "X&amp;ensp;Y". If you don't do that then it will be treated as an actual en-space and come out like this: "X Y".

There's a preview button available. Try those things, learning about escaping is useful for XML authors.
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
Any thoughts on why the entities are not showing up in the file? Should some escape like "&ensp;" be written in the code? Even when I print the values on the console, there is only an empty string (or rather one character) in place of the whole "&ensp;" entity. Please guide me to where I am making a mistake.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

Normally I would expect the parser to translate the entities into the actual characters they represent. I wouldn't expect to see the string "&ensp;" in the result, I would expect to see whatever Unicode character the en-space is assigned to.
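
This expansion can be observed directly. The sketch below assumes the JDK's built-in SAX parser and a made-up document with an internal DTD subset that declares ensp as U+2002. It also records startEntity() calls, because the Javadoc for org.xml.sax.ext.LexicalHandler states that entity boundaries inside attribute values cannot be reported; such references are silently expanded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class AttributeEntityDemo {

    // Illustrative document: one &ensp; in an attribute, one in element content.
    static final String DOC =
        "<!DOCTYPE person [<!ENTITY ensp \"&#8194;\">]>"
        + "<person age=\"&ensp;26\">Good&ensp;Guy</person>";

    /** Records the age attribute and the names passed to startEntity(). */
    static class Recorder extends DefaultHandler2 {
        String age;
        final List<String> entities = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            age = atts.getValue("age");
        }

        @Override
        public void startEntity(String name) {
            entities.add(name);
        }
    }

    static Recorder parse() {
        Recorder recorder = new Recorder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(recorder);
            // startEntity()/endEntity() belong to LexicalHandler, which must be
            // registered through this standard SAX2 property.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", recorder);
            reader.parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return recorder;
    }

    public static void main(String[] args) {
        Recorder r = parse();
        System.out.println(r.age.equals("\u200226")); // attribute holds the real character
        System.out.println(r.entities);               // no event for the attribute reference
    }
}
```

So the absence of startEntity() calls for attribute values is expected behaviour, not a sign that the value was lost: the attribute string already contains the expanded character.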
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
I fear you will beat me if I ask how to write the Unicode characters back into the output XML as the entities they represent. Will you?
OK, I corrected the mistake in the DTD. I had assigned &ensp; to &#160; and not to &amp;#160;, so it was not appearing in the output XML file. Now it works like a charm, as below:
<person name="John" age="&#160;26">
<attribute>Good &#160; Guy</attribute>

OK, now I changed the &ensp; value in the DTD to &#1600000; and this junk value came through in the output XML (just to test whether what I put in is what comes out).
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Paul,
Can you help me write the values back as &ensp; in the output XML file? It currently has the values specified by the DTD. While browsing I found that the SAXHandler class has setExpandingEntities(false), which stops the parser from expanding entities. Should I use that, or can it be done in the plain SAX code I wrote?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

John Jai wrote:Can you help me write the values back as &ensp; in the output xml file.


I don't know why you would want to do that. After all the document with the DTD and the entity names is identical, logically, to the document with no DTD and entity values. The main reason for the version with the DTD is to make it convenient for humans to produce the document.

In other words the only reason to insist on the entity names is to make the output pretty. It doesn't make any difference to XML-processing code whether it looks pretty or not, though.

So the question is, just how important is it to you to make the document look pretty?

Bear in mind that there is usually no way to tell whether a Unicode character U+00A0 given to you by the parser was entered as &nbsp; or as &#xa0; or as &#160; or in some other way. Those are all conveniences for the person at the keyboard and they are all equivalent as far as XML-processing software is concerned. You could certainly write code which automatically converts the character U+00A0 to the six characters &nbsp; before you write it out to your XML document, but there's no guarantee that it actually was represented by &nbsp; in the original document.
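
Such a conversion could look roughly like the sketch below. The mapping table, class name, and method name are assumptions chosen for illustration; the output document would still need a DTD that declares whichever names you emit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EntityNameWriter {

    // Hypothetical mapping from code points to entity names;
    // extend it with whatever your DTD declares.
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(0x00A0, "nbsp");
        NAMES.put(0x2002, "ensp");
        NAMES.put(0x2014, "mdash");
        NAMES.put(0x00A9, "copy");
    }

    /** Escapes an attribute value, writing mapped characters as named entities. */
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder();
        value.codePoints().forEach(cp -> {
            String name = NAMES.get(cp);
            if (name != null) {
                sb.append('&').append(name).append(';');
            } else if (cp == '&') {
                sb.append("&amp;");
            } else if (cp == '<') {
                sb.append("&lt;");
            } else if (cp == '"') {
                sb.append("&quot;");
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("guideword-prefix=\"" + escapeAttribute("Sec.\u2002") + "\"");
    }
}
```

Every occurrence of a mapped character is written as its entity name, including ones that were typed as numeric references or raw characters in the input, which is exactly the caveat described above.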
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Hi Paul,
The XML files will be fed into another application that highlights the contents of the XML. That existing code is written to handle &ensp;, and it does not work when it encounters weird symbols. But the weird symbols could have been caused by the earlier encoding issue. I will send the corrected XML files (with values for the entities) and let you know if they still face any issues.

Thanks again for helping!
 