I am using kXml to parse a well-formed HTML page, and I am having a problem because this parser expands entity references in attribute values. Since the page that I am parsing is an HTML page, it contains something along these lines
As you can see, the parser reads the attribute value &itc=0 and thinks it is the beginning of an entity reference. It then falls over because it never finds the closing ; and complains that it could not resolve &itc
But as you can see, that is not an entity reference; it is a parameter passed to the page FooToos.aspx.
So, coming back to my question.
Has anyone modified the kXml source code so that it stops being too smart and no longer tries to expand every entity reference it encounters in attribute values?
I had a similar problem in my XHTML page, and it turned out to be a bug on my side. I would say that the file is not well formed: every literal "&" must be escaped as "&amp;", even within attribute values. The W3C validator will not like your page either.
If you want to use a literal ampersand in your document, you must encode it as "&amp;" (even inside URLs!).
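To see the rule in action without kXml, here is a minimal sketch that uses the XML parser bundled with the desktop JDK (not kXml, and not available on a MIDP device). The class name AmpersandCheck, the helper isWellFormed, and the query parameter x=1 are made up for illustration; only FooToos.aspx and &itc=0 come from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

public class AmpersandCheck {

    // Returns true if the JDK's XML parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Thrown for well-formedness errors such as an unterminated entity reference.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup: a bare "&" inside an attribute value.
        String bare = "<a href=\"FooToos.aspx?x=1&itc=0\">link</a>";
        // The same markup with the ampersand escaped.
        String escaped = "<a href=\"FooToos.aspx?x=1&amp;itc=0\">link</a>";

        System.out.println("bare ampersand well formed?    " + isWellFormed(bare));
        System.out.println("escaped ampersand well formed? " + isWellFormed(escaped));
    }
}
```

The first input is rejected because the parser reads &itc as the start of an entity reference that never ends in a semicolon, which is the same complaint kXml makes.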
[ July 11, 2004: Message edited by: Alexander Traud ]
Thanks for the reply. The only problem in my case is that the HTML page I am parsing was not developed by me. It belongs to another web site, so I have no control over the HTML they generate.
Off the top of my head, it doesn't sound like a great deal of work. For an HTTP GET, it is probably relatively straightforward, and with some googling one could probably find plenty of helper packages and APIs for converting HTML to XHTML.
Theoretically, a MIDlet could do the conversion to XHTML too. But MIDlet size and memory (e.g. for large HTML pages) might be problematic.
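As a much smaller step than a full HTML to XHTML conversion, one could escape only the bare ampersands before handing the text to the parser. The following is a rough sketch under stated assumptions, not anything from kXml: the class AmpersandEscaper and its methods are hypothetical names, and it sticks to String and StringBuffer (no java.util.regex) in the hope that it would also compile on a CLDC device. It is a heuristic: anything that looks like "&name;" or "&#123;" is left alone, so a page that really contains an undefined reference such as "&itc;" would still fail.

```java
public class AmpersandEscaper {

    // Replaces every "&" that does not start something shaped like an
    // entity or character reference ("&name;", "&#123;", "&#x1F;") with "&amp;".
    static String escapeBareAmpersands(String in) {
        StringBuffer out = new StringBuffer(in.length() + 16);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '&' && !looksLikeReference(in, i + 1)) {
                out.append("&amp;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // True if the text starting at pos is an optional '#', then one or more
    // ASCII letters or digits, then ';'. Deliberately loose; it does not
    // validate the reference any further.
    static boolean looksLikeReference(String s, int pos) {
        int i = pos;
        if (i < s.length() && s.charAt(i) == '#') {
            i++;
        }
        int start = i;
        while (i < s.length() && isAsciiLetterOrDigit(s.charAt(i))) {
            i++;
        }
        return i > start && i < s.length() && s.charAt(i) == ';';
    }

    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // Hypothetical URL; only FooToos.aspx and &itc=0 come from the thread.
        System.out.println(escapeBareAmpersands("FooToos.aspx?x=1&itc=0"));
        // Already-escaped text passes through unchanged.
        System.out.println(escapeBareAmpersands("Tom &amp; Jerry &#169; 2004"));
    }
}
```

This works character by character, so it could be applied to the page in chunks if memory is tight, as long as a chunk boundary does not fall in the middle of a reference.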
Your mileage may vary :-). james
But I talked with the site that provides the HTML page and asked whether they could give me a web service instead of having me parse HTML pages. They told me that they can send an XML reply (rather than HTML). So I am back in the game, now parsing the XML document, though I had to ditch all the old code that I had written to parse the HTML page.