Dealing with bad XML

 
Michael Zalewski
Ranch Hand
Posts: 168
I have a web server (which is supplied by an outside vendor), which outputs responses in XML. The problem is that the XML is bad, and occasionally cannot be parsed by my application, which uses Xerces.
The bad XML happens for two reasons.
1) PCDATA elements can come from a database, and have embedded & and <> characters. The server should wrap these elements in CDATA sections, or use entities, but it doesn't.
2) The XML header produced by the server claims the encoding is UTF-8, but it is really ISO-8859-1. So if the XML contains any non-ASCII characters, which it occasionally does, the parser gets messed up.
My question: What can I do?
My idea is that I should be able to catch the input stream before it reaches Xerces, and fix it up. For example, I could take the header <?xml .. encoding="UTF-8"?> and simply replace it with <?xml .. encoding="ISO-8859-1"?>. Since the element tags are all known and only 3 levels deep, I can escape the content of those elements that should contain entities ('&' becomes '&amp;', etc.).
But how do I do that? Or is there some other way?
(I know the best way would be to fix the web server that sends the bad XML. But there really is no good way to do that. I don't own the source to the servlet, and the template that this server uses does not contain very flexible string processing capabilities).
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13058
I think you are right - you will have to filter it before it gets to Xerces. Fortunately the java.io classes and XML parsers lend themselves to fooling around with stream content.
You should be able to create a new Reader to handle the unicode conversion...
Bill
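
A minimal sketch of the Reader idea, assuming the vendor's bytes really are ISO-8859-1. The class and method names (Latin1ReaderSketch, latin1Reader, readAll) are made up for illustration. The point is that once the bytes are decoded by a Reader with an explicit charset, the characters are already correct before any parser sees them:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Xerces or the vendor package.
final class Latin1ReaderSketch {

    // Decode the raw bytes as ISO-8859-1, whatever the XML header claims.
    static Reader latin1Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
    }

    // Drain a Reader into a String so the text can be inspected or repaired.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is e-acute in ISO-8859-1; on its own it is not valid UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};
        String text = readAll(latin1Reader(new ByteArrayInputStream(raw)));
        System.out.println(text.length()); // 4 characters, one per input byte
    }
}
```

In real use the InputStream would come from the HTTP connection rather than a byte array.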
 
Roseanne Zhang
Ranch Hand
Posts: 1953
Use regular expressions to do the text processing; it is fast and safe. Of course, you need to know what you are doing.
Another thing you can do is output the Xerces error messages to a file for further analysis. That will help you get the text processing right.
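
Here is a rough sketch of what the regular expression approach might look like with java.util.regex. The class XmlTextRepair and the element name "name" are hypothetical. It assumes the leaf elements are known in advance, carry no attributes, and never contain child elements, which may not hold for the real vendor output:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the element names must be supplied by the caller.
final class XmlTextRepair {

    // An '&' that does not already start a predefined or numeric entity.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Escape &, < and > inside every <tag>...</tag> pair.
    // Assumes the tag has no attributes and holds text only.
    static String escapeLeafText(String xml, String tag) {
        String q = Pattern.quote(tag);
        Pattern leaf = Pattern.compile("(<" + q + ">)(.*?)(</" + q + ">)", Pattern.DOTALL);
        Matcher m = leaf.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = escapeBareAmpersands(m.group(2))
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + text + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeLeafText("<r><name>A < B & C</name></r>", "name"));
    }
}
```

The ampersand pass leaves existing entities such as &amp; or &#38; alone, so running the repair twice should not double-escape anything.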
 
Michael Zalewski
Ranch Hand
Posts: 168
Originally posted by William Brogden:
I think you are right - you will have to filter it before it gets to Xerces. Fortunately the java.io classes and XML parsers lend themselves to fooling around with stream content.
You should be able to create a new Reader to handle the unicode conversion...
Bill

I have no problem writing my own Reader. In fact, I'd *love* to use Apache HttpClient to retrieve this XML, only because HttpClient has an effective timeout mechanism. But how do I hook my reader up to Xerces?
Here is the problem. The vendor who supplied the server which writes bad XML also supplied a Java package to read and parse this XML. The java package will create a rather long URL based on. The code supplied by the vendor, which I do have access to, does this:

Can I tell the DOMParser where to stick my reader?
[ November 08, 2002: Message edited by: Michael Zalewski ]
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13058
I have been thinking about that....
In the latest JAXP we have DocumentBuilder. DocumentBuilder wants an InputSource, so we need to look at that class. Behold! There is a constructor:
InputSource( java.io.Reader rd )
Cool!
My memory is a little vague on DOMParser - isn't there a parse() that takes a Reader?
Bill
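
A small sketch of the InputSource( Reader ) route through JAXP, for illustration only. The class name ReaderParseSketch is invented, and the StringReader stands in for whatever repaired Reader you end up writing. As far as I know, when a parser is handed a character stream it uses those characters directly and disregards the encoding named in the XML declaration, so the wrong "UTF-8" label should not matter here:

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper showing where a custom Reader plugs in.
final class ReaderParseSketch {

    // Hand the parser characters instead of bytes via InputSource(Reader).
    static Document parse(Reader reader) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the repaired stream; the header still says UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r><name>caf\u00e9</name></r>";
        Document doc = parse(new StringReader(xml));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The Xerces DOMParser also has a parse method that accepts an InputSource, so the same wrapping should work there if the vendor code can be pointed at it.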
 