Could not parse well-formed HTML 4.01/XHTML 1.0 document using JDOM

Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi All,

I am having difficulty reading two well-formed HTML documents using JDOM when running offline (not connected to the Internet). The first few lines of these documents are listed below:



( i ) This program does not work even when it is running online (with Internet access). Execution exits at line 10, but I am not sure whether that line completes or not. I don't understand why.
( ii ) What is the difference between the two files as far as format goes? I thought HTML 4.01 was equivalent to XHTML 1.0. In other words, I assumed both are already well-formed and so can be parsed directly by an XML parser such as Xerces, without first using a tool such as Tidy to fill in missing tags. Is that right?
( iii ) Why are the tags in the former file in capitals? Do parsers in general distinguish between upper-case and lower-case tags?

I am very new to XML parsing and would appreciate some guidance.

This question has been posted on http://forums.sun.com/thread.jspa?threadID=5335817 to get different ideas on how best to resolve this issue.

Thanks a lot,
Jack
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi,

I am following a possible solution (http://devdiary.motime.com/post/471628/Why+implement+your+own+EntityResolver?) on how to redirect references to entities within an XML document to a local file, but I do not understand why it is not picking up the file being parsed (the former one). Below is the complete revised ZipcodeTidy2JDomParser, which now includes my own EntityResolver:

Can anyone see where this issue is coming from?
The author of the same thread suggests that line 5 should take an extra parameter (SAXBuilder saxBuilder = new SAXBuilder(false, "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd")). However, the SAXBuilder constructor does not accept the second parameter. Any ideas?
Many thanks,
Jack
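
For readers following the EntityResolver approach described above, here is a minimal sketch of the idea. It is not the poster's ZipcodeTidy2JDomParser (that listing is not shown here), and it uses the JDK's built-in JAXP DOM parser rather than JDOM so that it runs without extra libraries. The class name OfflineDtdParse and the choice to answer every DTD request with an empty DTD are assumptions made for illustration. As far as I know, JDOM 1.x exposes the same hook through SAXBuilder.setEntityResolver(...), and none of the SAXBuilder constructors takes a DTD URL.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class OfflineDtdParse {
    // Hypothetical resolver: answers every external entity request (such as the
    // DOCTYPE's DTD) with an empty document, so the parser never opens a network
    // connection. A real resolver would map the systemId to a local copy of the DTD.
    static final EntityResolver EMPTY_DTD =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // check well-formedness only, not validity
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(EMPTY_DTD);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
              + "<html><head><title>t</title></head><body><p>hello</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

One trade-off of the empty-DTD shortcut: entities that the real DTD defines, such as &nbsp;, become undefined, so a document that uses them would still fail to parse. Pointing the resolver at a local copy of the DTD avoids that.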
Yves Zoundi
Ranch Hand

Joined: Aug 31, 2008
Posts: 47
( i ) Printing the exception is a good way to start; then trace the error from there.

( ii ) XML is case sensitive, and HTML 4.01 is not XHTML. A document is well formed when its tags are balanced (opened and closed properly) and its attributes, comments, etc. are syntactically correct; basically, the document looks right. A document is valid when it conforms to a schema (a DTD, an XML Schema, or whatever is specified to enforce rules on the document). Reading some tutorials is probably your best bet.

( iii ) An XML parser will attempt to parse well-formed XML (balanced tags, case sensitive). Some XML parsers have limitations (validation support, schema handling, etc.). HTML parsers will parse HTML, and some of them can handle nasty HTML (incorrect or unknown tags, unbalanced tags, etc.). Basically, you need to choose the best tool for your needs (speed, usability, features).
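
To make points ( i ) and ( ii ) concrete, here is a small sketch, again using the JDK's built-in JAXP parser rather than JDOM. The class WellFormedCheck and its firstError helper are hypothetical names invented for this example. It catches the parse exception and reports where the parser gave up, then runs three snippets through it: one well-formed, one whose start and end tags differ only in case, and one with an unclosed br tag of the kind HTML 4.01 allows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {
    // Returns null when the text is well-formed XML; otherwise returns the
    // location and message of the first fatal error the parser reports.
    static String firstError(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Start and end tags match exactly, so no error is reported.
        System.out.println(firstError("<p>same case</p>"));
        // XML is case sensitive: </p> does not close <P>.
        System.out.println(firstError("<P>mixed case</p>"));
        // Legal in HTML 4.01, but <br> is never closed, so it is not well-formed XML.
        System.out.println(firstError("<p>line<br>break</p>"));
    }
}
```

The exact wording of the error messages depends on the parser implementation, which is why the sketch prints them instead of matching on them.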
[ October 03, 2008: Message edited by: Yves Zoundi ]

Author of VFSJFileChooser and XPontus XML Editor
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi Yves,

Thanks for the advice. I managed to solve most of the HTML-to-XML parsing issues by using a useful tool called Html2XML.

Please refer to http://forums.sun.com/thread.jspa?threadID=5335817&tstart=0 for details.

Cheers,

Jack