I am having difficulty reading two well-formed HTML documents using JDOM when running offline (no Internet access). The first few lines of these documents are listed below:
( i ) This program does not work even when running online (with Internet access). Execution exits at line 10, but I am not sure whether that line completes. I don't understand why.
( ii ) What is the difference between the two files as far as format goes? I thought HTML 4.01 was equivalent to XHTML 1.0. In other words, both are already well-formed, so they can be parsed directly by an XML parser such as Xerces, and it is not necessary to use a tool such as Tidy to clean up missing tags?
( iii ) Why are the tags in the former file in capitals? Do parsers in general distinguish upper-case tags from lower-case ones?
I am very new to XML parsing and would appreciate some guidance.
Could anyone see where this issue is coming from? The author of the same thread suggests that line 5 should take an extra parameter (SAXBuilder saxBuilder = new SAXBuilder(false, "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd")). However, the SAXBuilder constructor does not accept the second parameter. Any ideas? Many thanks, Jack
( i ) Printing the exception (message and stack trace) is a good way to start; then trace the error from there.
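To make point ( i ) concrete, here is a minimal sketch of catching the parse exception and reporting where it happened. It uses the XML parser bundled with the JDK rather than JDOM, so that it runs without extra jars; the class name `ParseErrorReport` and the method `describe` are made up for illustration and are not from the original program.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original post): reports the first fatal parse error.
class ParseErrorReport {

    /** Returns "OK" if the text parses, otherwise a description of the failure. */
    static String describe(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return "OK";
        } catch (SAXParseException e) {
            // SAXParseException carries the position of the problem in the input.
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // Anything else (I/O, parser configuration): fall back to the exception text.
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // <body> is never closed, so the parser should complain.
        System.out.println(describe("<html>\n<body>\n</html>"));
    }
}
```

With JDOM the idea is the same: wrap the build call in a try/catch and print the exception instead of letting the program exit silently.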
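( ii ) XML is case sensitive, and HTML 4.01 is not XHTML: HTML 4.01 is an SGML application whose tag names are case-insensitive and whose end tags may sometimes be omitted, so an HTML 4.01 page is not necessarily well-formed XML. A document is well formed when its markup is syntactically correct (tags opened and closed properly, attributes quoted, comments terminated, and so on; basically the document looks right). A document is valid when it also conforms to a schema (a DTD, an XML Schema, or whatever is specified to enforce rules on the document). Reading some tutorials is probably your best bet.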
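As a rough illustration of the well-formedness point, the sketch below feeds a few small fragments to the JDK's built-in XML parser (not JDOM) and reports whether each one parses. The class `WellFormedCheck` and the method `isWellFormed` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original post): checks XML well-formedness only.
class WellFormedCheck {

    /** Returns true if the text is well-formed XML; no DTD or schema validation is done. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // keep stderr quiet
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>text</p>"));        // matching case
        System.out.println(isWellFormed("<P>text</p>"));        // <P> and </p> are different names in XML
        System.out.println(isWellFormed("<p>one<br>two</p>"));  // HTML-style <br> is never closed
        System.out.println(isWellFormed("<p>one<br/>two</p>")); // XHTML-style empty element
    }
}
```

The second and third fragments are acceptable HTML 4.01 but should be rejected by an XML parser, which is why a cleanup tool such as Tidy is often still needed.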
( iii ) An XML parser will attempt to parse well-formed XML (balanced tags, case sensitive). Some XML parsers have limitations (validation support, schema handling, etc.). HTML parsers will parse HTML, and some of them can handle nasty HTML (incorrect or unknown tags, unbalanced tags, etc.). Basically you need to choose the best tool for your needs (speed, usability, features). [ October 03, 2008: Message edited by: Yves Zoundi ]
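On the original offline problem: one common cause, and this is only an assumption since no stack trace was posted, is that the parser tries to download the DTD named in the DOCTYPE even when it is not validating. A minimal sketch of one workaround is to install an EntityResolver that answers every external entity request with empty content. It uses the JDK's built-in parser; `OfflineDtdParse` and `parseWithoutFetchingDtd` are hypothetical names, and the XHTML snippet is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not from the original post): parses without fetching external DTDs.
class OfflineDtdParse {

    static Document parseWithoutFetchingDtd(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity lookup (including the DOCTYPE's DTD)
            // with an empty stream, so no network access is attempted.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><head><title>t</title></head><body/></html>";
        Document doc = parseWithoutFetchingDtd(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

JDOM's SAXBuilder has a setEntityResolver method that should accept the same kind of resolver. One caveat: named entities such as &nbsp; are declared in the DTD, so with an empty DTD they will fail to resolve; in that case the resolver could return a local copy of the DTD instead.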