An addition...I've just found Xerces-J. Does anyone know if it's appropriate? I'm guessing it's good quality, given that it's an Apache project...
[Update - 5 mins later]
It's no good (for what I want to do)...to quote their "common problems" section...
Unfortunately, HTML does not, in general, follow the XML grammar rules. Most HTML files do not meet the XML style guidelines. Therefore, the XML parser generates XML well-formedness errors.
(...)
HTML must match the XHTML standard for well-formedness before it can be parsed by Xerces-J or any other XML parser. You can find the XHTML standard on the W3C web site.
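To see what that quote means in practice, here's a rough sketch. Note the assumptions: it uses the JDK's built-in JAXP parser rather than the Xerces-J download itself, and the class and method names (WellFormedCheck, isWellFormed) are just made up for illustration. Any conforming XML parser should reject the first snippet for the same reason, since the unclosed <br> is fine in HTML but not well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Xerces-J or any other library.
public class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    public static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as a SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Typical HTML: <br> is never closed, so this is not well-formed XML.
        System.out.println(isWellFormed("<p>Hello<br>world</p>"));  // prints false
        // XHTML-style: <br/> is self-closing, so the parser accepts it.
        System.out.println(isWellFormed("<p>Hello<br/>world</p>")); // prints true
    }
}
```

So ordinary HTML would have to be cleaned up into XHTML first, which is exactly what I was hoping to avoid.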
Now I'm looking at Jericho (http://sourceforge.net/projects/jerichohtml/)...sounds like there's some potential there.
-Tim
[ April 02, 2004: Message edited by: Tim West ]