JTidy sounds cool. I have also used the Quiotix HTML Parser. It builds a DOM and provides a Visitor interface for walking it, along with some sample visitors.
Was that the original question, or were you trying to get the HTML from a server in the first place? Here's an example of doing that with java.net.URL:
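Something along these lines should work as a minimal sketch. The class name PageFetcher and the method fetch are just names I made up for illustration, and I'm assuming the page is UTF-8 text, which a real server may not honor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an illustration, not from any library.
public class PageFetcher {

    // Reads everything the URL returns and hands it back as one String.
    // Assumption: the content is UTF-8 text.
    public static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Calling PageFetcher.fetch with the address of a page returns the raw HTML, which you can then hand to JTidy or whatever parser you like.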
You have to know the URL you're after, so it won't automatically grab all the content of a site. You could grab a page, parse it, look for links, grab the linked pages, parse them, and so on. Watch for circular links, and watch for a ticked-off webmaster who doesn't appreciate you taking expensive MIPS and bandwidth from the regular customers while copying copyrighted material.
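The grab, parse, follow-links loop could be sketched roughly like this. TinyCrawler and its crawl method are hypothetical names, the regex is a crude stand-in for a real parser such as JTidy or Quiotix, and the fetcher is passed in as a function so you can plug in whatever does the actual download. The visited set is what keeps circular links from looping forever, and maxPages is a simple cap so you don't hammer somebody's server.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not taken from any library.
public class TinyCrawler {

    // Crude href matcher for illustration only; a real HTML parser is more robust.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the URLs attempted, in the order they were first visited.
    // fetcher maps a URL to its HTML, or null if the page could not be read.
    public static Set<String> crawl(String start, int maxPages,
                                    Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already seen, so circular links stop here
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue;
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Resolve relative links against the page they came from.
                    String link = URI.create(url).resolve(m.group(1).trim()).toString();
                    if (!visited.contains(link)) {
                        queue.add(link);
                    }
                } catch (IllegalArgumentException e) {
                    // Malformed href; skip it.
                }
            }
        }
        return visited;
    }
}
```

A polite version would also restrict itself to one host and pause between requests; I left those out to keep the sketch short.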
Some sites that WANT you to do this use RSS publishing. Neat trend.
[ July 03, 2003: Message edited by: Stan James ]