I might use jWebUnit for making sense of the HTML. It puts a nice API on top of the page, which is easier to use than dealing with the markup as XML. Don't be put off that it's billed as a testing tool - using it to access HTML pages works just fine. Actually, I think it may use JTidy underneath as well.
I’ve looked at a lot of different solutions, and in my humble opinion Aspose is the way to go. Here’s the link: http://aspose.com