I was trying to create a DOM from an HTML source, which doesn't work because a lot of the tags are unbalanced. I used a tag balancer, which worked fine and closed all the tags, though I am a little skeptical about the correctness of the resulting HTML: will it display the same page in the browser, or will the balanced tags garble the layout? Anyway, once I have this balanced XHTML document, I am looking to convert it to a DOM structure that I can use for things like traversal and search. Any suggestions? Thanks in advance.
Some HTML files can never be properly balanced, since HTML as it is written in practice is much too lenient. For instance, browsers accept overlapping tags such as <b>bold <i>bold italic</b> italic</i>.
So how would you balance that? Once you encounter the </b>, do you close the <i> too and then start another <i>? That would work here, but there are real-life HTML documents that are much, MUCH harder to repair.
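To make the close-and-reopen idea concrete, here is a minimal Java sketch. It is only an illustration, not how any real tag balancer works: the class name OverlapFixer and the method rebalance are made up for this example, and it assumes tags have no attributes and that there are no void elements such as <br>.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
final class OverlapFixer {
    // Assumption: only simple tags without attributes, e.g. <b> or </i>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    public static String rebalance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy text before this tag
            last = m.end();
            String name = m.group(2).toLowerCase();
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it and emit it unchanged.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Closing tag: first close everything opened after it...
                List<String> reopen = new ArrayList<>();
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    reopen.add(inner);
                }
                open.pop();
                out.append("</").append(name).append('>');
                // ...then reopen those tags, outermost first.
                for (int i = reopen.size() - 1; i >= 0; i--) {
                    open.push(reopen.get(i));
                    out.append('<').append(reopen.get(i)).append('>');
                }
            }
            // A closing tag that was never opened is silently dropped.
        }
        out.append(html.substring(last));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Calling OverlapFixer.rebalance("<b>bold <i>both</b> italic</i>") returns "<b>bold <i>both</i></b><i> italic</i>". Whether a browser renders the repaired markup the same way as the original depends on the tags involved, which is exactly why the harder cases are harder.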
There are a number of libraries that convert HTML, even HTML that is not well-formed, into DOM documents. Google for TagSoup, JTidy and CyberNeko in particular.
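If the document has already been balanced into well-formed XHTML, the parser that ships with the JDK may be enough to get a DOM. The sketch below uses only the standard javax.xml.parsers API, not any of the libraries named above; the class name XhtmlToDom and its methods are made up for this example. It assumes the input has no DOCTYPE and no HTML-only entities such as &nbsp;, since a plain XML parser will reject those without the DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class XhtmlToDom {
    // Parses well-formed XHTML into a W3C DOM Document.
    // Checked exceptions are wrapped to keep the example short.
    public static Document parse(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (SAXException e) {
            // Thrown when the markup is not well-formed, e.g. overlapping tags.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simple search: counts all elements with the given tag name.
    public static int countElements(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }
}
```

For example, XhtmlToDom.parse("<html><body><p>one</p><p>two</p></body></html>") gives a Document whose root element is html, and countElements(doc, "p") on it returns 2. From there the usual org.w3c.dom calls (getChildNodes, getElementsByTagName, and so on) cover traversal and search.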