Creating a DOM from html

Gautam Velpula
Greenhorn

Joined: Aug 04, 2008
Posts: 13
I was trying to create a DOM from an HTML source, which doesn't work because many of the tags are unbalanced.
I used a tag balancer, which worked fine and closed all the tags, though I am a little skeptical about the correctness of the resulting HTML: will it display the same page in the browser, or will the balanced tags garble the layout?
Anyway, once I have this balanced XHTML document, I want to convert it to a DOM structure that I can use for traversal, search, etc.
Any suggestions?
Thanks in advance.
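
For the traversal and search use case described in this post, a minimal sketch using only the JDK's built-in XML parser might look like the following. It assumes the balanced document really is well-formed XML; the class name XhtmlDomDemo, its method names, and the sample markup (which omits any DOCTYPE or namespace declaration) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
final class XhtmlDomDemo {

    /** Parses a well-formed XHTML string into a W3C DOM document. */
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Traversal: counts element nodes by walking the tree recursively. */
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    /** Search: collects the href attribute of every <a> element that has one. */
    static List<String> linkTargets(Document doc) {
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>See <a href=\"a.html\">A</a> and "
                + "<a href=\"b.html\">B</a>.</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(countElements(doc)); // prints 5 (html, body, p, a, a)
        System.out.println(linkTargets(doc));   // prints [a.html, b.html]
    }
}
```

If the balanced output is not strictly well-formed, parse() throws here, which is the failure the rest of the thread deals with.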
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19649
    
  18

Some HTML files can never be properly balanced, since HTML is much too lenient. For instance, it allows tags to overlap:

<b>bold <i>bold italic</b> italic</i>

So how would you balance this? Once you encounter the </b>, close the <i> too and then start another <i>? That would work here, but there are other real-life HTML documents that will be much, MUCH harder.
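
The close-and-reopen idea described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: tags are found with a regular expression, comments, scripts and CDATA are not treated specially, the list of void elements is partial, and the class name OverlapFixer is invented here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class OverlapFixer {

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Elements that never take a closing tag (partial list, for illustration only).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    static String fix(String html) {
        StringBuilder out = new StringBuilder();
        // Each entry: {lower-cased name for matching, name as written, full opening tag text}.
        Deque<String[]> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy the text between tags unchanged
            pos = m.end();
            String key = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                out.append(m.group());
                if (!VOID.contains(key) && !m.group().endsWith("/>")) {
                    open.push(new String[] {key, m.group(2), m.group()});
                }
            } else if (isOpen(open, key)) {
                // Close everything opened after the matching tag, innermost first.
                List<String[]> reopen = new ArrayList<>();
                while (!open.peek()[0].equals(key)) {
                    String[] inner = open.pop();
                    out.append("</").append(inner[1]).append('>');
                    reopen.add(0, inner); // remember them in their original opening order
                }
                out.append("</").append(open.pop()[1]).append('>');
                // Reopen the interrupted elements so the text that follows keeps its markup.
                for (String[] inner : reopen) {
                    out.append(inner[2]);
                    open.push(inner);
                }
            }
            // A closing tag without a matching opening tag is dropped.
        }
        out.append(html.substring(pos));
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()[1]).append('>'); // close whatever is still open
        }
        return out.toString();
    }

    private static boolean isOpen(Deque<String[]> open, String key) {
        for (String[] entry : open) {
            if (entry[0].equals(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(fix("<b>bold <i>both</b> italic</i>"));
        // prints <b>bold <i>both</i></b><i> italic</i>
    }
}
```

Note that a reopened tag is emitted with its original attributes, so an id attribute would be duplicated; real tag balancers handle many more cases than this.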

Once you've done this (I suggest taking a look at javax.swing.text.html.parser.ParserDelegator, with a custom HTMLEditorKit.ParserCallback instance), you can use libraries like JDOM to create the DOM tree.
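
As a rough sketch of the ParserDelegator suggestion, the callback below builds a tree while the Swing parser reports tags and text. To stay within the JDK it creates an org.w3c.dom document rather than a JDOM one, which is a substitution made for this example; the class name SwingHtmlToDom and the recovery rule in handleEndTag (close up to the nearest matching open element, ignore anything else) are assumptions, not something the Swing API prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Enumeration;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of any library.
final class SwingHtmlToDom {

    static Document parse(String html) {
        final Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        // Create the root up front so the tree always has exactly one document element.
        final Element root = doc.createElement("html");
        doc.appendChild(root);

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private Node current = root; // element that receives the next child

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.HTML) {
                    copyAttributes(attrs, root); // root already exists
                    return;
                }
                current = current.appendChild(newElement(tag, attrs));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    return; // skip simple tags that are flagged as end tags
                }
                current.appendChild(newElement(tag, attrs));
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Close up to the nearest open element with this name; ignore strays.
                for (Node n = current; n != null && n != root; n = n.getParentNode()) {
                    if (n.getNodeName().equals(tag.toString())) {
                        current = n.getParentNode();
                        return;
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                current.appendChild(doc.createTextNode(new String(data)));
            }

            private Element newElement(HTML.Tag tag, MutableAttributeSet attrs) {
                Element element = doc.createElement(tag.toString());
                copyAttributes(attrs, element);
                return element;
            }

            private void copyAttributes(MutableAttributeSet attrs, Element target) {
                for (Enumeration<?> names = attrs.getAttributeNames(); names.hasMoreElements();) {
                    Object name = names.nextElement();
                    if (HTMLEditorKit.ParserCallback.IMPLIED.equals(name)) {
                        continue; // marker for tags the parser synthesized itself
                    }
                    try {
                        target.setAttribute(name.toString(), String.valueOf(attrs.getAttribute(name)));
                    } catch (DOMException e) {
                        // Attribute name is not a legal XML name; skip it.
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello <b>world</b></p></body></html>");
        System.out.println(doc.getElementsByTagName("b").getLength());
    }
}
```

The Swing parser is based on an old HTML DTD, so how it reports badly nested or modern markup is best checked against the actual input before relying on the resulting tree.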


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41017
    
  43
Welcome to JavaRanch.

There are a number of libraries that convert HTML, even HTML that is not well-formed, into DOM documents. Search for TagSoup, JTidy and CyberNeko (NekoHTML) in particular.


Gautam Velpula
Greenhorn

Joined: Aug 04, 2008
Posts: 13
I have used NekoHTML to balance the tags.
The resulting file is still an unparsable document.

Do you think trying to create an HTML DOM is a wise direction, or should I look for alternatives?

Thanks for the replies.
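
When balanced output still fails to parse, it can help to see exactly where the XML parser gives up. The sketch below, which uses only the JDK and an invented class name WellFormedCheck, returns the position and message of the first fatal error. Typical culprits, offered here as general possibilities rather than a diagnosis of this particular file, are HTML entities such as &nbsp; that are not declared in XML and unescaped & characters.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
final class WellFormedCheck {

    /** Returns null if the markup parses as XML, otherwise a description of the first fatal error. */
    static String firstError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstError("<p>fine</p>"));
        System.out.println(firstError("<p>a &nbsp; b</p>"));
    }
}
```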
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41017
    
  43
I guess NekoHTML is a bit weaker than the other ones - both JTidy and TagSoup claim to produce DOM objects.
Gautam Velpula
Greenhorn

Joined: Aug 04, 2008
Posts: 13
I will try them. Thanks