
Parsing HTML using Java

Mazhar Ismail
Greenhorn

Joined: Sep 10, 2008
Posts: 11
Hi,

I have a requirement to parse an HTML page and pull out the text from a specific HTML tag. This is the first time I have worked on this. I am able to read the tags and their ids, and also the complete text on the page, but I have no idea how to read the text enclosed in a specific tag. I have written my code below. I want to grab only the text within <td id="dept1">Sales</td>, i.e. "Sales" in this case. Please help me.





--
Mazhar

[ October 09, 2008: Message edited by: Mazhar Ismail ]

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39547
I would guess that you need to override the "handleText" method.


Mazhar Ismail
Greenhorn

Joined: Sep 10, 2008
Posts: 11
I tried overriding it, but I guess I am doing it wrong. Is there an example of how to do it?

Thanks,
Mazhar
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39547
How did you try (post the relevant code excerpt)? Was the method called? If so, what values did the parameters have?
Rene Larsen
Ranch Hand

Joined: Oct 12, 2001
Posts: 1179

An HTML page is basically an XML document, so you could try parsing the HTML page with a DOM or SAX parser.

Java API for XML Code Samples


Regards, Rene Larsen
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19541

Just remember that handleText is not required to handle all the text in a node in one go. Use StringBuilder to combine it; you can finish it in the handleEndTag method.
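Here is a minimal sketch of that idea. It assumes the parser in use is Swing's ParserDelegator with an HTMLEditorKit.ParserCallback (the method names in this thread match that API, but the original poster's code is not shown). The class name TagTextExtractor and the extract helper are made up for illustration. Text is collected in a StringBuilder while inside the wanted tag and finished in handleEndTag.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of the first element with a given tag and id.
public class TagTextExtractor extends HTMLEditorKit.ParserCallback {
    private final HTML.Tag targetTag;
    private final String targetId;
    private final StringBuilder buffer = new StringBuilder();
    private boolean inside = false;
    private String result = null;

    public TagTextExtractor(HTML.Tag targetTag, String targetId) {
        this.targetTag = targetTag;
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Start collecting only when both the tag and its id attribute match.
        if (result == null && tag == targetTag
                && targetId.equals(attrs.getAttribute(HTML.Attribute.ID))) {
            inside = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // May be called more than once per element, so append instead of assigning.
        if (inside) {
            buffer.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // The element is complete here; freeze the collected text.
        if (inside && tag == targetTag) {
            inside = false;
            result = buffer.toString().trim();
        }
    }

    // Returns the collected text, or null if no matching element was found.
    public static String extract(String html, HTML.Tag tag, String id) throws IOException {
        TagTextExtractor callback = new TagTextExtractor(tag, id);
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the page itself
            new ParserDelegator().parse(reader, callback, true);
        }
        return callback.result;
    }
}
```

Calling TagTextExtractor.extract(html, HTML.Tag.TD, "dept1") on a page containing <td id="dept1">Sales</td> should give back "Sales". The sketch does not handle a matching tag nested inside another tag of the same type; a depth counter would be needed for that.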


Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19541

Originally posted by Rene Larsen:
An HTML page is basically an XML document

If you're lucky. HTML allows improperly nested tags, missing end tags, missing quotes around attributes, and much more that is not allowed in XML.

That's why XHTML was invented. It's basically HTML that truly is XML. For instance, it requires <br> to be ended: <br />.
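If the page really is well-formed XHTML, the DOM route Rene suggested could look roughly like the sketch below. XhtmlCellReader and cellText are hypothetical names, and the input is assumed to have no DOCTYPE, so the parser does not try to fetch a DTD. Without a DTD the id attribute is not known to be of type ID, so the sketch scans the td elements instead of calling getElementById.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the text of a <td> with a given id from well-formed XHTML.
public class XhtmlCellReader {
    // Returns the cell text, or null if no <td> with that id exists.
    // Throws (typically a SAXException) if the markup is not well-formed XML.
    public static String cellText(String xhtml, String id) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Element cell = (Element) cells.item(i);
            if (id.equals(cell.getAttribute("id"))) {
                return cell.getTextContent().trim();
            }
        }
        return null;
    }
}
```

Feed the same method a page with an unclosed <br> and the XML parser rejects it with an exception, which is the risk with ordinary HTML.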
[ October 10, 2008: Message edited by: Rob Prime ]
 