Parsing HTML using Java

Mazhar Ismail
Greenhorn

Joined: Sep 10, 2008
Posts: 11
Hi,

I need to parse an HTML page and pull out the text from a specific HTML tag. This is the first time I have worked on this. I am able to read the tags and their IDs, and also the complete text on the page, but I have no idea how to read the text enclosed in a specific tag. I have written my code below. I want to grab only the text within <td id="dept1">Sales</td>, i.e., "Sales" in this case. Please help me.





--
Mazhar

[ October 09, 2008: Message edited by: Mazhar Ismail ]

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 35237
    
I would guess that you need to override the "handleText" method.


Mazhar Ismail
Greenhorn

Joined: Sep 10, 2008
Posts: 11
I tried overriding it, but I guess I am doing it wrong. Is there an example of how to do it?

Thanks,
Mazhar
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 35237
    
How did you try (post the relevant code excerpt)? Was the method called? If so, what values did the parameters have?
Rene Larsen
Ranch Hand

Joined: Oct 12, 2001
Posts: 1179

An HTML page is basically an XML document, so you could try parsing the HTML page using a DOM or SAX parser.

Java API for XML Code Samples
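
For illustration, here is a minimal sketch of the DOM approach using the JAXP classes that ship with the JDK. It only works when the page is well-formed XML (for example XHTML without unclosed tags). The class name DomCellReader and the method textOfCell are made up for this example and are not from the original post.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the text of a <td> with a given id via DOM.
class DomCellReader {

    // Returns the text of the first <td> whose id matches, or null if none does.
    static String textOfCell(String xhtml, String id) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Walk all <td> elements and compare their id attribute.
            NodeList cells = doc.getElementsByTagName("td");
            for (int i = 0; i < cells.getLength(); i++) {
                Element cell = (Element) cells.item(i);
                if (id.equals(cell.getAttribute("id"))) {
                    return cell.getTextContent();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><table><tr>"
                + "<td id=\"dept1\">Sales</td><td id=\"dept2\">Marketing</td>"
                + "</tr></table></body></html>";
        System.out.println(textOfCell(page, "dept1"));
    }
}
```

If the page is ordinary, sloppy HTML, the parse call will fail, so this route is only an option when you control the markup or know it is valid XHTML.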


Regards, Rene Larsen
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19216

Just remember that handleText is not required to handle all the text in a node in one go. Use StringBuilder to combine it; you can finish it in the handleEndTag method.
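
A rough sketch of that idea, assuming the original code used the Swing HTML parser (HTMLEditorKit.ParserCallback with ParserDelegator), which is where handleText and handleEndTag come from. The class name TagTextExtractor and the extract method are invented for this example, and it does not try to cope with nested tables.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: collects the text inside <td id="..."> for one id.
class TagTextExtractor extends HTMLEditorKit.ParserCallback {
    private final String targetId;
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideTarget = false;
    private String result;

    TagTextExtractor(String targetId) {
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Start collecting when the <td> with the wanted id opens.
        if (tag == HTML.Tag.TD
                && targetId.equals(attrs.getAttribute(HTML.Attribute.ID))) {
            insideTarget = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // May be called more than once per node, so append instead of assign.
        if (insideTarget) {
            buffer.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // The closing </td> marks the end of the text we care about.
        if (insideTarget && tag == HTML.Tag.TD) {
            result = buffer.toString();
            insideTarget = false;
        }
    }

    // Parses the given HTML and returns the text of the matching cell, or null.
    static String extract(String html, String id) {
        TagTextExtractor callback = new TagTextExtractor(id);
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.result;
    }

    public static void main(String[] args) {
        String page = "<html><body><table><tr>"
                + "<td id=\"dept1\">Sales</td><td id=\"dept2\">Marketing</td>"
                + "</tr></table></body></html>";
        System.out.println(extract(page, "dept1"));
    }
}
```

The flag set in handleStartTag is what ties the text to a specific tag: handleText by itself has no idea which element it belongs to.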


Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19216

Originally posted by Rene Larsen:
An HTML page is basically an XML document

If you're lucky. HTML allows improperly nested tags, missing end tags, missing quotes around attributes, and much more that is not allowed in XML.

That's why XHTML was invented. It's basically HTML that truly is XML. For instance, it requires <br> to be closed: <br />.
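
To see the difference in practice, here is a small sketch that feeds markup to the JDK's XML parser and reports whether it is accepted. The class name WellFormedCheck and the method isWellFormedXml are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: checks whether a snippet is acceptable to an XML parser.
class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for unclosed tags, unquoted attributes, and similar errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML style: <br> is never closed.
        System.out.println(isWellFormedXml("<p>line one<br>line two</p>"));
        // XHTML style: <br /> closes itself.
        System.out.println(isWellFormedXml("<p>line one<br />line two</p>"));
        // HTML style: attribute value without quotes.
        System.out.println(isWellFormedXml("<td id=dept1>Sales</td>"));
    }
}
```

Only the XHTML-style snippet gets through, which is why an XML parser is a gamble on pages you did not write yourself.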
[ October 10, 2008: Message edited by: Rob Prime ]
 