
Parsing HTML using Java

 
Mazhar Ismail
Greenhorn
Posts: 11
Hi,

I need to parse an HTML page and pull out the text from a specific HTML tag. This is the first time I have worked on this. I can read the tags and their ids, and also the complete text on the page, but I have no idea how to read the text enclosed in a specific tag. I have written my code below. I want to grab only the text within <td id="dept1">Sales</td>, i.e. "Sales" in this case. Please help me.





--
Mazhar

[ October 09, 2008: Message edited by: Mazhar Ismail ]

 
Ulf Dittmer
Rancher
Pie
Posts: 42966
73
I would guess that you need to override the "handleText" method.
 
Mazhar Ismail
Greenhorn
Posts: 11
I tried overriding it, but I guess I am doing it wrong. Is there an example of how to do it?

Thanks,
Mazhar
 
Ulf Dittmer
Rancher
Pie
Posts: 42966
73
How did you try (post the relevant code excerpt)? Was the method called? If so, what values did the parameters have?
 
Rene Larsen
Ranch Hand
Posts: 1179
Eclipse IDE Mac OS X
An HTML page is basically an XML document; you could try parsing the HTML page using a DOM or SAX parser.

Java API for XML Code Samples
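As a rough sketch of the DOM route, assuming the page really is well-formed XML (which is not guaranteed for HTML found in the wild): the class name XhtmlTdLookup, the method textOfTd, and the sample markup are made up for illustration. It uses the JDK's DocumentBuilder together with an XPath expression to find the td by its id.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: works only if the input is well-formed XML.
public class XhtmlTdLookup {

    // Returns the text of the first <td> whose id attribute equals the given id.
    // Assumes the id contains no quote characters, since it is spliced into the XPath.
    public static String textOfTd(String xhtml, String id) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first matching node
        return xpath.evaluate("//td[@id='" + id + "']", doc);
    }

    public static void main(String[] args) throws Exception {
        // Sample markup invented for this sketch (no DOCTYPE, no namespace)
        String page = "<html><body><table><tr>"
                + "<td id=\"dept1\">Sales</td><td id=\"dept2\">Support</td>"
                + "</tr></table></body></html>";
        System.out.println(textOfTd(page, "dept1"));
    }
}
```

If the page has unclosed tags or unquoted attributes, builder.parse will throw a SAXException instead of returning a document, so this approach only fits input you know is clean.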
 
Rob Spoor
Sheriff
Pie
Posts: 20380
45
Chrome Eclipse IDE Java Windows
Just remember that handleText is not required to handle all the text in a node in one go. Use StringBuilder to combine it; you can finish it in the handleEndTag method.
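Here is a minimal sketch of that idea, assuming the callback in question is javax.swing.text.html.HTMLEditorKit.ParserCallback driven by ParserDelegator (the original poster's code is not shown, so this is a guess at the setup). The class name TdTextExtractor, the extract helper, and the sample markup are hypothetical. It starts collecting in handleStartTag when it sees the td with the wanted id, appends in handleText, and finishes in handleEndTag.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that collects the text of one <td>, selected by id.
public class TdTextExtractor extends HTMLEditorKit.ParserCallback {
    private final String targetId;
    private final StringBuilder text = new StringBuilder();
    private boolean inside = false;
    private String result; // stays null if no matching <td> is found

    public TdTextExtractor(String targetId) {
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Start collecting when the wanted <td id="..."> opens
        if (tag == HTML.Tag.TD && targetId.equals(attrs.getAttribute(HTML.Attribute.ID))) {
            inside = true;
            text.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // May be called more than once per node, so append rather than assign
        if (inside) {
            text.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // Simplification: the first </td> ends collection, so nested tables are not handled
        if (inside && tag == HTML.Tag.TD) {
            inside = false;
            result = text.toString();
        }
    }

    public static String extract(String html, String id) throws IOException {
        TdTextExtractor callback = new TdTextExtractor(id);
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return callback.result;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup invented for this sketch
        String html = "<html><body><table><tr>"
                + "<td id=\"dept1\">Sales</td><td id=\"dept2\">Support</td>"
                + "</tr></table></body></html>";
        System.out.println(extract(html, "dept1"));
    }
}
```

To read from a real page you would pass a Reader over the page's stream instead of the StringReader used here.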
 
Rob Spoor
Sheriff
Pie
Posts: 20380
45
Chrome Eclipse IDE Java Windows
Originally posted by Rene Larsen:
An HTML page is basically an XML document

If you're lucky. HTML allows improperly nested tags, missing end tags, missing quotes around attributes, and much more that is not allowed in XML.

That's why XHTML was invented. It's basically HTML that truly is XML. For instance, it requires <br> to be closed: <br />.
[ October 10, 2008: Message edited by: Rob Prime ]
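To make the difference concrete, here is a small sketch that feeds a few fragments to the JDK's XML parser and reports whether each one is well-formed. The class name WellFormedCheck and the sample fragments are invented for illustration; the check is only about XML well-formedness, not about validity against any XHTML DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a fragment parses as XML.
public class WellFormedCheck {

    public static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Unclosed <br>: acceptable in HTML, rejected by an XML parser
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));
        // Closed <br />: the XHTML form
        System.out.println(isWellFormed("<p>line one<br />line two</p>"));
        // Unquoted attribute value: also rejected by an XML parser
        System.out.println(isWellFormed("<td id=dept1>Sales</td>"));
    }
}
```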
 