parse a web page using Java

Nilesh Vijaywargiay
Greenhorn

Joined: Mar 27, 2010
Posts: 7
Hi,

I am working on parsing a web page using Java. I have searched the web and read about the various parsers: HTML Parser, JTidy, Jericho, etc.
I am confused about which parser to use.

Basically, I have to parse a page, for example eBay, and then retrieve the results for a given query. For example, if "laptop" is the query, I want to be able to retrieve the various results returned by the server.

Do I have to use a third-party API, or does Java provide something that would be handy for my problem?

Thanks much!
Nilesh
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Your best bet is a parser; which to use is pretty much up to you.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42367
    
The easiest is probably to use one of the parsers that create valid XML (like TagSoup) and then to treat the problem as an XML processing issue. That way you can use XPath or XQuery. You may also want to check out HtmlUnit which handles the HTML retrieval as well as the HTML parsing.
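
To make the "treat it as XML" idea concrete, here is a minimal sketch that uses only the JDK's built-in DOM and XPath classes. It assumes the HTML has already been converted to well-formed XML by a tool such as TagSoup; that conversion step is not shown. The class name, the sample markup, and the `result` class attribute are made up for illustration and are not taken from any real site. Real converter output may also place elements in the XHTML namespace, in which case the XPath expression would need namespace handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of TagSoup, HtmlUnit, or Jericho.
class XPathTextExtractor {

    // Returns the trimmed text content of every node matched by the XPath expression.
    // The input must already be well-formed XML (e.g. cleaned-up HTML).
    static List<String> extractText(String wellFormedXml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked exception.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a cleaned-up search results page.
        String xml = "<html><body>"
                + "<div class=\"result\"><a href=\"/item/1\">Laptop A</a></div>"
                + "<div class=\"result\"><a href=\"/item/2\">Laptop B</a></div>"
                + "</body></html>";
        System.out.println(extractText(xml, "//div[@class='result']/a"));
    }
}
```

The same helper would also return the text between plain `<div>` or `<a>` tags if given an expression such as `//div` or `//a`.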


Nilesh Vijaywargiay
Greenhorn

Joined: Mar 27, 2010
Posts: 7
Thanks David and Ulf.

I am a newbie in this field, so I was looking for a parser with good documentation so that I can get the hang of it and use it in the future without any problems.
Could you suggest a decent parser with good documentation?

Thanks much!
Nilesh
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42367
    
  64
Good documentation is not always available for open source projects - you get what you pay for. I suggest investigating the libraries I mentioned and seeing whether you run into any problems.
Nilesh Vijaywargiay
Greenhorn

Joined: Mar 27, 2010
Posts: 7
Hi, thanks. I've run into a problem.

I am stuck at a point where I have to retrieve the text between particular tags:
<div ... > I want to be retrieved </div>
<a> I want to be retrieved </a>

Any suggestions? I am currently using the Jericho parser.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

I'd suggest looking at its documentation and examples; there are *many* examples of doing precisely what you're trying to do.

http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Element.html
Nilesh Vijaywargiay
Greenhorn

Joined: Mar 27, 2010
Posts: 7
Thanks a ton David!!! Was able to finish off the task!! Appreciate the help
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Great--glad you got it working :)
Nilesh Vijaywargiay
Greenhorn

Joined: Mar 27, 2010
Posts: 7
David, one more question.

The results I am getting through the program are not always in the same order as they appear on the website. This doesn't happen every time, but I am wondering why.

Is it because the query is being fired from a program rather than a browser? I tried setting the currentCompatibilityConfig of the Jericho parser to MOZILLA and IE, but the results didn't change. I have read that you have to set a user agent. I couldn't find a way to do that in the Jericho parser. Any suggestions?

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
    
No longer a "beginning" topic. Moving thread.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

I have no idea--I've never used it. Personally, I'd get the HTML source using something else that would allow me to set the user agent, handle cookies, etc.

As for which search results you get and in what order, that would depend entirely on the website you're trying to scrape (please make sure you're not violating anybody's terms of service). It could depend on the user agent, cookies, previous searches, moon phase... anything.
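
For what it's worth, here is a rough sketch of fetching the page source yourself with the JDK's `HttpURLConnection`, so that you control the `User-Agent` header, and then handing the resulting string to whichever parser you like. The class name, the example URL, and the user agent string are placeholders, not anything a particular site requires. The body is assumed to be UTF-8, and `readAllBytes` needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class PageFetcherSketch {

    // Prepares a connection with a custom User-Agent header.
    // No network traffic happens until the response is actually read.
    static HttpURLConnection openWithUserAgent(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);    // milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the request and returns the response body as a string.
    // Assumption: the page is encoded as UTF-8.
    static String readBody(HttpURLConnection conn) throws IOException {
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // Installs an in-memory cookie store so cookies set by the server
        // are sent back on later requests made through java.net.
        CookieHandler.setDefault(new CookieManager());

        HttpURLConnection conn = openWithUserAgent(
                "http://example.com/search?q=laptop", // placeholder URL
                "Mozilla/5.0 (compatible; MyCrawler/0.1)"); // placeholder user agent
        System.out.println(conn.getRequestProperty("User-Agent"));
        // Calling readBody(conn) here would perform the actual HTTP request.
    }
}
```

Even with the same user agent and cookies as a browser, a site may still order results differently from one request to the next, so this only removes one possible cause.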
 
 