parse a web page using Java

 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Hi,

I am working on parsing a web page using Java. I have searched the web and read about the various parsers: HTML Parser, JTidy, Jericho, etc.
I am unsure which parser to use.

Basically, I have to parse a page (for example, an eBay page) and then retrieve the results for a given query. For example, if "laptop" is the query, I want to be able to retrieve the various results returned by the server.

Do I have to use a third-party API, or does Java provide something that would be handy for my problem?

Thanks much!
Nilesh
 
David Newton
Author
Posts: 12617
Your best bet is a parser; which to use is pretty much up to you.
 
Ulf Dittmer
Master Rancher
Posts: 43045
The easiest is probably to use one of the parsers that create valid XML (like TagSoup) and then to treat the problem as an XML processing issue. That way you can use XPath or XQuery. You may also want to check out HtmlUnit which handles the HTML retrieval as well as the HTML parsing.
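
To illustrate the "treat it as an XML problem" idea, here is a minimal sketch that uses only the XPath support built into the JDK. It assumes the page has already been converted to well-formed XML (for instance by a tool like TagSoup) and that each result is a link inside a div with class "result". That markup, the class name XPathResultsSketch, and the method resultTitles are all made up for illustration. Real converter output may also carry an XML namespace, in which case the query would need to be namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names and the assumed page structure are illustrative only.
final class XPathResultsSketch {

    // Returns the link text of every <a> directly inside <div class="result">,
    // in document order. The input must already be well-formed XML.
    static List<String> resultTitles(String wellFormedXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate(
                "//div[@class='result']/a", doc, XPathConstants.NODESET);

        List<String> titles = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            titles.add(links.item(i).getTextContent().trim());
        }
        return titles;
    }
}
```

Changing what gets extracted then only means changing the XPath expression, which is the main attraction of this approach.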
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Thanks David and Ulf.

I am a newbie in this field, so I am looking for a parser with good documentation that I can learn now and keep using in the future without problems.
Could you suggest a decent parser with good documentation?

Thanks much!
Nilesh
 
Ulf Dittmer
Master Rancher
Posts: 43045
Good documentation is not always available for open source projects - you get what you pay for. I suggest investigating the libraries I mentioned and seeing whether you run into any problems.
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Hi, thanks. I've run into a problem.

I am stuck at the point where I have to retrieve the text between particular tags:
<div ... > I want to be retrieved </div>
<a> I want to be retrieved </a>

Any suggestions? I am currently using the Jericho parser.
 
David Newton
Author
Posts: 12617
I'd suggest looking at its documentation and examples; there are *many* examples of doing precisely what you're trying to do.

http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Element.html
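
For readers without Jericho on the classpath, the same idea (collect the text between a start tag and its matching end tag) can be roughly sketched with the callback-based HTML parser that ships with the JDK's Swing classes. This is a swapped-in technique, not Jericho's API, and the JDK parser only understands fairly old HTML, so treat it as an illustration rather than a recommendation. The names TagTextSketch and textInside are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the JDK's own HTML parser (not Jericho).
final class TagTextSketch {

    // Collects the text found inside each outermost occurrence of the wanted tag.
    static List<String> textInside(String html, HTML.Tag wanted) throws IOException {
        List<String> results = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;                       // nesting level of the wanted tag
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag.equals(wanted)) {
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data).append(' ');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag.equals(wanted) && depth > 0 && --depth == 0) {
                    results.add(current.toString().trim());
                    current.setLength(0);
                }
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return results;
    }
}
```

A call such as `TagTextSketch.textInside(html, HTML.Tag.DIV)` would then return one string per outermost div; passing `HTML.Tag.A` does the same for links.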
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
Thanks a ton David!!! Was able to finish off the task!! Appreciate the help
 
David Newton
Author
Posts: 12617
Great--glad you got it working :)
 
Nilesh Vijaywargiay
Greenhorn
Posts: 7
David, one more question.

The results I am getting through the program are not always in the same order as they appear on the website. This phenomenon does not happen every time, but I am wondering why.

Is it because the query is being fired from a program rather than a browser? I tried setting the currentCompatibilityConfig of the Jericho parser to MOZILLA and to IE, but the results did not change. I have read that you have to set a user agent, but I couldn't find a way to do that in the Jericho parser. Any suggestions?

 
Marshal
Posts: 77263
No longer a "beginning" topic. Moving thread.
 
David Newton
Author
Posts: 12617
I have no idea--I've never used it. Personally, I'd get the HTML source using something else that would allow me to set the user agent, handle cookies, etc.

As for which search results are returned, that depends entirely on the website you're trying to scrape (please make sure you're not violating anybody's terms of service). It could depend on the user agent, cookies, previous searches, moon phase... anything.
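
As one possible way of fetching the HTML separately from parsing it, here is a minimal sketch using the java.net.http client available since Java 11. It sets a User-Agent header and keeps cookies between requests. The class name FetchSketch, its method names, and whatever URL and user-agent string are passed in are assumptions for illustration; whether a given site responds differently to them is entirely up to that site.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: fetch the page yourself, then hand the HTML to any parser.
final class FetchSketch {

    // A client that remembers cookies across requests and follows normal redirects.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // A GET request that carries the given User-Agent header.
    static HttpRequest requestFor(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    static String fetch(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The returned string can be passed to whichever parser is in use; for Jericho that presumably means constructing its source object from the text instead of from a URL.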
 