I am working on parsing a web page using Java. I have searched the web and read about the various parsers: HTML Parser, JTidy, Jericho, etc.
I am confused about which parser to use.
Basically, I have to parse a page, for example eBay, and then retrieve the results for a given query. For example, if the query is "laptop", I want to be able to retrieve the various results returned by the server.
Do I have to use a third-party API, or does Java provide something that would be handy for my problem?
Your best bet is a parser; which to use is pretty much up to you.
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35254
The easiest approach is probably to use one of the parsers that create valid XML (like TagSoup) and then to treat the problem as an XML processing issue. That way you can use XPath or XQuery. You may also want to check out HtmlUnit, which handles the HTML retrieval as well as the HTML parsing.
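To illustrate the "treat it as an XML processing issue" step, here is a minimal sketch using only the XPath API that ships with Java. It assumes the page has already been converted to well-formed, namespace-free XML; the sample markup, the class name XPathResultDemo and the "item" class attribute are made up for illustration. Real converter output may place elements in the XHTML namespace, in which case the expressions would need a namespace context.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathResultDemo {

    // Hypothetical, already well-formed markup standing in for a converted result page.
    static final String SAMPLE =
          "<html><body>"
        + "<div class=\"item\"><a href=\"/item/1\">Laptop A</a></div>"
        + "<div class=\"item\"><a href=\"/item/2\">Laptop B</a></div>"
        + "<div class=\"ad\">Sponsored</div>"
        + "</body></html>";

    /** Evaluates an XPath expression against the XML text and returns the text of each matching node. */
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Link text of every result, then the link targets.
        System.out.println(select(SAMPLE, "//div[@class='item']/a"));
        System.out.println(select(SAMPLE, "//div[@class='item']/a/@href"));
    }
}
```

The expressions are the only part tied to a particular site, so they would have to be adapted to whatever markup the real page uses.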
I am a newbie in this field, so I was looking for a parser with good documentation so that I can get the hang of it and use it in the future without any problems.
Could you suggest a decent parser with good documentation?
Thanks much!
Nilesh
Ulf Dittmer
Good documentation is not always available for open source projects - you get what you pay for. I suggest investigating the libraries I mentioned and seeing whether you run into any problems.
Nilesh Vijaywargiay
Greenhorn
Joined: Mar 27, 2010
Posts: 7
Hi, thanks. I've run into a problem.
I am stuck at a point where I have to retrieve the text between particular tags:
<div ... > I want to be retrieved </div>
<a> I want to be retrieved </a>
Any suggestions? I am currently using the Jericho parser.
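For comparison, here is a rough sketch of pulling out the text between a given pair of tags using only the HTML parser bundled with the JDK (javax.swing.text.html), not Jericho. The class name TagTextExtractor and the sample markup are made up for illustration, and the bundled parser is old and fairly lenient, so treat this as a starting point rather than a robust solution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagTextExtractor {

    // Hypothetical markup mirroring the two cases in the question.
    static final String SAMPLE =
          "<html><body>"
        + "<div class=\"title\"> I want to be retrieved </div>"
        + "<a href=\"/item/1\"> Me too </a>"
        + "</body></html>";

    /** Returns the trimmed text inside each outermost occurrence of the wanted tag. */
    static List<String> textInside(String html, HTML.Tag wanted) {
        List<String> results = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;                          // nesting level of the wanted tag
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag.equals(wanted) && depth++ == 0) {
                    buffer.setLength(0);                    // start collecting for a new outermost element
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    buffer.append(data);                    // only keep text seen inside the wanted tag
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag.equals(wanted) && depth > 0 && --depth == 0) {
                    results.add(buffer.toString().trim());
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(textInside(SAMPLE, HTML.Tag.DIV));
        System.out.println(textInside(SAMPLE, HTML.Tag.A));
    }
}
```

The same idea (remember when the tag opens, collect text until it closes) carries over to most event-style parsers; tree-style parsers usually let you ask an element for its text directly.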
The results I am getting through the program are not in exactly the same order as they appear on the website. This does not happen consistently, but I am wondering why.
Is it because the query is being fired from a program rather than a browser? I tried setting the currentCompatibilityConfig of the Jericho parser to MOZILLA and IE, but the results didn't change. I have read that you have to set a user agent? I couldn't find a way to do that in the Jericho parser. Any suggestions?
I have no idea; I've never used it. Personally, I'd get the HTML source using something else that would allow me to set the user agent, handle cookies, etc.
As for which search results are returned, that depends entirely on the website you're trying to scrape (please make sure you're not violating anybody's terms of service). It could depend on the user agent, cookies, previous searches, moon phase... anything.
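As a sketch of the "get the HTML source using something else" idea, the JDK's own HttpURLConnection lets you set the User-Agent header, and a default CookieManager takes care of cookies. The class name UserAgentFetch, the user-agent string and the example URL below are placeholders, and the code assumes the page is UTF-8 encoded. The returned string could then be handed to whichever parser you are using.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {

    // Placeholder value; substitute whatever user agent you want the site to see.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    /** Installs a JVM-wide cookie store so cookies set by the site are sent back on later requests. */
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager());
    }

    /** Builds a connection with the User-Agent header set; nothing is sent over the network yet. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Performs the request and returns the body as text (assumes UTF-8). */
    static String fetch(String url, String userAgent) {
        HttpURLConnection conn = prepare(url, userAgent);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        enableCookies();
        // Only prepares the request; call fetch(...) to actually download the page.
        HttpURLConnection conn = prepare("https://www.example.com/search?q=laptop", USER_AGENT);
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Even with a browser-like user agent and cookies, the site may still return results in a different order than a browser sees, for the reasons given above.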