
Web Crawler in Java

Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

I have a task that requires scanning the HTML source of web pages and extracting information that matches certain patterns. The extracted information will be saved in a database for business purposes. The amount of data being extracted is not large.
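To make the task concrete, here is a rough sketch of pattern-based extraction from an HTML string, using only java.util.regex from the JDK. The class name PatternExtractor, the method extract, and the sample href pattern are made up for illustration and are not from any crawler library. Regular expressions only hold up for simple, predictable markup; a real HTML parser is usually the safer choice for anything more complex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PatternExtractor {

    // Returns capture group 1 of every match of `regex` in `html`.
    // Assumes the regex contains at least one capture group.
    public static List<String> extract(String html, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample input; in practice this would be the downloaded page source.
        String html = "<a href=\"/a.html\">A</a> <a href=\"/b.html\">B</a>";
        // Collect the value of every double-quoted href attribute.
        List<String> links = extract(html, "href=\"([^\"]*)\"");
        System.out.println(links);
    }
}
```

The returned list could then be written to the database with plain JDBC or whatever persistence layer is already in use.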

I am looking for a suitable web crawler written in Java.
Does anyone have suggestions or other input to share?

Any kind of help is highly appreciated.

My Blog SCJP 5 SCWCD 5
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
You can find crawlers at www.java-source.net. For extracting information, use a library like HtmlUnit.
Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

Thanks for the reply, Lester. I have selected a few crawler projects and decided to try them out by building a small prototype with each.
Michael Ergotron
Greenhorn

Joined: Nov 19, 2010
Posts: 5
I use httpclient for similar purposes...
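As a rough idea of what fetching a page over HTTP looks like, here is a sketch that uses the java.net.http.HttpClient built into the JDK since Java 11. This is swapped in for illustration and is not necessarily the httpclient library meant above. The class name PageFetcher, the User-Agent string, and the 10-second timeout are assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, for illustration only.
class PageFetcher {

    // Builds a GET request for the given URL (no network access happens here).
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))              // assumed timeout
                .header("User-Agent", "example-crawler/0.1")  // assumed agent string
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The returned page source can then be passed to whatever extraction step you use.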

Another option is to call the wget command from Java. It works fine when you need HTTPS.
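Calling wget from Java could look roughly like the sketch below, which uses ProcessBuilder from the JDK. It assumes wget is installed and on the PATH. The class name WgetFetcher and the output file name are made up for illustration. The -q (quiet) and -O (output file) options are standard wget flags.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, for illustration only.
class WgetFetcher {

    // Builds the argument list: wget -q -O <outputFile> <url>
    public static List<String> buildCommand(String url, String outputFile) {
        return List.of("wget", "-q", "-O", outputFile, url);
    }

    // Runs wget and returns its exit code (0 means success).
    // Throws IOException if the wget binary cannot be started.
    public static int fetch(String url, String outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, outputFile))
                .inheritIO() // forward wget's output and errors to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java WgetFetcher <url> <output-file>");
            return;
        }
        int exitCode = fetch(args[0], args[1]);
        System.out.println("wget exit code: " + exitCode);
    }
}
```

Passing the arguments as a list instead of one command string avoids shell quoting problems when a URL contains characters such as & or ?.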


Vlada Stankovic
Greenhorn

Joined: Feb 18, 2010
Posts: 14

It's better to use Perl for this kind of task. You have the Mechanize or LWP libraries on CPAN. HtmlUnit is too slow.
David Davidov
Greenhorn

Joined: Dec 09, 2011
Posts: 1
I think you can try PHP examples first, which you can find here:

http://broobee.com/folder/12/crawlers-applications
 