Web Crawler in Java

Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

I have a task in which I have to scan the HTML source of web pages and extract some information based on a pattern. The extracted information will be saved in a database for business purposes. The amount of data being extracted is not large.

I am searching for an appropriate web crawler written in Java.
Does anyone have any suggestions or other input to share?

Any kind of help is highly appreciated.

Lester Burnham

Joined: Oct 14, 2008
Posts: 1337
You can find crawlers at [link not preserved]. For extracting information, use a library like HtmlUnit.
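For the pattern-based extraction described in the question, here is a minimal sketch that uses only the JDK's regex classes rather than HtmlUnit. The class name `LinkExtractor`, the method `extractLinks`, and the choice of `href` values as the thing to extract are all assumptions made for illustration. Regular expressions are fragile against real-world HTML, so treat this as a starting point for a small prototype, not a replacement for a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href values out of anchor tags in an HTML string.
class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input invented for the demo.
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

Swapping in a different pattern (for example, one that matches a price or a product code) only requires changing the `Pattern` and the group that is collected.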
Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

Thanks for the reply, Lester. I have selected a few crawler projects and have decided to try them out by building a small prototype with each.
Michael Ergotron

Joined: Nov 19, 2010
Posts: 5
I use HttpClient for similar purposes.

Another option is to call the wget command from Java. It works fine with HTTPS sites.
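Calling wget from Java could look roughly like the sketch below, which uses the JDK's `ProcessBuilder`. It assumes `wget` is installed and on the `PATH`. The class name `WgetRunner` and its methods are hypothetical names chosen for the example. The `-q` (quiet) and `-O` (output file) options are standard wget flags.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the external wget command.
class WgetRunner {
    // Builds the argument list: wget -q -O <outputFile> <url>
    static List<String> buildCommand(String url, Path outputFile) {
        return List.of("wget", "-q", "-O", outputFile.toString(), url);
    }

    // Runs wget and returns its exit code (0 means success).
    // Assumes wget is available on the PATH.
    static int download(String url, Path outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, outputFile))
                .inheritIO() // show wget's own error output on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: WgetRunner <url> <output-file>");
            return;
        }
        int exit = download(args[0], Path.of(args[1]));
        System.out.println("wget exited with code " + exit);
    }
}
```

Passing the arguments as a list, rather than as one concatenated command string, avoids shell quoting problems when a URL contains characters such as `&` or `?`.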

Vlada Stankovic

Joined: Feb 18, 2010
Posts: 14

It's better to use Perl for this kind of task. You have the Mechanize and LWP libraries on CPAN. HtmlUnit is too slow.
David Davidov

Joined: Dec 09, 2011
Posts: 1
I think you can try PHP examples first, which you can find here: