JSP Crawler

Shashank Agarwal
Ranch Hand

Joined: May 20, 2004
Posts: 105
Hey everyone. I am trying to build a crawler, the search-engine type, to crawl web pages and create reports from them. The crawler will be a JSP. The problem is how to make it follow a link. Say the link is href="" with a full URL; then it's OK, and I can take the substring between the two double quotes. However, if the link is to an internal page, most pages write it as href="page2.html" or "../page2.html". In that case, how do I make the crawler go to page2.html?

I hope I'm able to put across my problem.

Thanks in advance.
Pritam Barhate

Joined: Nov 25, 2004
Posts: 15
See the article "Create intelligent Web spiders" at JavaWorld.

Pritam Barhate
A magic combination of Code & Fire: codefire
-----------------------------------
My Open Source Projects:
AceMDI: An easy yet powerful MDI framework that manages windows as tabs.
Sonny Gill
Ranch Hand

Joined: Feb 02, 2002
Posts: 1211

There is also a chapter (or two) on it in the book The Art of Java.
William Brogden
Author and all-around good cowpoke

Joined: Mar 22, 2000
Posts: 13027
There is no real reason to make this a JSP, since you are going to end up with huge amounts of computation and data. Why not work on the guts of the crawler as a stand-alone application until you get it working right?

Trying to do it in JSP code will just be confusing and hard to debug.
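
For the original question about links like href="page2.html" or "../page2.html", one possible approach is to resolve each href against the URL of the page it was found on. The sketch below does this in a plain stand-alone class, in line with the advice above, using java.net.URI from the standard library. The class name LinkResolver, the method extractLinks, and the example.com URLs are hypothetical names chosen for illustration, and the regular expression only handles double-quoted href values, so treat it as a starting point rather than a complete HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class LinkResolver {

    // Naive pattern for double-quoted href values, mirroring the
    // "substring between the two double quotes" idea from the question.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the absolute http/https URLs linked from the given page.
    // pageUrl is the URL the HTML was fetched from; it should include a path.
    static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            if (href.isEmpty()) {
                continue; // nothing to follow
            }
            try {
                // Relative references ("page2.html", "../page2.html") are
                // resolved against the base; absolute URLs pass through.
                URI resolved = base.resolve(href).normalize();
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href (for example, one containing spaces): skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a> <a href=\"../page3.html\">up</a>";
        for (String link : extractLinks("http://example.com/a/b/index.html", html)) {
            System.out.println(link);
        }
        // prints:
        // http://example.com/a/b/page2.html
        // http://example.com/a/page3.html
    }
}
```

The crawler can then fetch each returned URL and repeat the process, keeping a set of already visited URLs so it does not loop. Links such as mailto: or javascript: are dropped by the scheme check.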