
JSP Crawler

Shashank Agarwal
Ranch Hand

Joined: May 20, 2004
Posts: 105
Hey everyone. I'm trying to build a crawler, the search-engine type, that crawls web pages and creates reports from them. The crawler will be a JSP. The problem is how to make it follow a link. If the link is href="http://www.javaranch.com", that's fine: I can take the substring between the two double quotes. However, if the link points to an internal page, most pages write it as href="page2.html" or href="../page2.html". In that case, how do I make the crawler go to page2.html?

I hope I'm able to put across my problem.

Thanks in advance.
Pritam Barhate
Greenhorn

Joined: Nov 25, 2004
Posts: 15
See the article "Create intelligent Web spiders" at JavaWorld.
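
For the relative-link part of the question, here is a minimal sketch, assuming the crawler remembers the URL of the page each href was found on. It uses java.net.URI.resolve from the standard library. The class name LinkResolver and the base URL http://www.javaranch.com/docs/index.html are made up for illustration and do not come from the article.

```java
import java.net.URI;

public class LinkResolver {

    // Resolve an href value against the URL of the page it was found on.
    // Absolute hrefs come back unchanged; relative ones are resolved
    // against the page's directory. URI.create throws
    // IllegalArgumentException if either string is not a valid URI
    // (for example, an href containing unescaped spaces).
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, used only for this example.
        String page = "http://www.javaranch.com/docs/index.html";
        System.out.println(resolve(page, "page2.html"));
        System.out.println(resolve(page, "../page2.html"));
        System.out.println(resolve(page, "http://www.javaranch.com"));
    }
}
```

The crawler can then fetch the resolved string the same way it fetches any absolute URL.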


Pritam Barhate
A magic combination of Code & Fire: codefire (http://www.jroller.org/page/codefire/Weblog)
-----------------------------------
My Open Source Projects:
AceMDI (https://acemdi.dev.java.net/): An easy, yet powerful MDI framework that manages windows as tabs.
Sonny Gill
Ranch Hand

Joined: Feb 02, 2002
Posts: 1211

And there is a chapter (or two) on it in the book The Art of Java.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12769
    
There is no real reason to make this a JSP, since you are going to end up with huge amounts of computation and data. Why not work on the guts of the crawler as a stand-alone application until you get it working right?

Trying to do it in JSP code will just be confusing and hard to debug.
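
As a rough sketch of what a stand-alone starting point could look like, here is a plain class with a main method that pulls href values out of a page's HTML and resolves them, so it can be run and debugged without a servlet container. Everything in it is an assumption for illustration: the class name LinkExtractor, the example.com URL, and the regex, which only handles double-quoted hrefs as described in the question. A real crawler would probably want a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches href="..." with double quotes only (a simplification).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href found in the HTML as an absolute URL,
    // resolved against the URL of the page the HTML came from.
    public static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // The href is not a valid URI (e.g. contains spaces); skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hard-coded sample page so the logic can be tested without a network.
        String html = "<a href=\"page2.html\">two</a> <a href=\"../up.html\">up</a>";
        for (String link : extractLinks("http://example.com/a/b/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

Once this works from the command line, a JSP or servlet only needs to call extractLinks and display the results.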
Bill
 