JavaRanch » Java Forums » Java » Sockets and Internet Protocols

URL Harvester

Rosie Nelson
Ranch Hand

Joined: Nov 06, 2001
Posts: 31
I'm trying to create a Java application which, given a URL, will copy to file all the .htm pages of a particular web site. What is the best strategy for collecting all the URLs for the various files on a web site? Is there a class that provides a collection of, for example, all the URLs in a given web site, or does one have to start with the home page, search it for all http references, and locate the URLs that way?
If anybody could shed any light on this matter I would be very appreciative.
gigel chiazna

Joined: Jul 09, 2001
Posts: 17
First, you should know there are already some free Java crawlers out there that you could use and customize.
I also developed a "downloader" years ago, when I didn't have Internet access and Teleport was not an option.
You should start by thinking a design through thoroughly, as this is not as simple as it looks; a spider has many aspects.
Here are some thoughts:
  • use multiple download threads and have a manager for them
  • keep a reference table to track the status of each file (downloaded, downloading, parsing, etc.)
  • rewrite the links between pages once all the downloads have finished

  • As for the URLs problem, there's no class that gives you all the links in a page, but you could use regular expressions. Also note that new URL(host, any_file) gives you a correct absolute URL, whether the file is relative to the host or is an outside URL.
    Also, if you want a challenge (and a feature that I don't know of any spider offering), figure out links that are built using JavaScript.
    [ January 23, 2002: Message edited by: gigel chiazna ]
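
    The "URLs problem" above can be sketched roughly like this: pull href values out with a regular expression and resolve each one against the page's URL with the two-argument URL constructor, which handles both relative and absolute links. This is a minimal sketch, not a real HTML parser; the class name, base address, and HTML snippet are made-up examples.

    ```java
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkExtractor {
        // Naive href matcher; real HTML needs a proper parser, but this
        // covers the common case of double-quoted href attributes.
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        // Returns every link in 'html', resolved to an absolute URL against 'base'.
        public static List<String> extractLinks(URL base, String html) throws Exception {
            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                // new URL(base, spec) resolves relative paths and passes
                // absolute URLs through unchanged.
                links.add(new URL(base, m.group(1)).toString());
            }
            return links;
        }

        public static void main(String[] args) throws Exception {
            URL base = new URL("http://example.com/docs/index.html");
            String html = "<a href=\"page2.htm\">next</a> "
                        + "<a href=\"http://other.org/x.htm\">out</a>";
            for (String link : extractLinks(base, html)) {
                System.out.println(link);
            }
        }
    }
    ```

    Here the relative link resolves to http://example.com/docs/page2.htm while the absolute one passes through untouched, which is exactly the property gigel describes.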

    Muralidhar Krishnamoorthy

    Joined: Mar 18, 2001
    Posts: 13
    Can anyone post the code for downloading an HTML file from the web using Java, just by giving the URL?
    Thomas Paul
    mister krabs
    Ranch Hand

    Joined: May 05, 2000
    Posts: 13974
    import java.io.*;
    import java.net.URL;

    URL url = new URL("");  // put the address of the page here
    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
    String input;
    while ((input = br.readLine()) != null)
        System.out.println(input);

    Associate Instructor - Hofstra University
    Amazon Top 750 reviewer - Blog - Unresolved References - Book Review Blog
    Muralidhar Krishnamoorthy

    Joined: Mar 18, 2001
    Posts: 13
    Thank you very much. But can I transfer the file as raw bytes, like in FTP, instead of reading it through a BufferedReader?
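
    For reference, the same openStream() can be copied byte-for-byte (FTP-style, with no character decoding, so it works for images and other binaries too) by writing raw bytes to a FileOutputStream instead of going through a Reader. A minimal sketch; the class name and output file name are only examples.

    ```java
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    public class RawDownload {
        // Copies every byte from 'in' to 'out' and returns the byte count.
        public static long copy(InputStream in, OutputStream out) throws IOException {
            byte[] buf = new byte[4096];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
            return total;
        }

        public static void main(String[] args) throws IOException {
            URL url = new URL(args[0]);  // page to fetch, given on the command line
            try (InputStream in = url.openStream();
                 OutputStream out = new FileOutputStream("page.htm")) {
                copy(in, out);
            }
        }
    }
    ```

    Because the copy loop only sees bytes, the downloaded file is identical to what the server sent, with no line-ending or charset translation.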
