
URL Harvester

Rosie Nelson
Ranch Hand

Joined: Nov 06, 2001
Posts: 31
I'm trying to create a Java application which, given a URL, will copy to file all the .htm pages of a particular web site. What is the best strategy for collecting all the URLs for the various files on a web site? Is there a class that provides a collection of, e.g., all the URLs in a given web site, or does one have to start with the home page, search through it for all HTTP references, and locate the URLs that way?
If anybody could shed any light on this matter I would be very appreciative.
Roz
gigel chiazna
Greenhorn

Joined: Jul 09, 2001
Posts: 17
First, you should know there are already some free Java crawlers out there that you could use and customize.
I also developed a "downloader" years ago, when I didn't have Internet access and Teleport was not an option.
You should start by thinking the design through thoroughly, as this is not as simple as it looks; a spider has many aspects.
Here are some thoughts:
  • spawn several download threads and have a manager coordinating them
  • keep a reference table to track the status of each file (downloaded, downloading, parsing, etc.)
  • rebuild the links between the saved pages once all the downloads have finished
  • As for the URLs problem, there is no class that gives you all the links in a page, but you can use regular expressions (see the sketch after this post). Also note that new URL(host, any_file) gives you a correct absolute URL, whether the file is relative to the host or is an outside URL.
    Also, if you want a challenge, and a feature I don't know of any spider that offers: figure out links that are built using JavaScript.
[ January 23, 2002: Message edited by: gigel chiazna ]
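
A minimal, single-threaded sketch of the ideas above: a queue plus a "seen" set standing in for the reference table, and a regular expression pulling href attributes out of each page. The start URL, the cap of 50 discovered URLs, and the href pattern are placeholders, and a real spider would run several of these fetch loops as worker threads under a manager rather than one loop in main.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {

    // Naive href pattern; real HTML deserves a proper parser, but this
    // is enough to demonstrate the idea.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URL start = new URL("http://java.sun.com/"); // placeholder start page
        LinkedList<URL> queue = new LinkedList<URL>();
        Set<String> seen = new HashSet<String>();    // the "reference table"
        queue.add(start);
        seen.add(start.toExternalForm());

        while (!queue.isEmpty() && seen.size() < 50) { // arbitrary cap on discovered URLs
            URL page = queue.removeFirst();
            StringBuffer html = new StringBuffer();
            try {
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(page.openStream()));
                String line;
                while ((line = br.readLine()) != null)
                    html.append(line).append('\n');
                br.close();
            } catch (Exception e) {
                continue; // dead link; a real spider would record the failure
            }
            System.out.println("fetched " + page);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // new URL(context, spec) resolves a relative link against
                    // the page it came from and leaves absolute URLs alone.
                    URL link = new URL(page, m.group(1));
                    if ("http".equals(link.getProtocol())
                            && start.getHost().equals(link.getHost())
                            && seen.add(link.toExternalForm()))
                        queue.add(link);
                } catch (Exception e) {
                    // unsupported scheme (javascript:, mailto:) -- skip it
                }
            }
        }
        // Several of these loops could run as worker threads sharing the
        // queue and the seen set, as suggested in the list above.
    }
}

Using a FIFO queue gives a breadth-first crawl; swapping in a stack would make it depth-first, which matters if you cap the number of pages fetched.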

Muralidhar Krishnamoorthy
Greenhorn

Joined: Mar 18, 2001
Posts: 13
Hi
Can anyone post the code for downloading an HTML file from the web using Java, just by giving the URL?
Thanks
Murali
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://java.sun.com");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String input;
while ((input = br.readLine()) != null)
    System.out.println(input);
br.close();


Muralidhar Krishnamoorthy
Greenhorn

Joined: Mar 18, 2001
Posts: 13
Thank you very much. But can I transfer the file directly, as with FTP, instead of reading it through a BufferedReader, etc.?

Cheers
Murali
muralidharck@yahoo.com
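
Not over FTP itself, but you can get the same effect by skipping the Reader and copying the raw bytes from the URL stream straight to a file. A minimal sketch, still using java.net.URL as the transport; the URL and the output filename here are just placeholders:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class RawDownload {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and filename; substitute your own.
        URL url = new URL("http://java.sun.com/index.html");
        InputStream in = url.openStream();
        OutputStream out = new FileOutputStream("index.html");

        // Copy the raw bytes straight to disk. No Reader means no
        // character decoding, so binary files (images, archives)
        // come through unchanged as well.
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1)
            out.write(buffer, 0, n);

        out.close();
        in.close();
    }
}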
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: URL Harvester
     
    Similar Threads
    use of google appliance with websphere portalV5.1
    is site available or not ?
    a jsp design question...
    URL rewriting issue
    how to find web service from UDDI ?