JavaRanch » Java Forums » Java » Beginning Java

scanning the webpage

Harbir Singh
Greenhorn

Joined: May 11, 2005
Posts: 1
I have to write a program that is given the URL of a web page; it should then scan the page, collect all the links to other pages, and print them. I am new to this, so if someone could give me an idea of how to handle it, that would be great.
Thanks
Regards
Harbir
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Should I reply here or over here?

I'd start with something like Apache Commons HttpClient to fetch the page. After that you'll likely want to use regular expressions to look for "<a" tags and their href attributes. This part isn't trivial: well-written web pages are rare, and you will still miss links that are built in JavaScript.
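To make the regex idea concrete, here is a rough sketch of pulling href values out of an HTML string. The class and method names are mine, and the pattern is deliberately simple; as noted above, a regex will not cope with every page (unquoted attributes, nested quotes, links built in JavaScript), so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/\">one</a>"
                    + " <A HREF='page.html'>two</A></p>";
        System.out.println(extractLinks(html));
        // prints [http://example.com/, page.html]
    }
}
```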


<a href="http://forums.hotjoe.com/forums/list.page" target="_blank" rel="nofollow">Java forums using Java software</a> - Come and help get them started.
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14278
    
You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page; class java.net.URL can do that.
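A minimal sketch of that approach, using only java.net.URL and standard I/O classes (the class name PageReader is mine):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class PageReader {

    // Fetch the contents of a URL and return them as a String.
    public static String fetch(String address) throws IOException {
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        try {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } finally {
            in.close(); // always release the connection
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("http://www.example.com/"));
    }
}
```

Note that this only covers plain HTTP; for SSL, cookies, or authentication you are back to something like HttpClient.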

If you want to parse HTML, you could use a library like this one: http://htmlparser.sourceforge.net/
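If pulling in a third-party library is not an option, the JDK itself ships a lenient HTML parser in javax.swing.text.html. A sketch (the class name JdkLinkParser is mine) that collects the href of every anchor tag:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class JdkLinkParser {

    public static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<String>();
        // The callback is invoked once per tag as the parser walks the page.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"http://example.com/\">hi</a></body></html>";
        System.out.println(extractLinks(new StringReader(html)));
    }
}
```

Being a real parser rather than a regex, this copes better with sloppy markup, though a dedicated library will still handle more of the badly broken pages you find in the wild.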


Java Beginners FAQ - JavaRanch SCJP FAQ - The Java Tutorial - Java SE 8 API documentation
Pradhip Prakash
Greenhorn

Joined: Oct 13, 2005
Posts: 3
This program will ask for a URL. Enter a valid URL and press Enter.

import java.net.*;
import java.io.*;

public class URLRead
{
    public static void main(String[] args) throws Exception
    {
        // Read the URL from standard input
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter Your URL :");
        String s = br.readLine();

        // Open a stream to the page and echo it line by line
        URL url = new URL(s);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
}
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Once you have the page as a String, a solid HTML parser is the best way to dig out the links. I like the Quiotix parser because it has a neat Visitor interface.


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases: SSL, cookies, Basic and form-based auth, etc. Harbir, a lot of it depends on your requirements.
 