JavaRanch » Java Forums » Java » Beginning Java

scanning the webpage

Harbir Singh

Joined: May 11, 2005
Posts: 1
I have to write a program that will be supplied with the URL of a web page. It should then scan the page, find all the links to other pages, and print those links. I am new to this, so if someone can give me an idea of how to handle this, it would be great.
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Should I reply here or over here?

I'd start with something like Apache Commons HttpClient to fetch the page. After that you'll likely want to use regular expressions to look for "<a" tags and their href attributes. This part isn't trivial - there are very few well-written web pages out there. And you will still miss anchors that get defined in JavaScript.
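To make the regex idea concrete, here's a minimal sketch using java.util.regex. The class name and pattern are my own for illustration, not from this thread, and the pattern only handles quoted href values - as noted above, real-world HTML will defeat any simple regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // A rough sketch only: malformed or unquoted attributes will slip through.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com\">one</a> "
                + "<A HREF='/page2.html'>two</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

This finds both links despite the differing case and quote styles, which is about as far as a regex approach comfortably goes.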

Java forums using Java software - Come and help get them started.
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14991

You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page. You can use the java.net.URL class for that.

If you want to parse HTML, you could use an HTML parser library.

Java Beginners FAQ - JavaRanch SCJP FAQ - The Java Tutorial - Java SE 8 API documentation
Pradhip Prakash

Joined: Oct 13, 2005
Posts: 3
This program asks for a URL. Enter a valid URL and press Enter, and it prints the page source.


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class URLRead {
    public static void main(String[] args) throws Exception {
        // Read the URL from standard input
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter Your URL :");
        String s = br.readLine();

        // Open a stream to the page and print its source line by line
        URL url = new URL(s);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}

Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Once you have the page as a String, to dig out the links it would be good to have a solid HTML parser. I like the Quiotix Parser because it has a neat Visitor interface.
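The Quiotix parser's Visitor API isn't shown in this thread, but the JDK itself ships a callback-style HTML parser in javax.swing.text.html that works in a similar spirit. Here's a sketch (class name mine) that collects the href of every anchor tag as the parser walks the document:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class AnchorVisitor {
    public static List<String> findLinks(Reader pageReader) throws Exception {
        final List<String> links = new ArrayList<String>();
        // The callback is invoked once per tag as the parser walks the page
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // Third argument: ignore any charset directives in the document
        new ParserDelegator().parse(pageReader, callback, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"http://example.com\">link</a></body></html>";
        System.out.println(findLinks(new StringReader(html)));
    }
}
```

Unlike the regex approach, a real parser copes with attribute order, case, and whitespace for you; a dedicated library like Quiotix will still handle messy real-world markup better than the Swing parser does.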

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases - SSL, cookies, Basic and form-based auth, etc. Harbir - a lot of it depends on your requirements.