
scanning the webpage

Harbir Singh

Joined: May 11, 2005
Posts: 1
I have to write a program that is given the URL of a web page, scans the page for all the links to other pages, and prints those links. I am new to this, so if someone can give me an idea of how to handle it, that would be great.
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Should I reply here or over here?

I'd start with something like Apache Commons HttpClient to get the page. After that you'll likely want to use some regular expressions to look for "<a" and an href. This part isn't trivial - there are very few well-written web pages out there. And you will still miss anchors that get defined in JavaScript.
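The regular-expression idea could be sketched roughly as below. This is only an illustration: the class name LinkRegex, the method extractLinks, and the sample HTML are made up for the example, and the pattern only handles quoted href values inside an <a> tag. Unquoted values and links built in JavaScript are still missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
class LinkRegex {
    // Matches an <a ...> tag and captures a single- or double-quoted href value.
    // Group 1 is the quote character, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values in the order they appear in the page text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input, invented for the example.
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Relative links such as /b come back exactly as written, so they would still need to be resolved against the page's URL before printing absolute addresses.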

Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 15092

You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page. You can use class for that:

If you want to parse HTML, you could use a library like this one:

Pradhip Prakash

Joined: Oct 13, 2005
Posts: 3
This program asks for a URL. Enter a valid URL and press Enter.


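import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class URLRead {
    public static void main(String[] args) throws Exception {
        // Read the URL to fetch from standard input.
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter Your URL :");
        String s = br.readLine();

        URL url = new URL(s);
        // Open the page and print it line by line; the reader is closed automatically.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}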

Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Once you have the page as a String, to dig out the links it would be good to have a solid HTML parser. I like the Quiotix Parser because it has a neat Visitor interface.
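To show what the parser route looks like, here is a minimal sketch. It does not use the Quiotix parser. It swaps in the callback-style HTML parser that ships with the JDK (javax.swing.text.html), which follows a similar idea: the parser walks the page and calls you back for each tag. The class name LinkParser and the method extractLinks are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named only for this sketch.
class LinkParser {
    // Parses the page text and collects the href of every <a> start tag.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input, invented for the example.
        String html = "<html><body><a href=\"http://example.com/a\">A</a>"
                + "<p>text</p><a href='/b'>B</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Compared with a regular expression, the parser copes with attribute order, quoting styles and sloppy markup on its own, so the callback only has to decide which tags it cares about.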

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Scott Dunbar
Ranch Hand

Joined: Sep 23, 2004
Posts: 245
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases - SSL, cookies, Basic and form-based auth, etc. Harbir - a lot of it depends on your requirements.