
scanning the webpage

 
Harbir Singh
Greenhorn
Posts: 1
I have to write a program that will be supplied with the URL of a webpage. It should then scan the page, find all the links to other pages, and print those links. I am new to this, so if someone can give me an idea of how to handle it, that would be great.
Thanks
Regards
Harbir
 
Scott Dunbar
Ranch Hand
Posts: 245
I'd start with something like Apache Commons HttpClient to get the page. After that you'll likely want to use some regular expressions to look for "<a" tags and their href attributes. This part isn't trivial - there are very few well-written web pages out there. And you will still miss anchors that get defined in JavaScript.
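A rough sketch of the regex approach Scott describes might look like the following. The class name and pattern here are just illustrative, and as he warns, a regex will miss malformed markup and anything generated by JavaScript:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: pull href values out of <a> tags with a regex.
// Real-world HTML is messy, so expect this to miss some links.
public class LinkExtractor {

    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://example.com/\">Example</a>"
                + "<a class='x' href='/relative/page.html'>Relative</a>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```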
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15203
You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page. You can use class java.net.URL for that:
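A minimal sketch of the java.net.URL approach, assuming a plain HTTP page; the example URL is just a placeholder:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Minimal sketch: read a web page with java.net.URL, no extra libraries.
public class ReadPage {

    // Reads everything the URL's stream returns and joins it into one String
    public static String readUrl(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the page you actually want to scan
        System.out.println(readUrl(new URL("http://www.example.com/")));
    }
}
```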

If you want to parse HTML, you could use a library like this one: http://htmlparser.sourceforge.net/
 
Pradhip Prakash
Greenhorn
Posts: 3
This program will ask for a URL. Enter a valid URL and press Enter.

import java.net.*;
import java.io.*;

public class URLRead
{
    public static void main(String[] args) throws Exception
    {
        // Read the URL from standard input
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter Your URL :");
        String s = br.readLine();

        // Open a stream to the URL and echo the page line by line
        URL url = new URL(s);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
}
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Once you have the page as a String, to dig out the links it would be good to have a solid HTML parser. I like the Quiotix Parser because it has a neat Visitor interface.
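To illustrate the Visitor idea Stan mentions (this is a toy sketch, not Quiotix's actual API): the parser builds a tree of nodes, and a visitor walks the tree collecting whatever it cares about, such as links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration of the Visitor pattern for HTML trees (not Quiotix's real API):
// the parser builds a tree of nodes, and a visitor walks it collecting links.
public class VisitorSketch {

    interface Visitor {
        void visitTag(Tag tag);
        void visitText(Text text);
    }

    interface Node { void accept(Visitor v); }

    static class Tag implements Node {
        final String name;
        final Map<String, String> attributes;
        final List<Node> children = new ArrayList<Node>();
        Tag(String name, Map<String, String> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
        public void accept(Visitor v) {
            v.visitTag(this);
            for (Node child : children) {
                child.accept(v); // recurse so the visitor sees the whole tree
            }
        }
    }

    static class Text implements Node {
        final String text;
        Text(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    // Collects href attributes from every <a> tag it visits
    static class LinkCollector implements Visitor {
        final List<String> links = new ArrayList<String>();
        public void visitTag(Tag tag) {
            if ("a".equals(tag.name) && tag.attributes.containsKey("href")) {
                links.add(tag.attributes.get("href"));
            }
        }
        public void visitText(Text text) { /* not interested in text */ }
    }

    public static void main(String[] args) {
        Tag body = new Tag("body", java.util.Collections.<String, String>emptyMap());
        Tag anchor = new Tag("a",
                java.util.Collections.singletonMap("href", "http://example.com/"));
        anchor.children.add(new Text("Example"));
        body.children.add(anchor);

        LinkCollector collector = new LinkCollector();
        body.accept(collector);
        System.out.println(collector.links); // prints [http://example.com/]
    }
}
```

The nice thing about this style is that the traversal logic lives in the node classes, so extracting something different (say, all image sources) only means writing another small visitor.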
 
Scott Dunbar
Ranch Hand
Posts: 245
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases - SSL, cookies, Basic and form-based auth, etc. Harbir - a lot of it depends on your requirements.
 