| Author |
scanning the webpage
|
Harbir Singh
Greenhorn
Joined: May 11, 2005
Posts: 1
|
|
I have to write a program that is given the URL of a web page, scans the page for all links to other pages, and prints those links. I am new to this, so if someone could give me an idea of how to handle it, that would be great. Thanks. Regards, Harbir
|
 |
Scott Dunbar
Ranch Hand
Joined: Sep 23, 2004
Posts: 245
|
|
Should I reply here or over here? I'd start with something like Apache Commons HttpClient to get the page. After that you'll likely want to use some regular expressions to look for "<a" and an href. This part isn't trivial - there are very few well-written web pages out there. And you will still miss anchors that are generated by JavaScript.
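A minimal sketch of the regular-expression approach described above, offered as one possible starting point rather than a robust solution. The class name LinkExtractor and the method extractLinks are made up for illustration, and the pattern is an assumption about what "look for <a and an href" might mean in practice: it accepts double-quoted, single-quoted, and unquoted href values on anchor tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class LinkExtractor {

    // Matches an <a ...> start tag and captures its href value.
    // Group 1: double-quoted value, group 2: single-quoted, group 3: unquoted.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A HREF='b.html'>B</A> <a class=x href=c.html>C</a></p>";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

As the post warns, a pattern like this will trip over badly formed markup (for example a ">" inside an attribute value) and cannot see links that are created by JavaScript at run time.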
|
|
 |
Jesper de Jong
Java Cowboy
Bartender
Joined: Aug 16, 2005
Posts: 12907
|
|
You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page. You can use the class java.net.URL for that. If you want to parse HTML, you could use a library like this one: http://htmlparser.sourceforge.net/
|
|
 |
Pradhip Prakash
Greenhorn
Joined: Oct 13, 2005
Posts: 3
|
|
This program asks for a URL. Enter a valid URL and press Enter.

import java.net.*;
import java.io.*;

public class URLRead {
    public static void main(String[] args) throws Exception {
        // Read the URL from standard input.
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String s;
        System.out.println("Enter Your URL :");
        s = br.readLine();

        // Open the page and print it line by line.
        URL url = new URL(s);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
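A variation on the program above, sketched under a few assumptions: instead of printing each line, it collects the page into a single String so that it can be searched for links afterwards. The names PageReader, readAll and fetch are invented for this example, UTF-8 is assumed as the page encoding (real pages may declare something else), and the URL is taken from the command line rather than typed in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class PageReader {

    // Reads everything from the given reader, line by line, into one String.
    // try-with-resources closes the reader even if reading fails.
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Opens the address with java.net.URL and returns the page text.
    // Assumption: the page is encoded as UTF-8.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageReader <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

Separating readAll from fetch means the reading logic can be tried on a StringReader without any network access.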
|
 |
Stan James
(instanceof Sidekick)
Ranch Hand
Joined: Jan 29, 2003
Posts: 8791
|
|
|
Once you have the page as a String, to dig out the links it would be good to have a solid HTML parser. I like the Quiotix Parser because it has a neat Visitor interface.
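To illustrate the parser-with-callbacks idea without depending on a third-party library, here is a rough sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) instead of the Quiotix parser. That is a deliberate substitution: its ParserCallback is similar in spirit to a visitor, but it is not the Quiotix API. The class name CallbackLinkLister and the method listLinks are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the JDK's own HTML parser (not Quiotix).
class CallbackLinkLister {

    static List<String> listLinks(String html) throws IOException {
        List<String> links = new ArrayList<>();

        // The parser calls handleStartTag for every opening tag it sees;
        // only <a> tags that carry an href attribute are recorded.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page,
        // which is appropriate because the input is already a String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<html><body><a href=\"one.html\">1</a>"
                + "<p>text <a href='http://example.com/two'>2</a></p></body></html>";
        for (String link : listLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Compared with a regular expression, a real parser copes better with attribute order, quoting styles, and sloppy markup, though it still cannot see links that JavaScript adds later.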
|
|
 |
Scott Dunbar
Ranch Hand
Joined: Sep 23, 2004
Posts: 245
|
|
|
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases - SSL, cookies, Basic and form-based auth, etc. Harbir - a lot of it depends on your requirements.
|
 |