Extract selected links from html

purnima Nair
Greenhorn

Joined: Oct 23, 2008
Posts: 8
I need to extract links and text from an HTML page.


The above code gets all the links and text. I need to get the text corresponding to particular links.
The HTML page has lots of links. I need to get only selected links, excluding the links in the header, footer and side menus. Please help.
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
Is this for a particular web site? If so, the headers, footers and menus are probably part of a named DIV, or have a particular class associated. HTML parsers like HtmlUnit (try this one first), nekohtml, htmlcleaner or TagSoup should be able to give you access to that information.
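As a rough illustration of the "named DIV or class" idea, here is a minimal sketch that skips every link nested inside a DIV whose id or class is on an exclusion list. It uses the parser bundled with the JDK (javax.swing.text.html) instead of HtmlUnit, purely so it runs without extra dependencies. The names "header", "footer" and "menu", and the class and method names, are assumptions made for the example; a real site will use its own names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ContentLinkFilter {

    // Hypothetical container names; adjust them to the site being parsed.
    static final Set<String> EXCLUDED =
            new HashSet<>(Arrays.asList("header", "footer", "menu"));

    // Returns the href of every <a> that is not nested inside an excluded DIV.
    static List<String> contentLinks(Reader html) throws IOException {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while we are inside an excluded DIV (counts nesting).
            private int excludedDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.DIV) {
                    if (excludedDepth > 0 || isExcluded(attrs)) {
                        excludedDepth++;
                    }
                } else if (tag == HTML.Tag.A && excludedDepth == 0) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && excludedDepth > 0) {
                    excludedDepth--;
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return hrefs;
    }

    // Simplification: compares the whole id/class value, so a multi-valued
    // class such as "menu side" would not match.
    static boolean isExcluded(MutableAttributeSet attrs) {
        Object id = attrs.getAttribute(HTML.Attribute.ID);
        Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
        return (id != null && EXCLUDED.contains(id.toString().toLowerCase()))
                || (cls != null && EXCLUDED.contains(cls.toString().toLowerCase()));
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>"
                + "<div id=\"header\"><a href=\"/home\">Home</a></div>"
                + "<div class=\"content\"><a href=\"/article\">Article</a></div>"
                + "</body></html>";
        System.out.println(contentLinks(new StringReader(page)));
    }
}
```

This only works when the page actually marks its header, footer and menus with recognisable DIVs, so it is a per-site rule rather than a general solution.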
purnima Nair
Greenhorn

Joined: Oct 23, 2008
Posts: 8
No, this is to get links from any website, not from one particular website.
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
That is most likely impossible to achieve in the general sense, unless you can give an algorithm that determines which links should be selected.
purnima Nair
Greenhorn

Joined: Oct 23, 2008
Posts: 8
Since it is required for all websites and we cannot generalise across different websites, it will not be possible to get only the selected links. Right?
Now another query:
My code mentioned above retrieves all the text and all the links from the website. How can we get only the text corresponding to particular links?
E.g.:
I need to get the text 'Hello' corresponding to the href link 'www.google.com'.

Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
They're not passed in via the handleText method?

The Swing HTML stuff is kinda weak, though. If this was my problem, I'd use HtmlUnit.
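For anyone staying with the Swing parser, here is a minimal sketch of pairing each link with its text: remember the href when an A tag opens, collect whatever handleText delivers while that link is open, and store the pair when the A tag closes. The class name LinkTextExtractor and the method name extract are made up for the example, and it assumes links are not nested.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkTextExtractor {

    // Returns a map of href -> link text, in document order.
    // If the same href occurs twice, the later text replaces the earlier one.
    static Map<String, String> extract(Reader html) throws IOException {
        final Map<String, String> links = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String currentHref;                       // non-null while inside <a href=...>
            private final StringBuilder currentText = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    currentHref = (href == null) ? null : href.toString();
                    currentText.setLength(0);
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only keep text that appears while a link is open.
                if (currentHref != null) {
                    currentText.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && currentHref != null) {
                    links.put(currentHref, currentText.toString().trim());
                    currentHref = null;
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>"
                + "<a href=\"http://www.google.com\">Hello</a> some other text"
                + "</body></html>";
        Map<String, String> links = extract(new StringReader(page));
        System.out.println(links.get("http://www.google.com"));
    }
}
```

Looking up a particular href in the returned map then gives the text for just that link, and text outside any link is ignored.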
 