This week's book giveaway is in the OCMJEA forum.
We're giving away four copies of OCM Java EE 6 Enterprise Architect Exam Guide and have Paul Allen & Joseph Bambara on-line!
See this thread for details.
how to parse html webpage

naga raaju
Greenhorn

Joined: Mar 14, 2008
Posts: 29
Hi guys, can anybody give me an idea of how to parse an HTML web page from a live URL using Java?

I have some code, but the output is still raw HTML tags. How can I strip out the tags? Any ideas, friends?

import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://finance.yahoo.com");

        // try-with-resources closes both streams; closing the writer also
        // flushes it, so the buffered text actually reaches sample.txt
        try (BufferedReader in = new BufferedReader(new InputStreamReader(yahoo.openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("sample.txt"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                wr.write(inputLine);
                wr.newLine(); // readLine() strips the line terminator
            }
        }
    }
}
bye
Naga
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41621
There are many things you might want to accomplish with a downloaded web page. You need to tell us what you're trying to do with it.

If you want to extract the text, I'd start by converting the HTML into well-formed XML; libraries like NekoXNI, JTidy and TagSoup can do this for you.


naga raaju
Greenhorn

Joined: Mar 14, 2008
Posts: 29
Hi,
thanks for your reply.
I need to extract some text from the web pages, so what should I do?

Do I have to depend on a third-party API, or is that possible with plain Java code?


bye
Naga
Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8877

There is an HTML parser provided in the Java API. As Ulf says, it depends on your exact requirements whether it will fit the bill or not.
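As a rough sketch of that approach, assuming the parser meant here is the Swing one in javax.swing.text.html.parser: the callback below collects only the text between tags and ignores the tags themselves. The class name HtmlTextExtractor, the method extractText and the sample markup are made up for illustration. The parser is lenient and fairly old, so it may not cope well with every modern page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original thread.
public class HtmlTextExtractor {

    // Reads HTML from the given reader and returns the text found between tags,
    // with the individual text runs separated by single spaces.
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once for each run of text the parser finds
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample markup; a real page could be read via URL.openStream()
        String page = "<html><body><h1>Quotes</h1><p>Price: <b>42</b></p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

To use it on a live page, the BufferedReader from the URLReader code earlier in the thread could be passed to extractText instead of the StringReader.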
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41621
That depends on the specifics. Are you talking about one particular page on one particular site? Several pages? Several sites? Is the layout of the page(s) predictable? Are there ID tags on which you can rely?

You will need to do some coding, but the libraries I mentioned will help you get started.
Randi Randwa
Greenhorn

Joined: Feb 21, 2009
Posts: 7

You can also use biterscripting (.com for free download) for parsing HTML. It works great.

They have a sample script posted at http://www.biterscripting.com/SS_URLs.html that extracts referenced URLs from a page. Another sample script, http://www.biterscripting.com/SS_SearchURL.html, will search a page for specific search words. The sample script http://www.biterscripting.com/SS_SearchWeb.html is de facto your own search engine.

You can get started with these scripts.

If you come up with new HTML parsing scripts of your own, can you please post them for the rest of us? Thanks.

Randi
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38509
Welcome to JavaRanch, Randi, but please don't resurrect 10-month-old threads. Have a look at this FAQ.