how to parse html webpage
naga raaju
Greenhorn
Joined: Mar 14, 2008
Posts: 29
Hi guys, can anybody give me an idea of how to parse an HTML web page from a live URL using Java? I have code, but the output is still in the form of HTML tags. How can I strip out the tags and get just the content? Please give me some ideas, friends.

    import java.net.*;
    import java.io.*;

    public class URLReader {
        public static void main(String[] args) throws Exception {
            URL yahoo = new URL("http://finance.yahoo.com");
            // try-with-resources closes (and flushes) both streams, even on error
            try (BufferedReader in = new BufferedReader(new InputStreamReader(yahoo.openStream()));
                 BufferedWriter wr = new BufferedWriter(new FileWriter("sample.txt"))) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    wr.write(inputLine);
                    wr.newLine(); // readLine() strips the line terminator
                }
            }
        }
    }

Bye,
Naga
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35258
posted Apr 23, 2008 05:04:00
There are many things you might want to accomplish with a downloaded web page. You need to tell us what you're trying to do with it. If you want to extract the text, I'd start by converting the HTML into well-formed XML; libraries like NekoXNI, JTidy and TagSoup can do this for you.
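As a rough illustration of the step that follows such a conversion, here is a minimal sketch using only the JDK's built-in XML and XPath APIs. It assumes one of the libraries above has already turned the page into well-formed XML; the class name `XhtmlTextExtractor`, the method `select`, and the sample markup are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlTextExtractor {
    // Returns the trimmed text content of every node matched by the XPath expression.
    // Assumes the input is already well-formed XML (hypothetical helper, not a library API).
    static List<String> select(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Sample markup standing in for the cleaned-up page
        String xhtml = "<html><body><p>First</p><p>Second <b>bold</b></p></body></html>";
        System.out.println(select(xhtml, "//p"));
    }
}
```

Output from a real clean-up tool may declare an XML namespace on the root element, in which case a plain expression like `//p` may need adjusting.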
naga raaju
Greenhorn
Joined: Mar 14, 2008
Posts: 29
Hi, thanks for your reply. I need to extract some text from the web pages, so what should I do? Can I depend on a third-party API, or is this possible with plain Java coding?

Bye,
Naga
Joe Ess
Bartender
Joined: Oct 29, 2001
Posts: 8265
There is an HTML parser provided in the Java API. As Ulf says, it depends on your exact requirements whether it will fit the bill or not.
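For reference, the parser in question lives in `javax.swing.text.html`. The following is a minimal sketch of using it to pull the text out of a page while dropping the tags. The class name `HtmlTextExtractor`, the method `extractText`, and the sample markup are assumptions made for this example; how well the parser copes with a given real-world page is not guaranteed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {
    // Collects every text run the parser reports, separated by single spaces.
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // Third argument: ignore charset declarations found inside the document
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample markup; a Reader over URL.openStream() could be passed instead
        String html = "<html><body><h1>Quotes</h1><p>Price: <b>42</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

The callback approach means the page is processed as a stream of events, so nothing beyond the collected text needs to be held in memory.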
"blabbing like a narcissistic fool with a superiority complex" ~ N.A.
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35258
posted Apr 23, 2008 06:39:00
That depends on the specifics. Are you talking about one particular page on one particular site? Several pages? Several sites? Is the layout of the page(s) predictable? Are there ID tags on which you can rely? You will need to do some coding, but the libraries I mentioned will help you get started.
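If the page does carry reliable `id` attributes, one possible approach is to collect only the text inside the element with a known id. The sketch below does this with the JDK's own `javax.swing.text.html` parser rather than the libraries mentioned earlier. The class name `ElementByIdExtractor`, the method `textOfElementWithId`, and the ids in the sample markup are all hypothetical, and it assumes the parser reports balanced start and end tags for the page in question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ElementByIdExtractor {
    // Returns the text found inside the first element whose id attribute equals the given id.
    static String textOfElementWithId(Reader html, final String id) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;          // greater than 0 while inside the target element
            private boolean found = false;  // true once the target element has been seen

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (depth > 0) {
                    depth++; // nested element inside the target
                } else if (!found && id.equals(attrs.getAttribute(HTML.Attribute.ID))) {
                    found = true;
                    depth = 1;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data);
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Made-up markup with made-up ids, standing in for a predictable page layout
        String html = "<html><body><div id=\"header\">Yahoo Finance</div>"
                + "<div id=\"quote\">Price: <b>42.10</b></div></body></html>";
        System.out.println(textOfElementWithId(new StringReader(html), "quote"));
    }
}
```

This only works as long as the site keeps its ids stable, which is why the questions about layout predictability matter.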
Randi Randwa
Greenhorn
Joined: Feb 21, 2009
Posts: 7
You can also use biterscripting (free download from biterscripting.com) for parsing HTML. It works great.
They have a sample script posted at http://www.biterscripting.com/SS_URLs.html . This script extracts the URLs referenced from a page. Another sample script, http://www.biterscripting.com/SS_SearchURL.html , will search a page for specific search words. The sample script http://www.biterscripting.com/SS_SearchWeb.html is effectively your own search engine.
You can get started with these scripts.
If you come up with new HTML parsing scripts of your own, can you please post them for the rest of us? Thanks.
Randi
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32717
Welcome to JavaRanch, Randi, but please don't resurrect 10-month-old threads. Have a look at this FAQ.