Java parsing a web page

Jim Size
Greenhorn

Joined: Aug 10, 2010
Posts: 29
Hello there. I am new to this forum, although it helped me a lot with a big project I had last summer, so thanks for the great information you have.

I want to know how I can add some code to my program that will visit a web page and read some things from it.
(After that I want to save the info to a file, but I know how to do that.)
I saw an older topic in this forum that covers a similar problem by somehow using HTML.
I don't have HTML experience, so I couldn't follow it.

It would be great if you could help.

Do I have to use an HTML parser?

Thanks for your time.


Nicola Garofalo
Ranch Hand

Joined: Apr 10, 2010
Posts: 308
Hi J,
I bet you will shortly be tortured for your nickname.

What do you want to parse: all HTML tags, or just the ones you need?

If it's just for an exercise, you could build the parser yourself, using regular expressions for example.

If you are looking for a library, there are ready-to-use HTML parsers.

With a Google search I found a list you can explore at this link:

http://java-source.net/open-source/html-parsers


Bye,
Nicola
Jim Size
Greenhorn

Joined: Aug 10, 2010
Posts: 29
I would love to take all the torture, haha, Mr. Garofalo.

Yeah, I found some parsers myself, but I really don't get the idea of HTML parsers; I can't understand what they do.

For example, I want this part of my program to get a paragraph from a blog. By "get" I mean having access to the paragraph so I can copy it and then maybe save it, etc.
Would I still use an HTML parser?

Sorry for the confusion, but I am totally new to this part of Java.
Nicola Garofalo
Ranch Hand

Joined: Apr 10, 2010
Posts: 308
You could use a parser, but you don't have to. I repeat, it's up to you to decide what you want to do.
A parser parses an HTML document, and as a result you have complete control over the document's structure.
If that is too much for you and you just need a word recognizer, then build one with regular expressions.

If you have already put down some ideas, just share them here and someone will surely help you with your issues, step by step.

Sean Clark
Rancher

Joined: Jul 15, 2009
Posts: 377

Hey,

I was recently doing the same. I recommend HtmlUnit: it acts like a browser, fetching data from URLs and parsing it behind a good high-level API. It also offers some DOM/XPath tools for when things get a bit difficult.

Sean


I love this place!
Gaurav Raje
Ranch Hand

Joined: Jul 23, 2010
Posts: 136
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

That's how the project describes itself.

Apart from this, most parsers will tokenize HTML code into units, so you deal with tokens instead of plain text and get some semantic information about the HTML you are working with.
Depending on the parser, the tokens may be different.
Some parsers may turn the data into objects and arrays (which I think HtmlUnit does).
Almost all of them will parse it into XML.
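
To make the idea of "tokens" a bit more concrete, here is a minimal sketch in plain Java. It is not how HtmlUnit or any real parser works internally; the class name TinyHtmlTokenizer and its tokenize method are made up for illustration, and it only splits a string into tag tokens and text tokens. It does not handle comments, scripts, or a ">" inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
public class TinyHtmlTokenizer {
    // Matches either a tag such as <p> or </p>, or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    /** Splits HTML into tag tokens and text tokens, skipping whitespace-only text. */
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group().trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello ranch</p></body></html>";
        for (String t : tokenize(html)) {
            // A token that starts with '<' is a tag; anything else is text.
            String kind = t.startsWith("<") ? "TAG " : "TEXT";
            System.out.println(kind + " " + t);
        }
    }
}
```

A real parser goes further and builds a tree out of these tokens, which is what gives you the "control over the document structure" mentioned earlier in the thread.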
Gaurav Raje
Ranch Hand

Joined: Jul 23, 2010
Posts: 136
J Sizeas wrote: I would love to take all the torture, haha, Mr. Garofalo.

Yeah, I found some parsers myself, but I really don't get the idea of HTML parsers; I can't understand what they do.

For example, I want this part of my program to get a paragraph from a blog. By "get" I mean having access to the paragraph so I can copy it and then maybe save it, etc.
Would I still use an HTML parser?

Sorry for the confusion, but I am totally new to this part of Java.


If it's as simple as this, you might be better off using string/text operations. You need parsers if you are thinking of slightly more sophisticated stuff. Not that you can't use parsers, but it might end up being more tedious. Try reading up on regular expressions and see if you can get by with just that. Either way, personal experience says that even if you do use parsers for web mining, you end up needing regular expressions anyway.
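
As a rough sketch of what "string/text operations" could look like, the snippet below pulls out the first paragraph of a page using only indexOf and substring. The class name ParagraphBySubstring and the method firstParagraph are made up for this example, and it assumes the page uses a plain <p> tag with no attributes, which real blogs often do not.

```java
// Hypothetical helper for illustration only.
public class ParagraphBySubstring {
    /**
     * Returns the text between the first "<p>" and the next "</p>",
     * or null if no such pair exists.
     * Assumption: the opening tag is exactly "<p>" (lowercase, no attributes).
     */
    public static String firstParagraph(String html) {
        String open = "<p>";
        String close = "</p>";
        int start = html.indexOf(open);
        if (start < 0) {
            return null; // no opening tag found
        }
        start += open.length(); // move past the opening tag
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null; // opening tag without a closing tag
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String html = "<div><p> First post text. </p><p>Second.</p></div>";
        System.out.println(firstParagraph(html)); // prints "First post text."
    }
}
```

This breaks as soon as the markup varies (for example <p class="..."> or <P>), which is the point where regular expressions or a parser start to pay off.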
Jim Size
Greenhorn

Joined: Aug 10, 2010
Posts: 29
One last thing; maybe I didn't make myself clear about my goal. I want to take some paragraphs (the user will decide which) from the website's posts, not just a word.

How can I do all this with simple "string/text operations", for example? With the help of the URL constructor?

Gaurav Raje, if there is a book or website that explains string/text operations that do this job, it would be great if you could share it so I can learn.

I want to say thanks for the post. It's a great help, because you gave me new things to search for that I couldn't find anywhere.

It would be great if you could give me some sites or suggestions for good books that cover this job and teach the ideas behind parsers, because I am very interested.

Thanks again, guys.
Gaurav Raje
Ranch Hand

Joined: Jul 23, 2010
Posts: 136
If these paragraphs are separated by, say, <p> tags or maybe some other unique marker (which I believe they are), I think regular expressions should be fine.
www.regular-expressions.info would be your best resource. I love the site and it has a lot to offer. Alternatively, there are quite a few basic tutorials which you might find by googling.
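
Putting the two pieces together (the URL class and a regular expression for <p> tags), a minimal sketch might look like the one below. The class name ParagraphScraper and its methods are made up for this example. It assumes the page is served as UTF-8 and that paragraphs are not nested, and it makes no attempt to handle malformed HTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class ParagraphScraper {
    // <p ...> ... </p>, case-insensitive, with '.' also matching line breaks.
    // The \b keeps tags such as <pre> from being mistaken for <p>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag inside a paragraph, e.g. <b> or <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Reads everything from the reader into one String. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Downloads a page as text. Assumption: the page is encoded as UTF-8. */
    public static String fetch(String address) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }

    /** Returns the text of each non-empty <p> element, with inner tags stripped. */
    public static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String text = TAG.matcher(m.group(1)).replaceAll("").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Pass a URL as the first argument, or run without one to use the sample.
        String html = args.length > 0
                ? fetch(args[0])
                : "<p>One</p>\n<P class=\"x\">Two <b>bold</b></P>";
        List<String> found = paragraphs(html);
        for (int i = 0; i < found.size(); i++) {
            System.out.println((i + 1) + ": " + found.get(i));
        }
    }
}
```

From the numbered list the user could pick which paragraphs to keep, and saving them to a file is the part you already know how to do.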
 