Screen scraping (extracting data from a web page) in Java

muthu bharathi
Ranch Hand

Joined: Dec 10, 2008
Posts: 97
Hi everybody,

I want to know how to do screen scraping in Java. Is there an open source tool that extracts data from a website and stores it as XML, Excel, or any other format?

Please help me as soon as possible.


--
Regards,
M. Bharathi
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42376
I'd probably use a library like jWebUnit for downloading the pages, and extracting the relevant parts. Then you can use any XML- or XLS-creating library you like for storing the interesting parts.
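
As a rough illustration of the "download, extract, store" flow described above, here is a minimal sketch. It does not use jWebUnit: it swaps in the JDK's own `java.net.http.HttpClient` (Java 11+), a deliberately naive regular expression instead of a real HTML parser, and CSV output (which Excel can open) instead of an XML or XLS library. The class name `ScrapeSketch`, its method names, and the choice of extracting links are all assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {
    // Naive pattern for <a ... href="...">text</a>. Real pages usually need
    // a proper HTML parser; this only handles double-quoted href values.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Downloads the page body as a string (works for http and https URLs).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Returns {href, linkText} pairs exactly as they appear in the page source.
    static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return links;
    }

    // Quotes one CSV field, doubling any embedded double quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Renders the extracted pairs as CSV with a header row.
    static String toCsv(List<String[]> links) {
        StringBuilder sb = new StringBuilder("href,text\n");
        for (String[] link : links) {
            sb.append(quote(link[0])).append(',').append(quote(link[1])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java ScrapeSketch <url>");
            return;
        }
        System.out.print(toCsv(extractLinks(fetch(args[0]))));
    }
}
```

Redirecting the output to a `.csv` file gives something a spreadsheet can open. For anything beyond simple pages, replacing the regular expression with an HTML parsing library is the more robust choice.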


muthu bharathi
Ranch Hand

Joined: Dec 10, 2008
Posts: 97
Hi Ulf,

Thanks for your quick response.

I searched the net and found an open source tool, but it works only for HTTP, and I need to scrape data from HTTPS pages.

I'm a newbie. I tried to write the code using jWebUnit, but it isn't working. Can you give me sample jWebUnit code? I'd also like to know whether jWebUnit supports HTTPS, because I need to extract data from HTTPS pages as well.

Awaiting your reply.

--
With Thanks
M. Bharathi
Gamini Sirisena
Ranch Hand

Joined: Aug 05, 2008
Posts: 357
This may be what you need. It's the first result for "jwebunit https" on Google.

The site talks about untrusted certificates, so jWebUnit may already trust a number of certificates from certificate authorities. It might be using the Java truststore itself?
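
Whether jWebUnit really relies on the Java truststore is not confirmed in this thread, but the JDK mechanism itself can be sketched. The example below builds an `SSLContext` from either the JVM's default truststore (pass `null`) or a keystore file holding an extra certificate, and hands it to the JDK `HttpClient` (Java 11+). The class name `TrustStoreSketch`, the method names, and the command-line arguments are assumptions made for illustration.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class TrustStoreSketch {
    // Loads a keystore file (for example one created with keytool) that
    // contains the certificates to trust.
    static KeyStore load(Path file, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(file)) {
            ks.load(in, password);
        }
        return ks;
    }

    // Builds a TLS context that trusts the given store. Passing null makes
    // the factory fall back to the JVM's default truststore.
    static SSLContext contextFrom(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }

    // Creates an HTTP client that validates server certificates with ctx.
    static HttpClient clientFor(SSLContext ctx) {
        return HttpClient.newBuilder().sslContext(ctx).build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: java TrustStoreSketch <truststore> <password> <https-url>");
            return;
        }
        KeyStore store = load(Path.of(args[0]), args[1].toCharArray());
        HttpClient client = clientFor(contextFrom(store));
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[2])).GET().build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```

Sites whose certificates come from a well-known certificate authority normally need none of this, since the default truststore already covers them. A custom store only matters for self-signed or otherwise untrusted certificates.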
muthu bharathi
Ranch Hand

Joined: Dec 10, 2008
Posts: 97
Hi,

Thanks for your response. I have managed to scrape data over both HTTP and HTTPS with an open source web data extractor tool.

But there is one issue with that tool: I cannot scrape data from HTTPS pages that use a session. Please help or guide me on this. I searched the net but had no success.


--
with thanks,
M. Bharathi
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39478
Too difficult a question for beginners. Moving.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42376
muthu bharathi wrote: But there is one issue with that tool: I cannot scrape data from HTTPS pages that use a session.

Why not? jWebUnit supports cookies, if that's what's used for the sessions. If the sessions use URL rewriting, then there's no problem to begin with.
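
To show what cookie-based session handling looks like in practice, here is a minimal sketch. It again uses the JDK `HttpClient` (Java 11+) rather than jWebUnit: a `java.net.CookieManager` stores the session cookie set by the login response and sends it back on later requests. The class name `SessionSketch`, the method names, and the idea of a form-based login are assumptions; the real login URL and form field names depend entirely on the site being scraped.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SessionSketch {
    // Builds a client that keeps cookies between requests, so a session
    // cookie received at login is replayed automatically afterwards.
    static HttpClient sessionClient(CookieManager cookies) {
        return HttpClient.newBuilder()
                .cookieHandler(cookies)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Posts a URL-encoded form body, e.g. "user=me&password=secret".
    // The field names are site-specific and purely hypothetical here.
    static String postForm(HttpClient client, String url, String formBody) throws Exception {
        HttpRequest post = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Fetches a page using whatever cookies the client has collected so far.
    static String get(HttpClient client, String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: java SessionSketch <login-url> <form-body> <data-url>");
            return;
        }
        HttpClient client = sessionClient(new CookieManager());
        postForm(client, args[0], args[1]);           // server sets the session cookie
        System.out.println(get(client, args[2]));     // cookie is sent back here
    }
}
```

The important detail is that the same client instance is reused for the login and for the later page requests. Creating a fresh client, or a fresh `CookieManager`, for each request discards the session.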
muthu bharathi
Ranch Hand

Joined: Dec 10, 2008
Posts: 97
Hi,

I tried a lot, but I can't get any output. Please give me some sample source.




--
regds,
M. Bharathi
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42376
What have you tried? Post a relevant code excerpt. What, exactly, happened when you ran it?
nhu dinh thuan
Greenhorn

Joined: Jul 20, 2004
Posts: 3
muthu bharathi wrote: Hi everybody,

I want to know how to do screen scraping in Java. Is there an open source tool that extracts data from a website and stores it as XML, Excel, or any other format?

Please help me as soon as possible.


--
Regards,
M. Bharathi


See the screenshots here: http://binhgiang.sourceforge.net/xmlalbum/screenshots.html

You can download the free version of the web data extractor from http://binhgiang.sourceforge.net/site/download.jsp

VDer is built on a Java HTML parser, which you can download from http://sourceforge.net/projects/binhgiang/files/htmlparser/HTMLParser2_Build9.zip/download and which is open source.
muthu bharathi
Ranch Hand

Joined: Dec 10, 2008
Posts: 97
Hi,

Thanks for your valuable guidance.

One thing I need to know: does it scrape HTTPS data?



--
With Thanks
M. bharathi
 