
Screen scraping (extracting data from a web page) in Java

 
muthu bharathi
Ranch Hand
Posts: 97
Hi everybody,

I want to know how to do screen scraping in Java. Is there an open source tool that can extract data from a website and store it as XML, Excel, or some other format?

Please help me as soon as possible.


--
Regards,
M. Bharathi
 
Ulf Dittmer
Rancher
Posts: 42968
73
I'd probably use a library like jWebUnit for downloading the pages, and extracting the relevant parts. Then you can use any XML- or XLS-creating library you like for storing the interesting parts.
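
To make that suggestion concrete, here is a minimal sketch of the download / extract / store steps. It does not use jWebUnit; it swaps in the JDK's own `java.net.http.HttpClient` (Java 11 or later, which accepts both http and https URLs), a deliberately naive regular expression, and the built-in XML APIs. The class name `ScrapeSketch`, the choice of extracting link targets, and the `https://example.com/` URL are assumptions made purely for illustration.

```java
import java.io.StringWriter;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ScrapeSketch {

    // Naive pattern for <a ... href="..."> tags; a real HTML parser is more robust.
    private static final Pattern LINK =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Step 1: download the page body as text (works for http and https URLs).
    public static String download(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 2: extract the interesting parts, here the href value of every link.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Step 3: store the extracted values as a small XML document.
    public static String toXml(List<String> links) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("links");
            doc.appendChild(root);
            for (String link : links) {
                Element e = doc.createElement("link");
                e.setTextContent(link);
                root.appendChild(e);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; replace with the page you actually want to scrape.
        String html = download("https://example.com/");
        System.out.println(toXml(extractLinks(html)));
    }
}
```

The three steps are kept as separate methods so that the extraction step can be replaced by a proper HTML parser, or the storage step by a spreadsheet library, without touching the rest.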
 
muthu bharathi
Ranch Hand
Posts: 97
Hi Ulf,

Thanks for your quick response.

I searched the net and found an open source tool, but it works for HTTP only. I need to scrape data from HTTPS sites.

I'm a newbie. I tried to write the code using jWebUnit, but it isn't working. Can you give me some sample jWebUnit code? I also want to know whether jWebUnit supports HTTPS, because I need to extract data from HTTPS pages as well.

Awaiting your reply.

--
With Thanks
M. Bharathi
 
Gamini Sirisena
Ranch Hand
Posts: 378
This may be what you need. It is the first result for "jwebunit https" on Google.

The site talks about untrusted certificates, so jWebUnit may already trust a number of certificates from certificate authorities. It might be using the Java truststore itself?
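
For reference, here is a small sketch that lists which certificate authorities the JVM itself trusts by default. Whether jWebUnit relies on this truststore is not confirmed anywhere in this thread, so treat the connection as an assumption; the code only uses standard `javax.net.ssl` classes, and the class name `TrustStoreSketch` is made up for illustration.

```java
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class TrustStoreSketch {

    // Returns the CA certificates accepted by the JVM's default trust manager.
    public static X509Certificate[] defaultTrustedIssuers() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null tells the factory to load the JVM's default truststore.
            tmf.init((KeyStore) null);
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return ((X509TrustManager) tm).getAcceptedIssuers();
                }
            }
            return new X509Certificate[0];
        } catch (Exception e) {
            throw new IllegalStateException("Could not load the default truststore", e);
        }
    }

    public static void main(String[] args) {
        X509Certificate[] issuers = defaultTrustedIssuers();
        System.out.println("Trusted CA certificates: " + issuers.length);
        // Print a few subject names so the list is easy to inspect.
        for (int i = 0; i < Math.min(5, issuers.length); i++) {
            System.out.println(issuers[i].getSubjectX500Principal().getName());
        }
    }
}
```

If the HTTPS site you want to scrape uses a certificate signed by one of these authorities, a plain Java HTTPS connection should normally succeed without extra setup. A self-signed or otherwise untrusted certificate is the case that needs additional configuration.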
 
muthu bharathi
Ranch Hand
Posts: 97
Hi,

Thanks for your response. I have managed to scrape data over both HTTP and HTTPS using an open source web data extractor tool.

But there is one issue with that tool: I cannot scrape data from HTTPS pages that use a session. Please help or guide me with this issue. I searched the net, but everything I tried failed.


--
with thanks,
M. Bharathi
 
Campbell Ritchie
Sheriff
Pie
Posts: 49379
62
Too difficult a question for beginners. Moving.
 
Ulf Dittmer
Rancher
Posts: 42968
73
But there is one issue with that tool: I cannot scrape data from HTTPS pages that use a session.

Why not? jWebUnit supports cookies, if that's what is used for the sessions. If the sessions use URL rewriting, then there's no problem to begin with.
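
To illustrate the cookie case without jWebUnit, here is a hedged sketch using the JDK's `java.net.CookieManager`. It shows the mechanism only: a `Set-Cookie` header received from one page is stored and then sent back as a `Cookie` header on the next request to the same site. The host `shop.example.com`, the paths, and the class name `SessionCookieSketch` are hypothetical.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.List;
import java.util.Map;

public class SessionCookieSketch {

    // Simulates a login response carrying the given Set-Cookie value, then
    // returns the Cookie header that would be sent on a follow-up request.
    public static String cookieHeaderAfterLogin(String setCookieValue) {
        try {
            CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

            // Hypothetical login page that starts the session.
            URI login = URI.create("https://shop.example.com/login");
            manager.put(login, Map.of("Set-Cookie", List.of(setCookieValue)));

            // Hypothetical protected page requested afterwards.
            URI account = URI.create("https://shop.example.com/account");
            List<String> cookies = manager.get(account, Map.of()).get("Cookie");
            return (cookies == null || cookies.isEmpty()) ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new IllegalStateException("Cookie handling failed", e);
        }
    }

    // For real requests, attach a CookieManager to the client so that the
    // session cookie is stored and replayed automatically (Java 11+).
    public static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(cookieHeaderAfterLogin("JSESSIONID=abc123; Path=/; Secure"));
    }
}
```

With URL rewriting, the session ID is carried in the links of each page rather than in a cookie, so following the links exactly as they appear in the downloaded HTML keeps the session alive without any cookie handling.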
 
muthu bharathi
Ranch Hand
Posts: 97
Hi,

I tried a lot, but I can't get any output. Please give me some sample source code.




--
regds,
M. Bharathi
 
Ulf Dittmer
Rancher
Posts: 42968
73
What have you tried? Post a relevant code excerpt. What, exactly, happened when you ran it?
 
nhu dinh thuan
Greenhorn
Posts: 3
muthu bharathi wrote: Hi everybody,

I want to know how to do screen scraping in Java. Is there an open source tool that can extract data from a website and store it as XML, Excel, or some other format?

Please help me as soon as possible.


--
Regards,
M. Bharathi


See the screenshots here: http://binhgiang.sourceforge.net/xmlalbum/screenshots.html

You can download the free version of the web data extractor from http://binhgiang.sourceforge.net/site/download.jsp.

VDer is built on a Java HTML parser, which you can download from http://sourceforge.net/projects/binhgiang/files/htmlparser/HTMLParser2_Build9.zip/download. It is open source.
 
muthu bharathi
Ranch Hand
Posts: 97
Hi,

Thanks for your valuable guidance.

One thing I need to know: can it scrape HTTPS data?



--
With Thanks
M. bharathi
 