
Webpage Scraping in Java

 
Karthik Sanghvi
Greenhorn
Posts: 9
Hi There,

I am scratching my head over this.

I want to get a list of agencies from this webpage:

[SOME WEBSITE DELETED]

I only want to use Java. How can I achieve this?

Thanks
dexter
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
What are you actually trying to accomplish? You need to figure out if page scraping is the best approach to whatever that is.

If it turns out the answer to that question is "yes," then you need to check out the terms of use of that site and see if they actually permit that. If not, you'll have to find a different approach.
 
Henry Wong
author
Marshal
Pie
Posts: 21115
78
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows

Sorry for deleting the reference to the website -- there was no reason to use the example, and it felt like spam with it.

Henry
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64830
86
IntelliJ IDE Java jQuery Mac Mac OS X
Scraping is so 1990s. If the website wants to allow you to grab their data, they'll have provided an API to do so -- a web service, JavaScript API, Java API, and so on.

If not, then you'd be stealing content and that is not something that CodeRanch can help you with. It's unethical and illegal in some areas.
 
Karthik Sanghvi
Greenhorn
Posts: 9
Well, I want to scrape this website for agent phone numbers and email addresses using Java.

The terms and conditions of this website clearly say in point
8. Copyright sub-section 8.2 You may download information from this Website for your own personal use only
And that I acknowledge that you do not acquire any ownership rights by downloading copyright material.

Basically, I want to try this task using Java. I looked at the page source of this website and it uses JSDL.
Will that help me in any way?


Thanks
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64830
86
IntelliJ IDE Java jQuery Mac Mac OS X
Do you intend to use this for personal use only?
 
Karthik Sanghvi
Greenhorn
Posts: 9
Absolutely YES!

I am a Java beginner and want to achieve this task with Java.

Thanks
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13061
6
The basic tool for screen scraping is the HttpURLConnection class in the java.net package.

With the URL for the target page you can open a connection and call getInputStream() to get a stream of the response bytes, which you can wrap in a reader to get characters.
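A minimal sketch of that approach, for illustration only. The class name PageFetcher, the method names, the 10-second timeouts, and the UTF-8 decoding are all assumptions made here, not anything from the thread; the address is taken from the command line, so no particular site is implied.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here for illustration only.
class PageFetcher {

    // Reads every character from the stream into one String,
    // decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Opens an HTTP GET connection and returns the response body as text.
    // Assumption: the page is UTF-8; a real page may declare another charset.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // assumed timeout values
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status: " + status);
            }
            return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Keeping readAll separate from fetch means the stream-reading part can be tried on any InputStream, for example one built from a saved copy of the page, before pointing the program at a live site.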

In the old days, when HTML pages were simple, that was enough. Today a page may be made up of many parts from various places, so the very first thing to do is examine the makeup of the page you want to scrape in a browser with a developer plugin, say Firefox with the Firebug plugin.

Hopefully it will turn out to be simple, but there is no guarantee.

Bill

 
Henry Wong
author
Marshal
Pie
Posts: 21115
78
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Karthik Sanghvi wrote:Absolutely YES!

I am a Java beginner and want to achieve this task with Java.


I know people are going to recommend third-party open-source packages for this... so, I am going to mention the java.net.URL class, which is built into Java. (EDIT: looks like William recommended the same thing, just minutes earlier (darn!!))

Many years ago, I used it to do simple scraping. And after many years of adding features, GET/POST calls, SSL support, etc. etc. etc., I find that I can do everything with it. So, if you don't want to add a jar file (open source or otherwise), the URL class works well.
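For the simplest case, a plain read of whatever sits behind a URL, a sketch with only the URL class might look like the following. The class name UrlReader and the UTF-8 decoding are assumptions made for illustration, and InputStream.readAllBytes() needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here for illustration only.
class UrlReader {

    // Reads the whole resource behind the URL and decodes it as text.
    // Assumption: the content is UTF-8.
    static String read(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UrlReader <url>");
            return;
        }
        System.out.println(read(URI.create(args[0]).toURL()));
    }
}
```

Because URL also understands file: addresses, the same method can be tried on a saved copy of a page without touching the network.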

Henry
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
Karthik Sanghvi wrote: Well, I want to scrape this website for agent phone numbers and email addresses using Java.


And that can't be accomplished just as well or better some other way? Such as just viewing the site in a browser, or using an API or WebService that they may have provided, as mentioned already?

The terms and conditions of this website clearly say in point
8. Copyright sub-section 8.2 You may download information from this Website for your own personal use only
And that I acknowledge that you do not acquire any ownership rights by downloading copyright material.


That still doesn't mean it's okay for you to scrape it. That just basically says that you can't redistribute the content and that you don't own it. It doesn't mean that all forms of personal use are okay, only that all other forms of use are not okay.
 
Karthik Sanghvi
Greenhorn
Posts: 9
This particular program scrapes the title of the webpage. How can I get to the agent name, etc.?
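A title-scraping step of the kind described could be sketched roughly as below. This is an illustration under assumptions, not the poster's program: the class name TitleScraper, the regular expression, and the sample HTML string are all made up here. A regular expression is workable for something as simple as the title, but for data buried deeper in the page it tends to be fragile, and an HTML parser is usually the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
class TitleScraper {

    // Case-insensitive, and DOTALL so a title that spans several lines still matches.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when the page has no <title> element.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample page, not taken from any real site.
        String html = "<html><head><title> Agency List </title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Agency List"
    }
}
```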

 
Karthik Sanghvi
Greenhorn
Posts: 9
OK Jeff,

If we still have privacy issues here, I will scrap the idea of scraping this particular website page.
The main purpose of my question is how I can go further to achieve something like this.

Thanks
 
Ulf Dittmer
Rancher
Posts: 42967
73
Check out the HttpUnit and jWebUnit libraries for far more convenient programmatic web site processing than the previously suggested approaches (these are the libraries Henry alluded to :-)
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
Karthik Sanghvi wrote:
If we still have privacy issues here, I will scrap the idea of scraping this particular website page.


I'm not saying scrap it, and I'm not saying they won't allow you to do it. I'm just saying that what you provide doesn't show that they do allow it, and it's up to you to do the research needed to find out for sure.

The main purpose of my question is how I can go further to achieve something like this.


Well, that's kind of a step backwards then. With a broad, vague requirement such as "something like this," the answers for general technical approaches have already been provided--use the API or webservice provided by the site if possible, otherwise use one of the libraries suggested.
 