Webpage Scraping in Java

Karthik Sanghvi
Greenhorn

Joined: Nov 06, 2006
Posts: 9
Hi There,

I am scratching my head with this.

I want to get a list of agencies from this webpage:

[SOME WEBSITE DELETED]

I only want to use Java. How can I achieve this?

Thanks
dexter
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

What are you actually trying to accomplish? You need to figure out if page scraping is the best approach to whatever that is.

If it turns out the answer to that question is "yes," then you need to check out the terms of use of that site and see if they actually permit that. If not, you'll have to find a different approach.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18917


Sorry for deleting the reference to the website -- there was no reason to use the example, and it felt like spam with it.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61458

Scraping is so 1990s. If the website wants to allow you to grab their data, they'll have provided an API to do so -- a web service, JavaScript API, Java API, and so on.

If not, then you'd be stealing content and that is not something that CodeRanch can help you with. It's unethical and illegal in some areas.


Karthik Sanghvi
Greenhorn

Joined: Nov 06, 2006
Posts: 9
Well, I want to scrape this website for agent phone numbers and email addresses using Java.

The terms and conditions of this website clearly say, in point 8 (Copyright), sub-section 8.2: "You may download information from this Website for your own personal use only."
They also say that I acknowledge that I do not acquire any ownership rights by downloading copyright material.

Basically, I want to try this task using Java. I saw the page source of this website and it uses JSDL.
Will that help me in any way?


Thanks
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61458

Do you intend to use this for personal use only?
Karthik Sanghvi
Greenhorn

Joined: Nov 06, 2006
Posts: 9
Absolutely YES!

I am a Java beginner and want to achieve this task with Java.

Thanks
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12809
The basic tool for screen scraping is the HttpURLConnection in the java.net package.

With the URL for the target page, you can open a connection and call getInputStream() to get a readable stream of the page's bytes, which you can wrap in an InputStreamReader to read as characters.
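
As a rough illustration of that approach, here is a minimal sketch using only java.net and java.io. The class and method names (PageFetcher, fetch, readAll), the 10-second timeouts, and the choice of UTF-8 are assumptions made for this example; a real page may declare a different encoding in its Content-Type header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class PageFetcher {

    /** Reads the whole stream into a String, decoding bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Issues a GET request and returns the response body as text (UTF-8 is assumed). */
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary values for the sketch
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The result is just the raw HTML as one String; pulling specific data out of it is a separate step.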

Now, in the old days, when HTML pages were simple, that was enough. Today, pages may be assembled from many parts loaded from various places, so the very first thing to do is examine the makeup of the page you want to scrape in a browser with a developer plugin, for example Firefox with the Firebug plugin.

Hopefully it will turn out to be simple, but there is no guarantee.
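
Once that inspection shows what markup surrounds the data, the simplest case can be handled with pattern matching on the fetched text. The sketch below is only an illustration under assumed markup: the class name SimpleExtractor, its method names, and the sample HTML with mailto: links are all made up for the example, and regular expressions are brittle against real-world HTML, so a proper HTML parser is usually the safer choice for anything beyond a quick experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class SimpleExtractor {

    // Text between <title> and </title>, ignoring case and allowing line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Address part of links such as <a href="mailto:someone@example.com">.
    private static final Pattern MAILTO = Pattern.compile(
            "href\\s*=\\s*[\"']mailto:([^\"'?]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed page title, or null if no title element is found. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Returns every address found in a mailto: link, in document order. */
    static List<String> extractMailtoAddresses(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup; a real page will look different.
        String html = "<html><head><title> Agencies </title></head><body>"
                + "<a href=\"mailto:agent.one@example.com\">Agent One</a>"
                + "<a href='mailto:agent.two@example.com?subject=Hi'>Agent Two</a>"
                + "</body></html>";
        System.out.println(extractTitle(html));
        System.out.println(extractMailtoAddresses(html));
    }
}
```

The patterns would need to be adapted to whatever structure the developer tools actually reveal on the target page.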

Bill

Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18917

Karthik Sanghvi wrote: Absolutely YES!

I am a Java beginner and want to achieve this task with Java.


I know people are going to recommend third-party open-source packages for this... so, I am going to mention the java.net.URL class, which is built into Java. (EDIT: looks like William recommended the same thing, just minutes earlier (darn!!))

Many years ago, I used it to do simple scraping. And after many years of adding features (GET/POST calls, SSL support, etc.), I find that I can do everything with it. So, if you don't want to add a jar file (open source or otherwise), the URL class works well.
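
For the POST case mentioned above, a sketch along these lines is one possibility using only the JDK. The names FormPoster, encodeForm, and post, and the sample form fields, are hypothetical and chosen for illustration; the connection returned by URL.openConnection() for an http or https URL is cast to HttpURLConnection to send a form-encoded body.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration only.
final class FormPoster {

    /** Builds an application/x-www-form-urlencoded body such as "q=real+estate&area=a%26b". */
    static String encodeForm(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    /** Sends the parameters as a form POST and returns the HTTP status code. */
    static int post(String targetUrl, Map<String, String> params) throws IOException {
        byte[] body = encodeForm(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(targetUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // required before writing a request body
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded; charset=UTF-8");
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up form fields; prints the encoded body without contacting any server.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "real estate");
        params.put("area", "a&b");
        System.out.println(encodeForm(params));
    }
}
```

The response body can be read from conn.getInputStream() in the same way as for a GET request.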

Henry
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Karthik Sanghvi wrote: Well, I want to scrape this website for agent phone numbers and email addresses using Java.


And that can't be accomplished just as well or better some other way? Such as just viewing the site in a browser, or using an API or WebService that they may have provided, as mentioned already?

The terms and conditions of this website clearly say, in point 8 (Copyright), sub-section 8.2: "You may download information from this Website for your own personal use only."
They also say that I acknowledge that I do not acquire any ownership rights by downloading copyright material.


That still doesn't mean it's okay for you to scrape it. That just basically says that you can't redistribute the content and that you don't own it. It doesn't mean that all forms of personal use are okay, only that all other forms of use are not okay.
Karthik Sanghvi
Greenhorn

Joined: Nov 06, 2006
Posts: 9
This particular program scrapes the title of the webpage. How can I get to the agent name, etc.?

Karthik Sanghvi
Greenhorn

Joined: Nov 06, 2006
Posts: 9
OK Jeff,

If we still have privacy issues here, I will scrap the idea of scraping this particular web page.
The main purpose of my question is to find out how I can go further to achieve something like this.

Thanks
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42374
Check out the HttpUnit and jWebUnit libraries for far more convenient programmatic web site processing than the previously suggested approaches (these are the libraries Henry alluded to :-)


Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109

Karthik Sanghvi wrote:
If we still have privacy issues here, I will scrap the idea of scraping this particular web page.


I'm not saying scrap it, and I'm not saying they won't allow you to do it. I'm just saying that what you provide doesn't show that they do allow it, and it's up to you to do the research needed to find out for sure.

The main purpose of my question is to find out how I can go further to achieve something like this.


Well, that's kind of a step backwards then. With a broad, vague requirement such as "something like this," the answers for general technical approaches have already been provided: use the API or web service provided by the site if possible, otherwise use one of the libraries suggested.
 