
Webpage Scraping in Java

 
Greenhorn
Hi There,

I am scratching my head with this.

I want to get the list of agencies from this webpage:

[SOME WEBSITE DELETED]

I only want to use Java. How can I achieve this?

Thanks
dexter
 
Bartender
What are you actually trying to accomplish? You need to figure out if page scraping is the best approach to whatever that is.

If it turns out the answer to that question is "yes," then you need to check out the terms of use of that site and see if they actually permit that. If not, you'll have to find a different approach.
 
author

Sorry for deleting the reference to the website -- there was no reason to use the example, and it felt like spam with it.

Henry
 
Marshal
Scraping is so 1990s. If the website wants to allow you to grab their data, they'll have provided an API to do so -- a web service, JavaScript API, Java API, and so on.

If not, then you'd be stealing content and that is not something that CodeRanch can help you with. It's unethical and illegal in some areas.
 
Karthik Sanghvi
Greenhorn
Well, I want to scrape this website for agent phone numbers and email addresses using Java.

The terms and conditions of this website clearly say, in point 8 (Copyright), sub-section 8.2: "You may download information from this Website for your own personal use only."
They also say that I acknowledge that I do not acquire any ownership rights by downloading copyright material.

Basically, I want to try this task using Java. I looked at the page source of this website and it uses JSDL.
Will that help me in any way?


Thanks
 
Bear Bibeault
Marshal
Do you intend to use this for personal use only?
 
Karthik Sanghvi
Greenhorn
Absolutely YES!

I am a Java beginner and want to achieve this task with Java.

Thanks
 
Author and all-around good cowpoke
The basic tool for screen scraping is the HttpURLConnection in the java.net package.

With the URL for the target page, you can open a connection and call getInputStream() to get a readable stream of bytes, which you can wrap in an InputStreamReader to read as characters.

Now, in the old days, when HTML pages were simple, that was enough. Now pages may be made up of many parts from various places, so the very first thing to do is use a browser with a developer plugin (say, Firefox with the Firebug plugin) to examine the makeup of the page you want to scrape.

Hopefully it will turn out to be simple but no guarantee.
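
The HttpURLConnection approach described above can be sketched roughly as follows. This is a minimal illustration, not a complete scraper: the class name PageFetcher, the helper charsetOf, the timeout values, and the placeholder address https://example.com/ are all assumptions made for the example, and falling back to UTF-8 when the server does not declare a charset is a simplification.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of any library.
class PageFetcher {

    // Picks the charset named in a Content-Type header such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Opens a connection to the given address and returns the response body as text.
    public static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // illustrative timeouts, in milliseconds
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            Charset charset = charsetOf(conn.getContentType());
            StringBuilder page = new StringBuilder();
            // getInputStream() yields bytes; the reader decodes them to characters.
            try (Reader reader = new InputStreamReader(conn.getInputStream(), charset)) {
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    page.append(buffer, 0, n);
                }
            }
            return page.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a page you are permitted to fetch.
        String address = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(fetch(address));
    }
}
```

Note that this only returns the raw HTML the server sends. Anything the page loads later via JavaScript will not be in it, which is why inspecting the page in a browser first matters.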

Bill

 
Henry Wong
author

Karthik Sanghvi wrote: Absolutely YES! I am a Java beginner and want to achieve this task with Java.



I know people are going to recommend third-party open-source packages for this... so, I am going to mention the java.net.URL class, which is built into Java. (EDIT: looks like William recommended the same thing, just minutes earlier (darn!!))

Many years ago, I used it to do simple scraping. And after many years of adding features, GET/POST calls, SSL support, etc. etc. etc., I found that I could do everything I needed with it. So, if you don't want to add a jar file (open source or otherwise), the URL class works well.
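
As a rough sketch of what simple scraping with nothing but java.net.URL might look like, the example below downloads a page with URL.openStream() and pulls out its title with a regular expression. The class name TitleScraper, the method names, the UTF-8 decoding, and the placeholder address https://example.com/ are assumptions made for illustration. A regular expression is a fragile way to read HTML and is used here only to keep the example free of third-party jars.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of any library.
class TitleScraper {

    // Matches the first <title>...</title> element, ignoring case and line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if there is no title element.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Downloads the resource behind the URL as text, assuming it is UTF-8 encoded.
    public static String download(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = url.openStream();
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a page you are permitted to fetch.
        URL url = URI.create(args.length > 0 ? args[0] : "https://example.com/").toURL();
        System.out.println(extractTitle(download(url)).orElse("(no <title> found)"));
    }
}
```

Getting at other parts of a page follows the same pattern: download the text, then locate the markup that surrounds the data you want. For anything beyond trivial pages, a real HTML parser is likely to be more robust than regular expressions.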

Henry
 
Jeff Verdegan
Bartender

Karthik Sanghvi wrote: Well, I want to scrape this website for agent phone numbers and email addresses using Java.



And that can't be accomplished just as well or better some other way? Such as just viewing the site in a browser, or using an API or WebService that they may have provided, as mentioned already?

Karthik Sanghvi wrote: The terms and conditions of this website clearly say, in point 8 (Copyright), sub-section 8.2: "You may download information from this Website for your own personal use only."
They also say that I acknowledge that I do not acquire any ownership rights by downloading copyright material.



That still doesn't mean it's okay for you to scrape it. That just basically says that you can't redistribute the content and that you don't own it. It doesn't mean that all forms of personal use are okay, only that all other forms of use are not okay.
 
Karthik Sanghvi
Greenhorn
This particular program scrapes the title of the webpage. How can I get to the agent name, etc.?

 
Karthik Sanghvi
Greenhorn
OK Jeff,

If we still have privacy issues here, I will scrap the idea of scraping this particular web page.
The main purpose of my question is to learn how I can go about achieving something like this.

Thanks
 
Rancher
Check out the HttpUnit and jWebUnit libraries for far more convenient programmatic web site processing than the previously suggested approaches (these are the libraries Henry alluded to :-)
 
Jeff Verdegan
Bartender

Karthik Sanghvi wrote: If we still have privacy issues here, I will scrap the idea of scraping this particular web page.



I'm not saying scrap it, and I'm not saying they won't allow you to do it. I'm just saying that what you provide doesn't show that they do allow it, and it's up to you to do the research needed to find out for sure.

Karthik Sanghvi wrote: The main purpose of my question is to learn how I can go about achieving something like this.



Well, that's kind of a step backwards then. With a broad, vague requirement such as "something like this," the answers for general technical approaches have already been provided--use the API or webservice provided by the site if possible, otherwise use one of the libraries suggested.
 