Scrape Stock Brokerage Site with Groovy?

 
Ranch Hand
Posts: 428
I'd like to write a little app that tallies the value of multiple brokerage accounts. I'm thinking of sites like E*Trade or Charles Schwab, where one might have an account for trading stocks.

Should I use the java.net.URL class or Apache's HttpClient?
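
For simple fetching, Groovy makes java.net.URL nearly a one-liner. A minimal sketch, with a placeholder URL rather than a real brokerage address:

// Fetch a page as a String; Groovy adds getText() to java.net.URL.
// The address is only a placeholder -- substitute the page you actually want.
def html = new URL('https://example.com/account/summary').text
println "Fetched ${html.length()} characters"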

Should I use java.net.PasswordAuthentication, as exemplified at http://www.java2s.com/Code/Java/Network-Protocol/javanetPasswordAuthenticationPasswordAuthenticationStringuserNamecharpassword.htm, or just pass the username and password as POST parameters?
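
If the site accepts a plain form login, the POST-parameter route with Apache HttpClient 4 might look roughly like the sketch below; the login URL and field names are made up and would have to match whatever the real login form uses:

@Grab('org.apache.httpcomponents:httpclient:4.5.13')
import org.apache.http.client.entity.UrlEncodedFormEntity
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClients
import org.apache.http.message.BasicNameValuePair
import org.apache.http.util.EntityUtils

// Hypothetical login endpoint and form field names -- adjust to the real form.
def client = HttpClients.createDefault()
def post = new HttpPost('https://broker.example.com/login')
post.entity = new UrlEncodedFormEntity([
    new BasicNameValuePair('username', 'siegfried'),
    new BasicNameValuePair('password', 'secret')
])
def response = client.execute(post)
println EntityUtils.toString(response.entity)   // page returned after the login attempt
client.close()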

I assume these issues are independent of the Java vs. Groovy choice.

Is there a nice GUI tool that will give me the XPath for a desired tidbit of data in the raw HTML that I scrape?

Thanks,
Siegfried
 
Bartender
Posts: 9626
Is what you propose permitted by the End User Agreements of the sites you plan to scrape?
 
Author and all-around good cowpoke
Posts: 13078
Back in the days before "web 2.0", when output like what you are describing was composed as a single HTML page, you could capture the response from a single URL and get content you could "screen scrape" (a term which dates back to mainframes and terminals).

That has not been the case for quite a while: these days what looks like a simple page may be composed from dozens of separate requests. I suggest you use something like the Firebug add-on for Firefox and take a close look at the captured conversation that builds the page.

Your sites may already provide SOAP or RESTful interfaces that return formatted data you can use directly. Services like Amazon and Google have been exposing such interfaces for years.
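
Where such an interface exists, consuming it is far less fragile than scraping HTML. A minimal Groovy sketch (Groovy 1.8+ for groovy.json), against a purely hypothetical JSON endpoint and field name:

import groovy.json.JsonSlurper

// Hypothetical REST endpoint that returns account positions as JSON.
def url = 'https://broker.example.com/api/accounts/12345/positions'
def positions = new JsonSlurper().parseText(new URL(url).text)

// Sum the market value of each position (the field name is assumed).
def total = positions.sum { it.marketValue }
println "Total account value: ${total}"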

Bill
 
Rancher
Posts: 43081
A library like HtmlUnit would be much easier to use than HttpClient or java.net.URL. It provides high-level methods for accessing page elements, including XPath.
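
A minimal HtmlUnit sketch in Groovy, assuming a hypothetical login form and positions table; the form name, field names, and XPath are placeholders you would replace with what you find by inspecting the real pages:

@Grab('net.sourceforge.htmlunit:htmlunit:2.70.0')
import com.gargoylesoftware.htmlunit.WebClient

def client = new WebClient()
client.options.cssEnabled = false   // only the HTML matters here

// Hypothetical URLs, form name, and field names -- match them to the real site.
def loginPage = client.getPage('https://broker.example.com/login')
def form = loginPage.getFormByName('loginForm')
form.getInputByName('username').setValueAttribute('siegfried')
form.getInputByName('password').setValueAttribute('secret')
def accountPage = form.getInputByName('submit').click()

// Pull out cells by XPath once you know the path to the data you want.
accountPage.getByXPath("//table[@id='positions']//td[@class='value']").each {
    println it.asNormalizedText()
}

client.close()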
 