Scrape Stock Brokerage Site with Groovy?

 
Ranch Hand
Posts: 428
I'd like to write a little app that tallies the value of multiple brokerage accounts. I'm thinking of sites like E*Trade or Charles Schwab, where one might have an account for trading stocks.

Should I use the java.net.URL class or Apache's HttpClient?
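
For simple fetching, Groovy makes java.net.URL nearly a one-liner. A minimal sketch, with a placeholder URL rather than a real brokerage address:

// Fetch a page as a String; Groovy adds getText() to java.net.URL.
// The address is only a placeholder -- substitute the page you actually want.
def html = new URL('https://example.com/account/summary').text
println "Fetched ${html.length()} characters"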

Should I use java.net.PasswordAuthentication, as exemplified at http://www.java2s.com/Code/Java/Network-Protocol/javanetPasswordAuthenticationPasswordAuthenticationStringuserNamecharpassword.htm, or just pass the username and password as POST parameters?
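
If the site accepts a plain form login, the POST-parameter route with Apache HttpClient 4 might look roughly like the sketch below; the login URL and field names are made up and would have to match whatever the real login form uses:

@Grab('org.apache.httpcomponents:httpclient:4.5.13')
import org.apache.http.client.entity.UrlEncodedFormEntity
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClients
import org.apache.http.message.BasicNameValuePair
import org.apache.http.util.EntityUtils

// Hypothetical login endpoint and form field names -- adjust to the real form.
def client = HttpClients.createDefault()
def post = new HttpPost('https://broker.example.com/login')
post.entity = new UrlEncodedFormEntity([
    new BasicNameValuePair('username', 'siegfried'),
    new BasicNameValuePair('password', 'secret')
])
def response = client.execute(post)
println EntityUtils.toString(response.entity)   // page returned after the login attempt
client.close()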

I assume these issues are independent of the Java vs. Groovy choice.

Is there a nice GUI tool that will give me the XPath for a desired tidbit of data in the raw HTML that I scrape?

Thanks,
Siegfried
 
Bartender
Posts: 9626
Is what you propose permitted by the End User Agreements of the sites you plan to scrape?
 
Author and all-around good cowpoke
Posts: 13078
Back in the days before "web 2.0", when output like what you are describing was composed as a single HTML page, you could capture the response from a single URL and get content you could "screen scrape" (a term which dates back to mainframes and terminals).

That has not been the case for quite a while: these days what looks like a simple page may be composed from dozens of separate requests. I suggest you use something like the Firebug add-on for Firefox and take a close look at the captured conversation that builds the page.

Your sites may already provide SOAP or RESTful interfaces that return formatted data you can use directly. Services like Amazon and Google have been exposing such interfaces for years.
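
Where such an interface exists, consuming it is far less fragile than scraping HTML. A minimal Groovy sketch (Groovy 1.8+ for groovy.json), against a purely hypothetical JSON endpoint and field name:

import groovy.json.JsonSlurper

// Hypothetical REST endpoint that returns account positions as JSON.
def url = 'https://broker.example.com/api/accounts/12345/positions'
def positions = new JsonSlurper().parseText(new URL(url).text)

// Sum the market value of each position (the field name is assumed).
def total = positions.sum { it.marketValue }
println "Total account value: ${total}"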

Bill
 
Rancher
Posts: 43081
A library like HtmlUnit would be much easier to use than HttpClient or java.net.URL. It provides high-level methods for accessing page elements, including XPath.
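
A minimal HtmlUnit sketch in Groovy, assuming a hypothetical login form and positions table; the form name, field names, and XPath are placeholders you would replace with what you find by inspecting the real pages:

@Grab('net.sourceforge.htmlunit:htmlunit:2.70.0')
import com.gargoylesoftware.htmlunit.WebClient

def client = new WebClient()
client.options.cssEnabled = false   // only the HTML matters here

// Hypothetical URLs, form name, and field names -- match them to the real site.
def loginPage = client.getPage('https://broker.example.com/login')
def form = loginPage.getFormByName('loginForm')
form.getInputByName('username').setValueAttribute('siegfried')
form.getInputByName('password').setValueAttribute('secret')
def accountPage = form.getInputByName('submit').click()

// Pull out cells by XPath once you know the path to the data you want.
accountPage.getByXPath("//table[@id='positions']//td[@class='value']").each {
    println it.asNormalizedText()
}

client.close()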
 