Screen Scrapping Problem!

t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
Afternoon,

I'm currently having a problem with a screen scraping project. Here's my dilemma: my program executes perfectly and I can scrape all the HTML off of a website. However,
I'm attempting to sort through the HTML and pull out JUST the game scores. I'm thinking that I need to use some type of String class method? Here's my program so far:




The website I'm attempting to scrape is http://www.scores.com

The HTML Tags that I want look something like this:


John Jai
Bartender

Joined: May 31, 2011
Posts: 1776

The above piece represents a single unit of team names, winner, games played and total. As you can see, the teams and the winner are grouped using an id that differs only in its starting letters: as1-ncaab-201111220287 and hs1-ncaab-201111220287. This way you can map a team to its winner.

When you read a line, say <td class="teams">UCLA Bruins</td>, you can do a string match for "<td class="teams">" and store UCLA Bruins as a team, and when you encounter a match for "<td class="teams winner">" with the corresponding id (hs1-ncaab-201111220287) you can store the winner.
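
A minimal sketch of the string match described above, assuming each table cell sits on its own line and uses exactly the class attributes quoted in the post. The class name `CellKind` and its `classify` method are hypothetical, and the winner name in the sample input is made up for illustration.

```java
// Hypothetical helper: decides what kind of cell a line of HTML holds.
class CellKind {
    enum Kind { TEAM, WINNER, OTHER }

    static Kind classify(String line) {
        // Test the longer marker first so a winner cell is not reported as a plain team cell.
        if (line.contains("<td class=\"teams winner\">")) {
            return Kind.WINNER;
        }
        if (line.contains("<td class=\"teams\">")) {
            return Kind.TEAM;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<td class=\"teams\">UCLA Bruins</td>",
            "<td class=\"teams winner\">Sample Winner</td>", // made-up sample data
            "<td>5</td>"
        };
        for (String line : lines) {
            System.out.println(classify(line) + " <- " + line);
        }
    }
}
```

Once a line is classified, the text inside the cell still has to be cut out of it with String methods or a regex.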
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39409
    
And welcome to the Ranch
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
I'd use a library like HtmlUnit or jWebUnit for this.
Brian Burress
Ranch Hand

Joined: Jun 30, 2003
Posts: 122
You may be able to do this with the other tools mentioned too, but I suggest you look into XSLT, specifically XPath. I have done a number of similar things, scraping data from pages using Selenium WebDriver and XPath. Selenium directly allows XPath as well as other approaches for pulling data off pages.

I think the learning curve on XSLT/XPath may be a little steep, so if this is all you are going to do and HtmlUnit meets your needs, stick with it. If you plan on doing more intensive scraping, or maybe even scripting from page to page, then spending some time in the XSLT arena may be worth it.
Mohamed Sanaulla
Saloon Keeper

Joined: Sep 08, 2007
Posts: 3071
    

I think the post would sit better in this forum, hence moving it. Please continue the discussion.


t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
@John Jai

I like the way you are going with this, however, I am very unfamiliar with using the String class to search like you mentioned.
I've been trying to do some research and I came across Regular Expressions, but to no avail. Thanks again.

** I'm supposed to do this manually, without the extra HTML parser programs and what not. **
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
If you keep manipulating the Strings, you might end up with an error-prone solution. Converting the HTML to XML might be safer.

Anyway, the code below might help you take the data between a start and an end tag, say the team name "UCLA Bruins" in the line <td class="teams">UCLA Bruins</td>. You have to do a String comparison to check which information is present in the currently parsed line.

Suppose the following are the contents of the text file test.txt:


The code below will parse it and give you the information between the tags. Note that a simple regex is used.
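
A minimal sketch of the kind of regex extraction described here, not the listing from the original post. It assumes the team cells look exactly like <td class="teams">UCLA Bruins</td> and works on a String instead of reading test.txt; the class name `TagTextExtractor` and its `extract` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text out of every <td class="teams"> cell.
class TagTextExtractor {
    // The non-greedy group (.*?) captures only the text up to the nearest closing tag.
    private static final Pattern TEAM_CELL =
            Pattern.compile("<td class=\"teams\">(.*?)</td>");

    static List<String> extract(String text) {
        List<String> names = new ArrayList<>();
        Matcher matcher = TEAM_CELL.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<td class=\"teams\">UCLA Bruins</td>";
        System.out.println(extract(line));
    }
}
```

The pattern is deliberately narrow: it only fits this one cell layout, and it does not cope with nested tags or cells that span several lines.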



Sooner or later, when you have your solution, you will have to store the team information. Try to use a separate class for that purpose instead of multiple strings. Below is a sample information storage class (add getters / setters). Create different objects and store the information in them.
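
A minimal sketch of such a storage class, not the one from the original post. The field names are assumptions based on the data mentioned earlier in the thread (team name, winner, games played, total), and the class name `TeamInfo` is hypothetical.

```java
// Hypothetical storage class: one object per team row.
class TeamInfo {
    private String name;
    private boolean winner;
    private int gamesPlayed;
    private int total;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public boolean isWinner() { return winner; }
    public void setWinner(boolean winner) { this.winner = winner; }

    public int getGamesPlayed() { return gamesPlayed; }
    public void setGamesPlayed(int gamesPlayed) { this.gamesPlayed = gamesPlayed; }

    public int getTotal() { return total; }
    public void setTotal(int total) { this.total = total; }

    @Override
    public String toString() {
        return name + (winner ? " (winner)" : "")
                + ", games played: " + gamesPlayed + ", total: " + total;
    }

    public static void main(String[] args) {
        TeamInfo team = new TeamInfo();
        team.setName("UCLA Bruins");
        team.setGamesPlayed(5); // made-up sample value
        System.out.println(team);
    }
}
```

Keeping one object per team means the parsing code only has to decide which setter to call for each value it reads.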
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    

John Jai wrote:Converting HTML to XML might be safer.

I think that's the way I would go too. Regexes are powerful, but not generally the best choice for hierarchical parsing.

@t dav: You might want to have a look at JTidy, which is a Java port of the old chestnut HtmlTidy that converts HTML to XHTML (well-formed, and therefore suitable for most parsers). In fact I believe it has its own parser built in, although I haven't used it myself.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
@John Jai, Thank you, I'll be trying this very soon,

Also, I understand that this will get the team names, but how do I get the further information like the scores and what not?
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
When you parse a team's information you can set a boolean to note that a team is being parsed. When you then hit a <td> tag with an integer value, you can store it as the games played.

Similarly, when you parse a winner, you can set a boolean to note that a winner is being parsed. When you hit a <td> beneath a winner, you store it in the winner's games played data.

It amounts to flipping the booleans according to what information is being read.

That will be tedious, so take some time to convert the HTML into XML first.
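
A minimal sketch of the boolean-flag idea for the team case, assuming one cell per line and that the first plain <td> holding an integer after a team cell is that team's games played. The class name `FlagParser`, its methods and the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical line-by-line parser that tracks what it is currently reading.
class FlagParser {
    static Map<String, Integer> gamesPlayedByTeam(String[] lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        boolean parsingTeam = false; // true while we wait for the team's number
        String currentTeam = null;
        for (String line : lines) {
            String cell = line.trim();
            if (cell.startsWith("<td class=\"teams")) {
                // Matches both class="teams" and class="teams winner".
                currentTeam = textOf(cell);
                parsingTeam = true;
            } else if (parsingTeam && cell.startsWith("<td>")) {
                String value = textOf(cell);
                if (value.matches("\\d+")) {
                    result.put(currentTeam, Integer.parseInt(value));
                    parsingTeam = false; // flip the flag once the value is stored
                }
            }
        }
        return result;
    }

    // Text between the first '>' and the following '<' (or the end of the line).
    private static String textOf(String cell) {
        int start = cell.indexOf('>') + 1;
        int end = cell.indexOf('<', start);
        return (end < 0 ? cell.substring(start) : cell.substring(start, end)).trim();
    }

    public static void main(String[] args) {
        String[] lines = {
            "<td class=\"teams\">UCLA Bruins</td>",
            "<td>5</td>" // made-up sample value
        };
        System.out.println(gamesPlayedByTeam(lines));
    }
}
```

Every extra column needs another flag and another branch, which is where the tedium mentioned above comes from.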
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
... which is why I still recommend using a library that does all this for you.
t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
I am not able to figure this one out, I've been messing with it for a while now. Any help?

Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    

t dav wrote:I am not able to figure this one out, I've been messing with it for a while now. Any help?

Well, you haven't given us much to go on (check out the ItDoesntWorkIsUseless page), but your 'lineScanner' implementation, specifically the
lineScanner.useDelimiter("<(.*?)>");
bit seems a bit heavyweight to me for what you need (mind you, I loathe Scanner, so I'm not the best to judge).

You already know that your line contains the team name, and you also know that it's between the ">" that ends the 'td' tag and the next "<", so why not just use regular String methods, viz:

I really fear you're getting a bit bogged down in these regexes. Sometimes the simplest is the best.
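
A minimal sketch of the plain String-method approach described in this post, not the original listing. It assumes the line starts with the opening td tag, so the first ">" is the one that ends it; the class name `BetweenTags` and the method `teamName` are hypothetical.

```java
// Hypothetical helper using only indexOf() and substring().
class BetweenTags {
    static String teamName(String line) {
        int start = line.indexOf('>') + 1;   // just past the '>' that ends the td tag
        int end = line.indexOf('<', start);  // the next '<' starts the closing tag
        if (start == 0 || end < 0) {
            throw new IllegalArgumentException("Not a complete table cell: " + line);
        }
        return line.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(teamName("<td class=\"teams\">UCLA Bruins</td>"));
    }
}
```

Note that the method name is substring(), all lower case.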

Winston
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
I just have to ask: why are you so set on programming all this tedious stuff by hand instead of relying on a ready-to-use solution like the ones I suggested? Coding something like this is a bit masochistic if you ask me.
t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
Well, I have been trying to read in the specific information: team names, scores, ranks, etc. However, all I'm getting is the
complete HTML code. Specifically, I'm stuck on figuring out how to single out each name and so on and save them.

As for what Winston just posted, could you help explain what that does a bit?

Also, the code is giving this error:

t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
@Tim, I see you suggested HTMLUnit, have you used this for something similar?
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
The ones I mentioned in my first post.

Check the javadocs of the String class for the correct spelling of the "subString" method that you're trying to use.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    

t dav wrote:Also, the code is giving this error:

Ooops. Apologies; should be substring().

Winston
t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
So I changed the website that I am scraping; the new website is: http://www.vegasinsider.com/top-betting-trends/

I'm scraping the first chart you see.

Here's what I have so far:




I'm running into some runtime errors, stating

Exception in thread "main" java.lang.NullPointerException
at WebGUI.<init>(WebGUI.java:75)
at WebGUI.main(WebGUI.java:225)

Any opinions?
John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Some reference on the specified line is pointing to null. Check the code at line 75.

Never catch NullPointerException like this:



Instead, check for null in the if condition.
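
A minimal sketch of checking for null instead of catching the exception. It is not taken from the WebGUI code in this thread: the class name `NullCheckExample` and its method are hypothetical, and the example relies only on the fact that BufferedReader.readLine() returns null at the end of the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example: test for null up front rather than catching NullPointerException.
class NullCheckExample {
    static int countNonBlankLines(BufferedReader reader) {
        int count = 0;
        try {
            String line;
            // readLine() returns null when there is nothing left to read.
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("first\n\nsecond\n"));
        System.out.println(countNonBlankLines(reader));
    }
}
```

The null test makes the end-of-input case explicit, so a NullPointerException that does show up points at a real bug instead of being silently swallowed.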

Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    

t dav wrote:So I changed the website that I am scraping off of, the new website is: http://www.vegasinsider.com/top-betting-trends/...

Hunh? You haven't even got your code working and you've already changed the site?

I'm with Tim here. The HTML for these sites is simply too complicated and too varied to be trying to pull out specific pieces of data without some sort of parser (and it's not likely to be very simple even then).
Your previous code at least had a fairly specific string (TEAMS_COLUMN) that you could rely on, but now you're just looking for ">1<". I suspect that's a non-starter, and will give you a ton of false hits.

You're also mixing your screen scraping code with your GUI. DON'T.

Write a program/class that can successfully scrape a site and display results without any Swing code at all. Once you've got that working, then add the GUI stuff.

Also, I think you need to write down a procedure for scraping a screen on paper. Right now, you're just coding like mad, and dealing with errors as you get them.
I call that "gorilla programming" (...problem...code...ugh...), otherwise known as the Jean-Paul Sartre methodology, and it is NOT the way to become a successful programmer.

Winston
t dav
Greenhorn

Joined: Nov 22, 2011
Posts: 8
I searched through the HTML Code for the page, the ">1<" is THE only one and it begins exactly at the info that I need to get.
So what I thought would be right would be to use a string array and substrings and go down each line and copy exactly what I need.

Seeing as how the HTML code is the same for what I need, I thought this would work.
Brian Burress
Ranch Hand

Joined: Jun 30, 2003
Posts: 122
I'll re-suggest that you spend some time looking at Selenium and XPath. Selenium would let you pull out all elements by XPath, where the XPath is defined to pull all td elements with a class whose value is 'teams'. I think the XPath expression would look something like "//td[contains(@class,'teams')]" (forgive me if it is not an exact match to what you need; I am not trying it out on the website you mention, just pseudo coding it).

Overall, using Selenium WebDriver, you are talking about a few lines of code to access the page and pull the elements:
- use the driver class's get method to open the URL you want to work with
- execute the findElements method of the driver and pass it a By.xpath("") expression with the XPath in the quotes.

From there, iterate, inspect, and explore the elements to start mining the data you want.

The XPath syntax does have a learning curve. I have used it along with Selenium to parse pages similar to, and even more complex than, what you are doing.
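
The XPath expression itself can be tried without Selenium. A minimal sketch using the javax.xml.xpath classes that ship with the JDK, assuming the input is well-formed (for example HTML already converted to XHTML). The class name `XPathTeams`, its method and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: runs the XPath from the post against well-formed markup.
class XPathTeams {
    static List<String> teams(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Selects every td whose class attribute contains "teams".
            NodeList cells = (NodeList) xpath.evaluate(
                    "//td[contains(@class,'teams')]",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                names.add(cells.item(i).getTextContent().trim());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the input", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<table><tr>"
                + "<td class=\"teams\">UCLA Bruins</td>"
                + "<td>5</td>" // made-up sample value
                + "<td class=\"teams winner\">Sample Winner</td>"
                + "</tr></table>";
        System.out.println(teams(sample));
    }
}
```

Because contains() is used, the expression picks up both the "teams" and the "teams winner" cells; real-world HTML that is not well-formed would have to be cleaned up first or handed to a tool such as Selenium.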
