Screen Scraping Problem!

 
t dav
Greenhorn
Posts: 8
Afternoon,

I'm currently having a problem with a screen scraping project. Here's my dilemma: my program runs fine and I can scrape all the HTML off a website. However,
I'm trying to sort through the HTML and pull out JUST the game scores. I'm thinking that I need to use some kind of String class method? Here's my program so far:




The website I'm attempting to scrape is http://www.scores.com

The HTML Tags that I want look something like this:


 
John Jai
Rancher
Posts: 1776

The above piece represents a single unit of team names, winner, games played and total. Notice that the teams and the winner are grouped using ids that differ only in the first letter: as1-ncaab-201111220287 and hs1-ncaab-201111220287. This way you can map a team to its winner.

When you read a line such as <td class="teams">UCLA Bruins</td>, you can do a string match for "<td class="teams">" and store UCLA Bruins as a team. When you encounter a match for "<td class="teams winner">" with the corresponding id (hs1-ncaab-201111220287), you can store the winner.
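
A minimal sketch of that string match, assuming each table cell sits on its own line. The two marker strings come from the post above; the class name, the enum and the method are made up for illustration.

```java
// Classifies a line of HTML by the marker strings mentioned in the post.
// LineClassifier, Kind and classify are hypothetical names.
final class LineClassifier {
    enum Kind { TEAM, WINNER, OTHER }

    static final String TEAM_MARKER = "<td class=\"teams\">";
    static final String WINNER_MARKER = "<td class=\"teams winner\">";

    static Kind classify(String line) {
        // The two markers cannot match the same cell: the closing quote
        // follows "teams" directly in one and "teams winner" in the other.
        if (line.contains(WINNER_MARKER)) {
            return Kind.WINNER;
        }
        if (line.contains(TEAM_MARKER)) {
            return Kind.TEAM;
        }
        return Kind.OTHER;
    }
}
```

Calling LineClassifier.classify on each line read from the page tells you whether the text in that cell should be stored as a team or as a winner.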
 
Campbell Ritchie
Sheriff
Posts: 48904
58
And welcome to the Ranch
 
Tim Moores
Bartender
Posts: 2789
38
I'd use a library like HtmlUnit or jWebUnit for this.
 
Brian Burress
Ranch Hand
Posts: 131
You may be able to do this with the other tools mentioned too, but I suggest you look into XSLT, specifically XPath. I have done a number of similar things, scraping data from pages using Selenium WebDriver and XPath. Selenium directly supports XPath as well as other approaches for pulling data off pages.

I think the learning curve for XSLT/XPath may be a little steep, so if this is all you are going to do and HtmlUnit meets your needs, stick with it. If you plan on doing more intensive scraping, or maybe even scripting from page to page, then spending some time in the XSLT arena may be worth it.
 
Mohamed Sanaulla
Saloon Keeper
Posts: 3159
33
Google App Engine Java Ruby
I think the post would better sit in this forum, hence moving it. Please continue the discussion.
 
t dav
Greenhorn
Posts: 8
@John Jai

I like the way you are going with this, however, I am very unfamiliar with using the String class to search like you mentioned.
I've been trying to do some research and I came across Regular Expressions, but to no avail. Thanks again.

** I'm supposed to do this manually, without the extra HTML parser programs and what not. **
 
John Jai
Rancher
Posts: 1776
If you keep manipulating Strings, you might end up with an error-prone solution. Converting the HTML to XML might be safer.

Anyway, the code below might help to extract the data between a start tag and an end tag, say the team name "UCLA Bruins" in the line <td class="teams">UCLA Bruins</td>. You have to do a String comparison to check which piece of information is present in the line currently being parsed.

Suppose the following are the contents of the text file test.txt:


The code below will parse it and give you the information between the tags. Note that a simple regex is used.
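
The code from the original post is not preserved here. As a rough sketch of the idea, and not necessarily what was posted, a regex can capture the text between an opening and a closing td tag on one line. The class name and the pattern are assumptions; reading test.txt is left out, and the lines could come from something like Files.readAllLines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text out of <td ...>text</td> cells.
final class TagTextExtractor {
    // [^>]* skips any attributes of the tag; ([^<]*) captures up to the next tag.
    private static final Pattern TD_TEXT = Pattern.compile("<td[^>]*>([^<]*)</td>");

    static List<String> extract(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TD_TEXT.matcher(line);
            while (m.find()) {
                result.add(m.group(1).trim());
            }
        }
        return result;
    }
}
```

This only works while the cell contains plain text; nested tags inside the td, or a cell split across lines, will not match, which is one reason the thread keeps recommending a real parser.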



Sooner or later, when you have your solution, you will have to store the team information. Try to use a separate class for that purpose instead of multiple strings. Below is a sample information storage class (add getters / setters). Create different objects and store the information in them.
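
The sample class itself did not survive in this copy of the thread. A storage class along these lines might look like the following sketch; the class name and the fields (name, winner, games played, total) are guesses based on the data described earlier in the thread.

```java
// Hypothetical holder for one team's data; field names are assumptions.
final class TeamInfo {
    private final String name;
    private final boolean winner;
    private int gamesPlayed;
    private int total;

    TeamInfo(String name, boolean winner) {
        this.name = name;
        this.winner = winner;
    }

    String getName() { return name; }
    boolean isWinner() { return winner; }

    int getGamesPlayed() { return gamesPlayed; }
    void setGamesPlayed(int gamesPlayed) { this.gamesPlayed = gamesPlayed; }

    int getTotal() { return total; }
    void setTotal(int total) { this.total = total; }
}
```

One TeamInfo object per team can then be collected in a List, which is easier to sort and display than several parallel String variables.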
 
Winston Gutkowski
Bartender
Pie
Posts: 10417
63
Eclipse IDE Hibernate Ubuntu
John Jai wrote:Converting HTML to XML might be safer.

I think that's the way I would go too. Regexes are powerful, but not generally the best choice for hierarchical parsing.

@t dav: You might want to have a look at JTidy, which is a Java port of the old chestnut HtmlTidy, which converts HTML to XHTML (well-formed, and therefore suitable for most parsers). In fact I believe it has its own parser built in, although I haven't used it myself.

Winston
 
t dav
Greenhorn
Posts: 8
@John Jai, Thank you, I'll be trying this very soon,

Also, I understand that this will get the team names, but how do I get the further information like the scores and what not?
 
John Jai
Rancher
Posts: 1776
When you parse a team's information, you can set a boolean to note that a team is being parsed. When you then hit a <td> tag with an integer value, you can store it as that team's games played.

Similarly, when you parse a winner, you can set a boolean to note that a winner is being parsed. When you hit a <td> beneath a winner, you store it in the winner's games played data.

It amounts to flipping the booleans according to what information is being read.

That will be tedious, so it is worth taking some time to convert the HTML into XML first.
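
A simplified sketch of the flag idea, assuming one table cell per line and using a single flag for both teams and winners rather than one flag each. The marker strings are the ones from earlier in the thread; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical flag-driven parser: remembers the last team seen and
// attaches the next integer-valued <td> cell to it.
final class FlagParser {
    static final String TEAM_MARKER = "<td class=\"teams\">";
    static final String WINNER_MARKER = "<td class=\"teams winner\">";

    static Map<String, Integer> gamesPlayedByTeam(List<String> lines) {
        Map<String, Integer> games = new LinkedHashMap<>();
        boolean awaitingNumber = false; // true once a team is read, until its number is found
        String currentTeam = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(TEAM_MARKER) || line.startsWith(WINNER_MARKER)) {
                currentTeam = cellText(line);
                awaitingNumber = true;
            } else if (awaitingNumber && line.startsWith("<td>")) {
                String text = cellText(line);
                if (text.matches("\\d+")) {
                    games.put(currentTeam, Integer.parseInt(text));
                    awaitingNumber = false; // flip the flag back
                }
            }
        }
        return games;
    }

    // Text between the first '>' and the following '<'.
    private static String cellText(String line) {
        int start = line.indexOf('>') + 1;
        int end = line.indexOf('<', start);
        return (end < 0 ? line.substring(start) : line.substring(start, end)).trim();
    }
}
```

Even in this small form the bookkeeping is easy to get wrong, which supports the point about converting to XML or using a library.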
 
Tim Moores
Bartender
Posts: 2789
38
... which is why I still recommend using a library that does all this for you.
 
t dav
Greenhorn
Posts: 8
I am not able to figure this one out, I've been messing with it for a while now. Any help?

 
Winston Gutkowski
Bartender
Pie
Posts: 10417
63
Eclipse IDE Hibernate Ubuntu
t dav wrote:I am not able to figure this one out, I've been messing with it for a while now. Any help?

Well, you haven't given us much to go on (check out the ItDoesntWorkIsUseless page), but your 'lineScanner' implementation, specifically the
lineScanner.useDelimiter("<(.*?)>");
bit seems a bit heavyweight to me for what you need (mind you, I loathe Scanner, so I'm not the best to judge).

You already know that your line contains the team name, and you also know that it's between the ">" that ends the 'td' tag and the next "<", so why not just use regular String methods, viz:

I really fear you're getting a bit bogged down in these regexes. Sometimes the simplest is the best.
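
The snippet Winston posted is not preserved in this copy. A sketch of the plain String approach he describes, using only indexOf and substring, could look like this; the class and method names are made up.

```java
// Hypothetical helper: text between the '>' that ends the opening tag
// and the next '<', found with plain String methods.
final class SimpleExtract {
    static String between(String line) {
        int start = line.indexOf('>');
        if (start < 0) {
            return null; // no tag end on this line
        }
        int end = line.indexOf('<', start + 1);
        if (end < 0) {
            return null; // no following tag start
        }
        // Note the spelling: substring, all lower case.
        return line.substring(start + 1, end).trim();
    }
}
```

For the line <td class="teams">UCLA Bruins</td> this returns "UCLA Bruins", and it returns null for a line that does not have that shape.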

Winston
 
Tim Moores
Bartender
Posts: 2789
38
I just have to ask: why are you so set on programming all this tedious stuff by hand instead of relying on a ready-to-use solution like the ones I suggested? Coding something like this is a bit masochistic if you ask me.
 
t dav
Greenhorn
Posts: 8
Well, I have been trying to read in the specific information: team names, scores, ranks, etc. However, all I'm getting is the
complete HTML source. Specifically, I'm stuck on figuring out how to single out each name and so on and save them.

As for what Winston just posted, could you help explain what that does a bit?

Also, the code is giving this error:

 
t dav
Greenhorn
Posts: 8
@Tim, I see you suggested HtmlUnit. Have you used it for something similar?
 
Tim Moores
Bartender
Posts: 2789
38
The ones I mentioned in my first post.

Check the javadocs of the String class for the correct spelling of the "subString" method that you're trying to use.
 
Winston Gutkowski
Bartender
Pie
Posts: 10417
63
Eclipse IDE Hibernate Ubuntu
t dav wrote:Also, the code is giving this error:

Ooops. Apologies; should be substring().

Winston
 
t dav
Greenhorn
Posts: 8
So I changed the website that I am scraping; the new website is: http://www.vegasinsider.com/top-betting-trends/

I'm scraping the first chart you see.

Here's what I have so far:




I'm running into some runtime errors, stating

Exception in thread "main" java.lang.NullPointerException
at WebGUI.<init>(WebGUI.java:75)
at WebGUI.main(WebGUI.java:225)

any opinions?
 
John Jai
Rancher
Posts: 1776
Some reference on the line with that number is null. Check the code on line 75.

Never catch a NullPointerException as in the code below:



Instead, check for null in an if condition.
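
The code on line 75 of WebGUI.java is not shown in this copy of the thread, so the following is only a generic sketch of the advice. It assumes a lookup that can return null, here Map.get, and tests the result instead of letting a NullPointerException happen and catching it. The class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical example of testing for null rather than catching the exception.
final class NullCheckExample {
    static int gamesPlayed(Map<String, Integer> games, String team) {
        Integer value = games.get(team); // null when the team is not in the map
        if (value == null) {
            return 0; // handle the missing case explicitly
        }
        return value;
    }
}
```

The explicit check documents which reference may be null, whereas a catch block around the whole section hides where the problem actually is.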

 
Winston Gutkowski
Bartender
Pie
Posts: 10417
63
Eclipse IDE Hibernate Ubuntu
t dav wrote:So I changed the website that I am scraping off of, the new website is: http://www.vegasinsider.com/top-betting-trends/...

Hunh? You haven't even got your code working and you've already changed the site?

I'm with Tim here. The HTML for these sites is simply too complicated and too varied to be trying to pull out specific pieces of data without some sort of parser (and it's not likely to be very simple even then).
Your previous code at least had a fairly specific string (TEAMS_COLUMN) that you could rely on, but now you're just looking for ">1<". I suspect that's a non-starter, and will give you a ton of false hits.

You're also mixing your screen scraping code with your GUI. DON'T.

Write a program/class that can successfully scrape a site and display results without any Swing code at all. Once you've got that working, then add the GUI stuff.

Also, I think you need to write down a procedure for scraping a screen on paper. Right now, you're just coding like mad, and dealing with errors as you get them.
I call that "gorilla programming" (...problem...code...ugh...) - otherwise known as the Jean-Paul Sartre methodology - and it is NOT the way to become a successful programmer.

Winston
 
t dav
Greenhorn
Posts: 8
I searched through the HTML code for the page; the ">1<" is THE only one, and it begins exactly at the info that I need to get.
So I thought the right approach would be to use a String array and substrings, go down each line, and copy exactly what I need.

Seeing as the HTML code has the same structure for everything I need, I thought this would work.
 
Brian Burress
Ranch Hand
Posts: 131
I'll re-suggest that you spend some time looking at Selenium and XPath. Selenium would let you pull out all elements by XPath, where the XPath is defined to select all td elements whose class contains the value 'teams'. I think the XPath expression would look something like "//td[contains(@class,'teams')]" (forgive me if it is not an exact match for what you need; I am not trying it out on the website you mention, just pseudo-coding it).

Overall, using Selenium WebDriver, you are talking about a few lines of code to access the page and pull the elements:
- use the driver class's get method to open the URL you want to work with
- call the driver's findElements method and pass it a By.xpath("") expression with the XPath in the quotes.

From there, iterate, inspect, and explore the elements to start mining the data you want.

The XPath syntax does have a learning curve. I have used it along with Selenium to parse pages similar to, and even more complex than, what you are working with.
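
To try the XPath expression itself without installing Selenium, here is a sketch that uses the JDK's built-in javax.xml.xpath classes instead. This is a swapped-in technique, not Selenium: it only works on well-formed markup (for example HTML already converted to XHTML), and the class name and the sample fragment in the usage note are made up. With Selenium, the same expression string would be passed to By.xpath as described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates the XPath from the post against well-formed markup.
final class XPathTeams {
    static List<String> teamNames(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Selects every td whose class attribute contains "teams",
            // which includes the "teams winner" cells as well.
            NodeList cells = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//td[contains(@class,'teams')]", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                names.add(cells.item(i).getTextContent().trim());
            }
            return names;
        } catch (Exception e) {
            // Parsing fails on markup that is not well-formed XML.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```

Given a fragment such as a table with one td of class "teams" and one of class "teams winner", teamNames returns the text of both cells in document order.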

 