
Problem with fetching Web pages

 
David Muller
Greenhorn
Posts: 2
So I made a simple Web crawler which works very well until, at some point after fetching a few hundred pages, it throws an OutOfMemoryError. More specifically, it's the next() method of a Scanner object that throws it. I've tried everything from forcing garbage collection to shaking my laptop pretty hard, but I just couldn't figure it out. I'm sure it's something pretty stupid.
I would be immensely thankful if someone could help me with this, and I'd owe that person a virtual beer.

Here's the piece of code that doesn't work:
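(The original snippet did not survive the copy. As a stand-in, here is a minimal sketch of the kind of fetch method described, assuming the common Scanner idiom of slurping the whole response with a "\\A" delimiter; the class and method names are hypothetical, not the poster's code. With that idiom, a single huge or never-ending response makes next() grow its buffer until the heap is exhausted.)

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public class PageFetcher {
    // Hypothetical sketch: fetch one page as a String using Scanner.
    // "\\A" matches the beginning of input, so next() returns the whole
    // stream as one token; a huge or endless response means a huge token,
    // which is exactly where an OutOfMemoryError would surface.
    public static String fetch(String address) throws Exception {
        try (InputStream in = new URL(address).openStream();
             Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}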


 
David Newton
Author
Rancher
Posts: 12617
If you're keeping references to the objects with a "lot" of data, eventually you'll run out of memory--that's just the way it is. You could either clean up more than you currently are, allocate more memory to the JVM, or figure out ways to conserve memory while still retaining all the data you need.
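For what it's worth, raising the maximum heap is just a flag on the java command; the class name below is only a placeholder for whatever launches the crawler:

java -Xmx512m Crawler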
 
David Muller
Greenhorn
Posts: 2
Thanks for taking the time to reply.
My program doesn't store all the fetched pages, just a select few, and I encountered the same problem when I ran it without storing anything, so the problem is definitely in this method.
What exactly do you mean by cleaning up references, setting them to null when I'm done with them? Shouldn't this be done automatically by garbage collection?
 
David Newton
Author
Rancher
Posts: 12617
GC is non-deterministic, meaning it may or may not happen. It'd be unusual if it *didn't* happen when the JVM was running low, though. Setting references to null can help, but everything in that method is local, so it goes out of scope when the method ends.
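To illustrate that distinction (not code from the thread): a local reference becomes unreachable as soon as the method returns, but anything the method adds to a longer-lived structure stays reachable and keeps its memory.

import java.util.ArrayList;
import java.util.List;

public class CrawlState {
    // Illustrative sketch only. The local inside crawl() is eligible for GC
    // as soon as the method returns; what actually pins memory is the field
    // below, which keeps every page reachable for the life of the object.
    private final List<String> keptPages = new ArrayList<String>();

    public void crawl(String pageBody) {
        String upper = pageBody.toUpperCase(); // local: collectable after return
        keptPages.add(upper);                  // retained: grows without bound
    }
}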

It's *possible* there are memory leaks in, say, the Scanner class... but I don't know how *probable* it is. I get nervous when you say "the problem is definitely in this method": how have you proven that? If you have code that does *nothing* but run this method, does the program still throw an OOME? How many URLs does it take before it blows up? If you run it with the same list of blow-uppy URLs, does it always blow up on the same one? Have you checked with VisualVM (bundled with the JDK since Java 6) to see if that helps identify what's keeping memory? Have you searched the web for Scanner memory leak bugs?
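A minimal harness along those lines might look like this (hypothetical names, reusing the PageFetcher sketch from earlier in the thread): it does nothing but run the fetch method over the same URL list and prints how far it gets before the error appears.

public class FetchOnly {
    public static void main(String[] args) throws Exception {
        // Pass the "blow-uppy" URL list on the command line and watch which
        // index it dies at, and whether it is always the same one.
        for (int i = 0; i < args.length; i++) {
            System.out.println(i + ": " + args[i]);
            PageFetcher.fetch(args[i]);   // result discarded on purpose
        }
        System.out.println("Finished without an OutOfMemoryError");
    }
}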
 