Problem with fetching Web pages

David Muller
Greenhorn

Joined: Dec 08, 2009
Posts: 2
So I made a simple Web crawler which works very well until, after fetching a few hundred pages, it throws an OutOfMemoryError. More specifically, it's the next() method of a Scanner object that throws it. I've tried everything from forcing garbage collection to shaking my laptop pretty hard, but I couldn't figure it out. I'm sure it's something pretty stupid.
I would be immensely thankful if someone could help me with this; I'd buy that person a virtual beer.

Here's the piece of code that doesn't work:
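(A minimal sketch of the kind of Scanner-based fetch loop described above; the method and variable names are illustrative, not the original listing.)

    import java.net.URL;
    import java.util.Scanner;

    // Illustrative sketch: reads a page token-by-token with Scanner.
    // Note that Scanner.next() reads up to the next delimiter; if a
    // response contains no delimiter at all (binary data, one huge
    // unbroken token), the whole remainder gets buffered in memory.
    static String fetchPage(String address) throws Exception {
        Scanner in = new Scanner(new URL(address).openStream());
        try {
            StringBuilder page = new StringBuilder();
            while (in.hasNext()) {
                page.append(in.next()).append(' ');
            }
            return page.toString();
        } finally {
            in.close(); // also closes the underlying stream
        }
    }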


David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

If you're keeping references to the objects with a "lot" of data, eventually you'll run out of memory--that's just the way it is. You could either clean up more than you currently are, allocate more memory to the JVM, or figure out ways to conserve memory while still retaining all the data you need.
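For the second option, the maximum heap is set with the JVM's -Xmx flag at launch; for example, assuming the crawler's main class is called Crawler (the name is illustrative):

    java -Xmx512m Crawler

From inside the program, Runtime.getRuntime().maxMemory() reports the limit actually in effect.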
David Muller
Greenhorn

Joined: Dec 08, 2009
Posts: 2
Thanks for taking the time to reply.
My program doesn't store all the fetched pages, just a select few, and I encountered the same problem when I ran it without storing anything, so the problem is definitely in this method.
What exactly do you mean by cleaning up references, setting them to null when I'm done with them? Shouldn't this be done automatically by garbage collection?
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

GC is non-deterministic, meaning it may or may not happen at any given moment, but it'd be unusual if it *didn't* happen when the JVM was running low on memory. Setting references to null can help, but everything in that method is local, so it goes out of scope when the method ends.
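A quick illustration of the distinction (class and field names are made up): a local variable becomes unreachable when the method returns, but anything added to a long-lived field stays reachable, and the GC can never reclaim it.

    import java.util.ArrayList;
    import java.util.List;

    class CrawlerState {
        // Long-lived: everything added here stays reachable for as long
        // as this object does, so the GC can never reclaim it.
        private final List<String> savedPages = new ArrayList<String>();

        void visit(String page) {
            // Local: eligible for collection once visit() returns.
            String header = page.substring(0, Math.min(80, page.length()));
            System.out.println(header);
            savedPages.add(page); // retained: accumulates across calls
        }
    }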

It's *possible* there are memory leaks in, say, the Scanner class... but I don't know how *probable* that is. I get nervous when you say "the problem is definitely in this method": how have you proven that? If you have code that does *nothing* but run this method, does the program still throw an OOME? How many URLs does it take before it blows up? If you run it with the same list of blow-uppy URLs, does it always blow up on the same one? Have you checked with VisualVM (bundled with the JDK since Java 6) to see if it helps identify what's holding on to memory? Have you searched the web for Scanner memory leak bugs?
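To make "code that does *nothing* but run this method" concrete, an isolation harness might look like this (fetchPage is a stand-in for the actual suspect method):

    public class FetchOnly {
        public static void main(String[] args) throws Exception {
            // Pass the same list of blow-uppy URLs on the command line
            // so the failing index and URL are visible when it dies.
            for (int i = 0; i < args.length; i++) {
                System.out.println(i + ": " + args[i]);
                fetchPage(args[i]);
            }
        }

        // Placeholder: paste the real method under suspicion here.
        static void fetchPage(String address) throws Exception {
            // ...
        }
    }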
 