my webcrawler throws: java.lang.OutOfMemoryError: Java heap space

olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
Hi,

I wrote a web crawler that downloads websites.
The problem is that the heap usage keeps growing and growing, and I don't know why.

The exception occurred in the following code block:


I'm using the crawler from http://andreas-hess.info/programming/webcrawler/index.html with some changes (a DB backend, some fixes where .substring() is used, etc.), but I still get this exception:
extracted) [4] [53963/919829]: http://de.docs.yahoo.com/copyright.html
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
Exception in thread "Thread-47" java.lang.OutOfMemoryError: Java heap space
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
(extracted) [4] [53963/919829]: http://info.yahoo.com/legal/de/yahoo/tos.html
(extracted) [4] [53963/919829]: http://de.docs.yahoo.com/sicherheitscenter/
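For reference, the kind of call that produces this trace looks roughly like the sketch below (reconstructed from the stack trace, not the crawler's actual source; the class, method and variable names are made up):

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.entity.BufferedHttpEntity;
    import org.apache.http.util.EntityUtils;

    class FetchSketch {
        // Fetch a page and buffer the complete body in memory.
        byte[] fetch(HttpClient client, String url) throws Exception {
            HttpResponse response = client.execute(new HttpGet(url));
            HttpEntity entity = response.getEntity();
            // BufferedHttpEntity copies the whole response body into a byte[]
            // (EntityUtils.toByteArray -> ByteArrayBuffer), so one very large
            // page means one very large array on the heap.
            HttpEntity buffered = new BufferedHttpEntity(entity);
            return EntityUtils.toByteArray(buffered);
        }
    }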


As you may have noticed, I already have about 1 million links, and I think it will grow to 10 million links and more, so I have to be careful with every object so my heap doesn't fill up.

Can someone tell me what I can do to get rid of this exception? I can't use the DefaultHttpClient as a static class because this is a highly multithreaded application.

Thanks in advance
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
Maybe you need to change your architecture to avoid accumulating links in memory. Given the rapidly multiplying nature of highly linked pages, a web crawler will always run out of memory if you try to keep all the links in memory while following them up.

How many sites do you want to "download" at one time?

Bill


olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
I'm already using PostgreSQL in the backend to store the downloaded links (it was a String Vector in the old version; the DB now works like a stack, with some extras such as a lookup to check whether a link has already been downloaded), and the files are downloaded to the hard disk. The recursion depth is 4, where (e.g.) yahoo.com is 0, all Yahoo subdomains are 1, all links from level 1 to another website are 2, and so on, so I expect about 10-20 million links in the first run. The next run would be Wikipedia, which should lead to far more links.

The only thing I hold in memory should be the ArrayList of domains like de.yahoo.com, de.shopping.yahoo.com, etc., but its size is only about 6000 when the exception occurs, so I'm pretty sure that's not the problem.

At the moment I have 15 threads.

I think the architecture itself should be OK:

-> download the link/website -> extract all links from that page and push them into the DB, marking the link as "downloaded" -> pop the next link from the DB and start again

I'm not sure what's happening behind the scenes in the HttpClient code block I posted, because that's where the crash occurred.
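In code, that cycle is roughly the following (a minimal sketch of the flow described above; queue, download and extractLinks are hypothetical stand-ins for the real classes and methods):

    // Sketch of the crawl cycle: pop a link, download it, push the newly
    // found links into the DB, mark the current one as done, repeat.
    String url = queue.pop();                 // next link from the DB-backed queue
    while (url != null) {
        byte[] page = download(url);          // fetch the page
        for (String link : extractLinks(page)) {
            queue.push(link);                 // store newly found links in the DB
        }
        queue.markDownloaded(url);            // flag the current link as done
        url = queue.pop();                    // continue with the next one
    }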
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
I would certainly try with fewer threads and work up.

How are you handling DB connections? Not properly disposing of DB-related objects is a big cause of running out of memory.
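For example, every Statement and ResultSet should be closed in a finally block (or with try-with-resources on Java 7+), otherwise the objects they hold onto never get released. A sketch, assuming a plain JDBC PreparedStatement against a links table (table and method names are illustrative):

    // Illustrative only: release JDBC objects even if an exception is thrown.
    boolean alreadyDownloaded(Connection connection, String url) throws SQLException {
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            ps = connection.prepareStatement("SELECT 1 FROM links WHERE url = ?");
            ps.setString(1, url);
            rs = ps.executeQuery();
            return rs.next();                 // true if the link is already known
        } finally {
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (ps != null) try { ps.close(); } catch (SQLException ignored) {}
        }
    }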

Bill
olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
I created the DB connection in the main class. It's a static field that is accessed by the synchronized methods push and pop.
I already tried it with one thread, but the same exception occurs. I'm pretty sure the problem is in the code block I mentioned.
olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
I just analyzed the heap and found out that it really is the code block I mentioned. The second number is the reserved size in KB on the heap:
org.apache.http.util.ByteArrayBuffer#34 70.591.976
byte[]#1691 70.591.960
org.apache.http.util.ByteArrayBuffer#31 28.677.736
byte[]#1689 28.677.720
ie.moguntia.webcrawler.URLQueue#1 2.085.694
java.util.LinkedList#3 2.085.578
java.util.LinkedList$Entry#3 2.085.558
org.apache.http.util.ByteArrayBuffer#21 1.048.608
byte[]#1633 1.048.592
java.util.ArrayList#1 868.308
java.lang.Object[]#410 868.288
org.apache.http.util.ByteArrayBuffer#18 524.320
org.apache.http.util.ByteArrayBuffer#15 524.320
org.apache.http.util.ByteArrayBuffer#7 524.320
org.apache.http.util.ByteArrayBuffer#4 524.320
byte[]#1627 524.304
byte[]#1621 524.304
byte[]#1609 524.304
byte[]#1608 524.304
org.apache.http.util.ByteArrayBuffer# 262.176


Any ideas how I could fix this?
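The large entries are the buffered response bodies. One possible way to keep them off the heap (just a sketch of an alternative, assuming an HttpResponse named response and a File named targetFile) is to stream the entity to disk with a small fixed buffer instead of wrapping it in BufferedHttpEntity:

    // Stream the body to disk in 8 KB chunks instead of one big byte[].
    HttpEntity entity = response.getEntity();
    InputStream in = entity.getContent();
    OutputStream out = new FileOutputStream(targetFile);
    try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    } finally {
        out.close();
        in.close();                           // consuming/closing also releases the connection
    }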
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6661
    

olze oli wrote: I just analyzed the heap and found out that it really is the code block I mentioned. [...] Any ideas how I could fix this?


How about running this through VisualVM instead? That will tell you where the leak is occurring and which object type is contributing to it.


olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
I just found that tool last week. And yes, I found the problem: it was the .substring() issue that led to the heap exception. Fixed it with String a = new String(x.substring(...)).
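The reason this helps: up to Java 6 (and before Java 7u6), String.substring() returned a String that shares the original String's backing char[], so keeping a short substring of a multi-megabyte page keeps the whole page's char[] alive. Copying it with new String(...) retains only the characters that are actually needed. A small illustration (downloadAsString is a hypothetical stand-in for whatever produces the page text):

    String page = downloadAsString(url);                    // possibly several MB of HTML
    String link = page.substring(100, 140);                 // shares page's backing char[]
    String safe = new String(page.substring(100, 140));     // copies just the 40 characters
    // Storing 'link' in a long-lived collection pins the whole page in memory;
    // storing 'safe' does not.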
 