my webcrawler throws: java.lang.OutOfMemoryError: Java heap space

 
Ranch Hand
Posts: 187
Hi,

I wrote a web crawler that is supposed to download websites. The problem is that the heap usage keeps growing and growing, and I don't know why.

The exception occurred in the following code block:


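Roughly, the download step looks like this (a simplified sketch of what the stack trace below points at, not my exact code; the class name is just for illustration):

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Simplified sketch: BufferedHttpEntity copies the whole response body into a
// byte[] (via EntityUtils.toByteArray), so every page is held completely in memory.
public class PageDownload {
    public static byte[] download(String url) throws IOException {
        DefaultHttpClient client = new DefaultHttpClient();
        try {
            HttpResponse response = client.execute(new HttpGet(url));
            // The constructor internally calls EntityUtils.toByteArray, which
            // allocates the ByteArrayBuffer that shows up in the stack trace.
            HttpEntity entity = new BufferedHttpEntity(response.getEntity());
            return EntityUtils.toByteArray(entity);
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}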
I use http://andreas-hess.info/programming/webcrawler/index.html with some changes (a DB backend, some fixes where .substring() is used, etc.), but I still get this exception:

(extracted) [4] [53963/919829]: http://de.docs.yahoo.com/copyright.html
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
Exception in thread "Thread-47" java.lang.OutOfMemoryError: Java heap space
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
(extracted) [4] [53963/919829]: http://info.yahoo.com/legal/de/yahoo/tos.html
(extracted) [4] [53963/919829]: http://de.docs.yahoo.com/sicherheitscenter/



As you may have noticed, I have about 1 million links, and I expect it to grow to 10 million links or more, so I have to be careful with every object to keep the heap from filling up.

Can someone tell me what I can do to get rid of this exception? I cannot use the DefaultHttpClient as a static class because this is a highly multithreaded application.

Thanks in advance
 
Author and all-around good cowpoke
Posts: 13078
Maybe you need to change your architecture to avoid accumulating links in memory. Given how rapidly links multiply on highly interlinked pages, a web crawler will always run out of memory if you try to keep all the links in memory while following them up.

How many sites do you want to "download" at one time?

Bill


 
olze oli
Ranch Hand
Posts: 187
I'm already using PostgreSQL in the backend to store the already-downloaded links (it was a String Vector in the old version; the DB now works like a stack, with some extras such as a search function to check whether a link has already been downloaded), and the files are downloaded to the hard disk. The recursion depth is 4, where (e.g.) yahoo.com is 0, all Yahoo subdomains are 1, all links from level 1 to another website are 2, and so on. So I expect about 10-20 million links in the first run. The next run would be Wikipedia, which should lead to far more links.

The only thing I should have in memory is the ArrayList of domains like de.yahoo.com, de.shopping.yahoo.com, etc. Its size is only about 6000 when the exception occurs, so I'm pretty sure that's not the problem.

At the moment I have 15 threads.

The architecture should be OK, I think:

-> Download the link/website -> extract all links from that page and push them into the DB, marking the link as "downloaded" in the DB -> pop one link from the DB and start again
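In code, the worker loop is roughly this (the type names here are only illustrative, not my real classes):

import java.util.List;

// Illustrative sketch of the worker loop described above.
class Link {
    final String url;
    final int depth;                      // 0 = seed, 1 = its subdomains, 2 = external links from there, ...
    Link(String url, int depth) { this.url = url; this.depth = depth; }
}

interface LinkQueue {                     // backed by PostgreSQL in the real crawler
    Link pop();                           // next not-yet-downloaded link, or null when the queue is empty
    void push(Link link);                 // enqueue a newly found link (skipped if already known)
    void markDownloaded(String url);
}

interface Fetcher {
    String download(String url);          // fetch the page and write it to disk
    List<String> extractLinks(String html);
}

class CrawlerWorker implements Runnable {
    static final int MAX_DEPTH = 4;
    private final LinkQueue queue;
    private final Fetcher fetcher;

    CrawlerWorker(LinkQueue queue, Fetcher fetcher) {
        this.queue = queue;
        this.fetcher = fetcher;
    }

    public void run() {
        Link link;
        while ((link = queue.pop()) != null) {
            String html = fetcher.download(link.url);
            queue.markDownloaded(link.url);
            if (link.depth < MAX_DEPTH) {
                for (String found : fetcher.extractLinks(html)) {
                    queue.push(new Link(found, link.depth + 1));
                }
            }
        }
    }
}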

I'm not sure what's happening behind the scenes in the code block I mentioned, because that is where the crash occurred.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
I would certainly try with fewer threads and work up.

How are you handling DB connections? Failing to properly dispose of DB-related objects is a common cause of running out of memory.
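For example, every ResultSet and Statement needs to be closed in a finally block even when an exception is thrown; a generic JDBC sketch (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Generic cleanup pattern: close ResultSet and Statement in finally so they
// cannot pile up on the heap when something goes wrong mid-query.
public class LinkDao {
    public boolean isDownloaded(Connection con, String url) throws SQLException {
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            ps = con.prepareStatement("SELECT 1 FROM links WHERE url = ? AND downloaded = true");
            ps.setString(1, url);
            rs = ps.executeQuery();
            return rs.next();
        } finally {
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (ps != null) try { ps.close(); } catch (SQLException ignored) {}
        }
    }
}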

Bill
 
olze oli
Ranch Hand
Posts: 187
I create the DB connection in the main class. It's a static field that is accessed through the synchronized methods push and pop.
I already tried it with one thread, but the same exception occurs. I'm pretty sure the problem is in the code block I mentioned.
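The shape of it is roughly this (simplified; the names and SQL are illustrative, not my exact code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// One static connection, accessed only through synchronized methods.
public class CrawlerDb {
    private static Connection db;          // opened once in the main class

    public static synchronized void push(String url, int depth) throws SQLException {
        PreparedStatement ps = db.prepareStatement(
                "INSERT INTO links (url, depth, downloaded) VALUES (?, ?, false)");
        try {
            ps.setString(1, url);
            ps.setInt(2, depth);
            ps.executeUpdate();
        } finally {
            ps.close();                    // the statement is closed after every call
        }
    }

    // pop() looks the same: a synchronized SELECT of the next not-yet-downloaded URL.
}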
 
olze oli
Ranch Hand
Posts: 187
I just analyzed the heap, and I found out that it really is the code block I mentioned (the second number is the reserved size in KB on the heap):

org.apache.http.util.ByteArrayBuffer#34 70.591.976
byte[]#1691 70.591.960
org.apache.http.util.ByteArrayBuffer#31 28.677.736
byte[]#1689 28.677.720
ie.moguntia.webcrawler.URLQueue#1 2.085.694
java.util.LinkedList#3 2.085.578
java.util.LinkedList$Entry#3 2.085.558
org.apache.http.util.ByteArrayBuffer#21 1.048.608
byte[]#1633 1.048.592
java.util.ArrayList#1 868.308
java.lang.Object[]#410 868.288
org.apache.http.util.ByteArrayBuffer#18 524.320
org.apache.http.util.ByteArrayBuffer#15 524.320
org.apache.http.util.ByteArrayBuffer#7 524.320
org.apache.http.util.ByteArrayBuffer#4 524.320
byte[]#1627 524.304
byte[]#1621 524.304
byte[]#1609 524.304
byte[]#1608 524.304
org.apache.http.util.ByteArrayBuffer# 262.176



Any ideas on how I could fix this?
 
Bartender
Posts: 6663

olze oli wrote: I just analyzed the heap and found out that it really is the code block I mentioned ... Any ideas on how I could fix this?



How about running this through VisualVM instead? That will tell you where the leak is occurring and which object type is contributing to it.
 
olze oli
Ranch Hand
Posts: 187
I just found that tool last week. And yes, I found the problem: it was the .substring() thing that led to the heap exception. Fixed with String a = new String(x.substring(...)).
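In other words, something like this (the method and variable names are just for illustration):

// On older JDKs (before 7u6), substring() returns a String that shares the backing
// char[] of the original, so keeping a short link substring pinned the whole
// downloaded page in memory. Copying it breaks that reference.
String extractLink(String page, int start, int end) {
    String shared = page.substring(start, end);   // still points at page's char[]
    return new String(shared);                    // copies only the characters that are needed
}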
 