my webcrawler throws: java.lang.OutOfMemoryError: Java heap space

 
Ranch Hand
Posts: 187
Hi,

I wrote a web crawler that is supposed to download websites. The problem is that the heap usage keeps growing and growing, and I don't know why.

The exception occurred in the following code block:


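Roughly, the download step looks like this (a simplified sketch of what the stack trace below points at, not my exact code; the class name is just for illustration):

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Simplified sketch: BufferedHttpEntity copies the whole response body into a
// byte[] (via EntityUtils.toByteArray), so every page is held completely in memory.
public class PageDownload {
    public static byte[] download(String url) throws IOException {
        DefaultHttpClient client = new DefaultHttpClient();
        try {
            HttpResponse response = client.execute(new HttpGet(url));
            // The constructor internally calls EntityUtils.toByteArray, which
            // allocates the ByteArrayBuffer that shows up in the stack trace.
            HttpEntity entity = new BufferedHttpEntity(response.getEntity());
            return EntityUtils.toByteArray(entity);
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}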
I use http://andreas-hess.info/programming/webcrawler/index.html with some changes (a DB backend, some fixes where .substring() is used, etc.), but I still get this exception:

(extracted) [4] [53963/919829]: http://de.docs.yahoo.com/copyright.html
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
Exception in thread "Thread-47" java.lang.OutOfMemoryError: Java heap space
at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:55)
at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:95)
at org.apache.http.entity.BufferedHttpEntity.<init>(BufferedHttpEntity.java:60)
at ie.moguntia.webcrawler.PSuckerThread.process(PSuckerThread.java:89)
at ie.moguntia.threads.ControllableThread.run(ControllableThread.java:46)
(extracted) [4] [53963/919829]: http://info.yahoo.com/legal/de/yahoo/tos.html
(extracted) [4] [53963/919829]: http://de.docs.yahoo.com/sicherheitscenter/



As you may have noticed, I have about 1 million links, and I expect it to grow to 10 million links or more, so I have to be careful with every object to keep the heap from filling up.

Can someone tell me what I can do to get rid of this exception? I cannot use the DefaultHttpClient as a static class because this is a highly multithreaded application.

Thanks in advance
 
Author and all-around good cowpoke
Posts: 13078
Maybe you need to change your architecture to avoid accumulating links in memory. Given how rapidly links multiply on highly interlinked pages, a web crawler will always run out of memory if you try to keep all the links in memory while following them up.

How many sites do you want to "download" at one time?

Bill


 
olze oli
Ranch Hand
Posts: 187
I'm already using PostgreSQL in the backend to store the already-downloaded links (it was a String Vector in the old version; the DB now works like a stack, with some extras such as a search function to check whether a link has already been downloaded), and the files are downloaded to the hard disk. The recursion depth is 4, where (e.g.) yahoo.com is 0, all Yahoo subdomains are 1, all links from level 1 to another website are 2, and so on. So I expect about 10-20 million links in the first run. The next run would be Wikipedia, which should lead to far more links.

The only thing I should have in memory is the ArrayList of domains like de.yahoo.com, de.shopping.yahoo.com, etc. Its size is only about 6000 when the exception occurs, so I'm pretty sure that's not the problem.

At the moment I have 15 threads.

The architecture should be OK, I think:

-> Download the link/website -> extract all links from that page and push them into the DB, marking the link as "downloaded" in the DB -> pop one link from the DB and start again
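In code, the worker loop is roughly this (the type names here are only illustrative, not my real classes):

import java.util.List;

// Illustrative sketch of the worker loop described above.
class Link {
    final String url;
    final int depth;                      // 0 = seed, 1 = its subdomains, 2 = external links from there, ...
    Link(String url, int depth) { this.url = url; this.depth = depth; }
}

interface LinkQueue {                     // backed by PostgreSQL in the real crawler
    Link pop();                           // next not-yet-downloaded link, or null when the queue is empty
    void push(Link link);                 // enqueue a newly found link (skipped if already known)
    void markDownloaded(String url);
}

interface Fetcher {
    String download(String url);          // fetch the page and write it to disk
    List<String> extractLinks(String html);
}

class CrawlerWorker implements Runnable {
    static final int MAX_DEPTH = 4;
    private final LinkQueue queue;
    private final Fetcher fetcher;

    CrawlerWorker(LinkQueue queue, Fetcher fetcher) {
        this.queue = queue;
        this.fetcher = fetcher;
    }

    public void run() {
        Link link;
        while ((link = queue.pop()) != null) {
            String html = fetcher.download(link.url);
            queue.markDownloaded(link.url);
            if (link.depth < MAX_DEPTH) {
                for (String found : fetcher.extractLinks(html)) {
                    queue.push(new Link(found, link.depth + 1));
                }
            }
        }
    }
}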

I'm not sure what's happening behind the scenes in the code block I mentioned, because that is where the crash occurred.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
I would certainly try with fewer threads and work up.

How are you handling DB connections? Failing to properly dispose of DB-related objects is a common cause of running out of memory.
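For example, every ResultSet and Statement needs to be closed in a finally block even when an exception is thrown; a generic JDBC sketch (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Generic cleanup pattern: close ResultSet and Statement in finally so they
// cannot pile up on the heap when something goes wrong mid-query.
public class LinkDao {
    public boolean isDownloaded(Connection con, String url) throws SQLException {
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            ps = con.prepareStatement("SELECT 1 FROM links WHERE url = ? AND downloaded = true");
            ps.setString(1, url);
            rs = ps.executeQuery();
            return rs.next();
        } finally {
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (ps != null) try { ps.close(); } catch (SQLException ignored) {}
        }
    }
}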

Bill
 
olze oli
Ranch Hand
Posts: 187
I create the DB connection in the main class. It's a static field that is accessed through the synchronized methods push and pop.
I already tried it with one thread, but the same exception occurs. I'm pretty sure the problem is in the code block I mentioned.
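The shape of it is roughly this (simplified; the names and SQL are illustrative, not my exact code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// One static connection, accessed only through synchronized methods.
public class CrawlerDb {
    private static Connection db;          // opened once in the main class

    public static synchronized void push(String url, int depth) throws SQLException {
        PreparedStatement ps = db.prepareStatement(
                "INSERT INTO links (url, depth, downloaded) VALUES (?, ?, false)");
        try {
            ps.setString(1, url);
            ps.setInt(2, depth);
            ps.executeUpdate();
        } finally {
            ps.close();                    // the statement is closed after every call
        }
    }

    // pop() looks the same: a synchronized SELECT of the next not-yet-downloaded URL.
}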
 
olze oli
Ranch Hand
Posts: 187
I just analyzed the heap, and I found out that it really is the code block I mentioned (the second number is the reserved size in KB on the heap):

org.apache.http.util.ByteArrayBuffer#34 70.591.976
byte[]#1691 70.591.960
org.apache.http.util.ByteArrayBuffer#31 28.677.736
byte[]#1689 28.677.720
ie.moguntia.webcrawler.URLQueue#1 2.085.694
java.util.LinkedList#3 2.085.578
java.util.LinkedList$Entry#3 2.085.558
org.apache.http.util.ByteArrayBuffer#21 1.048.608
byte[]#1633 1.048.592
java.util.ArrayList#1 868.308
java.lang.Object[]#410 868.288
org.apache.http.util.ByteArrayBuffer#18 524.320
org.apache.http.util.ByteArrayBuffer#15 524.320
org.apache.http.util.ByteArrayBuffer#7 524.320
org.apache.http.util.ByteArrayBuffer#4 524.320
byte[]#1627 524.304
byte[]#1621 524.304
byte[]#1609 524.304
byte[]#1608 524.304
org.apache.http.util.ByteArrayBuffer# 262.176



Any ideas on how I could fix this?
 
Bartender
Posts: 6663

olze oli wrote: I just analyzed the heap and found out that it really is the code block I mentioned ... Any ideas on how I could fix this?



How about running this through VisualVM instead? That will tell you where the leak is occurring and which object type is contributing to it.
 
olze oli
Ranch Hand
Posts: 187
I just found that tool last week. And yes, I found the problem: it was the .substring() thing that led to the heap exception. Fixed with String a = new String(x.substring(...)).
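In other words, something like this (the method and variable names are just for illustration):

// On older JDKs (before 7u6), substring() returns a String that shares the backing
// char[] of the original, so keeping a short link substring pinned the whole
// downloaded page in memory. Copying it breaks that reference.
String extractLink(String page, int start, int end) {
    String shared = page.substring(start, end);   // still points at page's char[]
    return new String(shared);                    // copies only the characters that are needed
}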
 