I want to download as many websites as possible. I started with this tool http://andreas-hess.info/programming/webcrawler/index.html and modified it heavily. As a result it runs, but sometimes it simply does nothing (I checked the TCP states, threads, heap space etc. and everything looks fine to me), so I started searching the internet for a crawler that can handle _many_ links (e.g. yahoo.com at depth 4 or 5, which leads to 10,000,000 links or more) with database support in the backend.
I found this posting here http://www.coderanch.com/t/519409/Servlets/java/crawler but the crawler listed there does not check whether memory is getting into a critical state. That will always happen if, for example, .substring() is used without wrapping the result in new String(), or if the crawler stores the links in an ArrayList or some other kind of List, because (that's what I think) no List can hold a few million links. This is what I found out while investigating my own crawler.
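For context, the substring point refers to pre-Java 7 behavior, where String.substring() shared the backing char[] of the parent string, so a tiny link string could keep a whole downloaded page reachable; wrapping it in new String(...) allocates a fresh minimal array. A minimal sketch (the page size here is just an example):

```java
public class SubstringCopy {
    public static void main(String[] args) {
        // Simulate a large downloaded page held in memory
        String page = new String(new char[1_000_000]).replace('\0', 'a');

        // Pre-Java 7: this substring shares the 1M-char backing array
        // of 'page', keeping it reachable even though only 20 chars
        // are actually needed
        String link = page.substring(0, 20);

        // Defensive copy: new String(...) allocates a fresh 20-char
        // array, so the big page can be garbage collected
        String copy = new String(link);

        System.out.println(copy.length()); // prints 20
    }
}
```

(Java 7 changed substring() to copy by default, so this trick matters mainly on the older JVMs in use at the time of this thread.)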
So if anyone can recommend a crawler that works without throwing an OOME, please let me know.
No crawler can guarantee never to throw an OOME. The more links are crawled and stored, the more memory is required, and after a while the JVM simply has no more available. You can configure how much is available using the -Xmx flag, although you're still limited to about 1.5 GB on a 32-bit JVM (at least on Windows).
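One way a crawler can at least react before the heap runs out is to poll the Runtime API and pause when free memory gets low. A rough sketch; the 10% threshold is an arbitrary example value, not anything from the thread:

```java
public class HeapWatch {
    // Hypothetical threshold: treat memory as critical when less than
    // 10% of the maximum heap (set via -Xmx) remains available
    static final double MIN_FREE_RATIO = 0.10;

    static boolean memoryCritical() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                    // the -Xmx limit, in bytes
        long used = rt.totalMemory() - rt.freeMemory(); // currently occupied heap
        return (max - used) < max * MIN_FREE_RATIO;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        System.out.println("critical: " + memoryCritical());
    }
}
```

Run it with e.g. `java -Xmx512m HeapWatch` to see the configured limit; a crawler could check memoryCritical() before queuing more links and flush to the database instead.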
Yeah, that's probably true, and software will never be perfect; that's OK with me.
But I'm using a database in the backend, so there should be a way to accomplish this goal. At the moment my crawler works the following way:
1) create the database "crawler"
2) check whether the given URL is valid
3) get the host of the URL and create a table (if it doesn't exist) with the same name (busy, processed, downloaded, level, url)
4) add the given host to a static HashMap
5) download the URL, process it (extract all URLs from that HTML page), and for each URL do steps 2, 3, 4 and 5
6) go through the HashMap and check whether there's a host with unprocessed elements at some level (0...x); if so, take the URL and do steps 2, 3, 4, 5
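Steps 2-4 of the loop above might be sketched like this; all names are hypothetical, and the actual download and database access (steps 1, 3's table creation, 5, 6) are abstracted away:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-URL part of the crawl loop described above
public class CrawlerSketch {
    // step 4: static map of host -> queue of URLs still to process
    // (in the real crawler this would be backed by the per-host table)
    static final Map<String, Deque<String>> hosts = new HashMap<>();

    // step 2: check whether the given URL is valid
    static boolean isValid(String url) {
        try {
            URI u = new URI(url);
            return u.getHost() != null && u.getScheme() != null
                    && u.getScheme().startsWith("http");
        } catch (Exception e) {
            return false;
        }
    }

    // steps 2-4 for a single URL
    static void enqueue(String url) {
        if (!isValid(url)) return;                    // step 2
        String host = URI.create(url).getHost();      // step 3: host = table name
        hosts.computeIfAbsent(host, h -> new ArrayDeque<>())
             .add(url);                               // step 4
    }

    public static void main(String[] args) {
        enqueue("http://yahoo.com/index.html");
        enqueue("not a url");
        System.out.println(hosts.keySet()); // only the valid host was added
    }
}
```

Keeping only the host-to-queue index in memory and the millions of links themselves in the database is what keeps a structure like this from turning into the giant in-memory List criticized earlier in the thread.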
This works pretty well; I currently have nearly 2 million links and can crawl across yahoo. Unfortunately there is probably a bug somewhere, because last night the process stopped. When I attached the debugger to check where the threads were waiting, they started working again, and I have no idea what went wrong. That's why I'm looking for a professional solution, if there is one.
Is there any particular reason why you're not basing your software on proven crawl infrastructure like Apache Nutch?
Joined: Jun 20, 2009
I know I tested Nutch, and I know it wasn't adequate for me, but please don't ask me why; it's too long ago.
I'm currently testing Heritrix, which looks pretty good to me.
That's what I'm looking for: proven software that can handle such amounts of data. I'll let Heritrix run until Monday and then see what happens ;)
The crawler stopped working yesterday morning. It was not a heap problem; it was a problem with file handles (too many open files).
Heritrix 3 stopped after about 12 MB, no idea why. The web interface says it's running, but the report says it terminated abnormally.
The operating system only allows a certain number of open file* handles at a time. If you exceed this number, it will simply refuse to open more, and you will get an exception in Java. That's why streams that are no longer needed should always be closed, preferably in a finally block. With a multi-threaded approach you must also limit the number of concurrent threads.
* don't take the "file" part literally, it also includes handles to network connections and many other things.
I opened a bug report for Heritrix 1.x because I think this is not normal behavior.
I really thought Heritrix would solve my problem.
Does anyone know another good crawler that can handle a few GB of data? I really liked the "snapshot"/"pause/resume" feature of Heritrix.