I wrote a crawler which runs fine (now). It uses a HashSet to store URLs like http://www.google.com, creates a table named after the host (www.google.com), and inserts the link together with a boolean field "isProcessed".
My problem is that the crawler gets pretty slow once I reach ~1 million links, so I started the profiler and found that the bottleneck is VisibleBufferedInputStream.readMore(int), which is part of the PostgreSQL JDBC driver.
I use some SELECTs to check whether the link I'm currently processing is already listed in the DB, to prevent duplicates. Maybe someone can give me some hints like "don't use SELECT * FROM ..., use SELECT field1, field2 FROM ...".
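For reference, a duplicate check can select a constant and stop at the first match instead of pulling whole rows. A minimal sketch of that pattern (the per-host table naming and the "url" column are assumptions, since the actual schema isn't shown in the post):

```java
import java.net.URI;

// Sketch of the duplicate check described above. The table/column names
// ("links_<host>", "url") are assumptions; the post doesn't show the schema.
public class DupCheck {

    // Derive the per-host table name the way the post describes
    // (one table per host, e.g. www.google.com).
    static String tableFor(String url) {
        String host = URI.create(url).getHost();   // e.g. "www.google.com"
        return "links_" + host.replace('.', '_');  // sanitize for an SQL identifier
    }

    // A cheap existence check: select a constant, not whole rows,
    // and stop at the first hit.
    static String existsSql(String table) {
        return "SELECT 1 FROM " + table + " WHERE url = ? LIMIT 1";
    }

    public static void main(String[] args) {
        String t = tableFor("http://www.google.com/search?q=x");
        System.out.println(existsSql(t));
    }
}
```

Binding the link value with `?` via a PreparedStatement parameter (the table name still has to be concatenated) also spares the server from re-parsing a slightly different statement text for every URL.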
I have no idea how I could tune the database, and it would be really nice if it ran as fast as it does right after I start my application.
It's hard to give tips on a hypothetical situation. Can you share the SQL if you want comments on it? And as Ulf noted, the EXPLAIN plan will show you where the time is going. I blogged about how to do this in PostgreSQL.
First I start with a link (taken from the DB or passed as a command-line parameter) and check whether the corresponding table exists.
If it doesn't, I create the table,
set the link as busy,
and download the link and mark it as downloaded.
Then I extract all links from the downloaded HTML and insert them.
Last, I set the processed field to true so I can exclude those links from my next query.
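The extraction step above might look roughly like the following. The post doesn't show the actual parser, and a real crawler would be better served by an HTML parser than a regex, so this is only an illustration:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of the "extract all links" step; the regex is a stand-in for
// whatever parsing the actual crawler does.
public class LinkExtractor {

    // Matches absolute links in double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute links found in the page, in document order.
    // Deduplicating here already cuts down on pointless round trips to the DB.
    static Set<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```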
I just read that it's possible to enable time tracking in the pg logs, like a verbose mode.
I will try this; maybe I can find the bottleneck.
Edit: I forgot to mention that I create an index right after creating the table.
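The DDL itself isn't shown, so the following is only a hypothetical reconstruction of what the per-host statements might look like; note that declaring the URL column PRIMARY KEY (or UNIQUE) would create that index implicitly, saving the separate CREATE INDEX:

```java
// Hypothetical reconstruction of the per-host DDL the post alludes to; the
// actual statements aren't shown, so all names here are made up.
public class Ddl {

    // One table per host, as described in the thread. A PRIMARY KEY on the
    // url column creates a unique index automatically, so no separate
    // CREATE INDEX is needed, and duplicate inserts fail fast.
    static String createTable(String table) {
        return "CREATE TABLE " + table + " (" +
               "url text PRIMARY KEY, " +
               "isProcessed boolean NOT NULL DEFAULT false)";
    }
}
```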
As I cannot use PreparedStatements, I decided to use PL/pgSQL to create procedures that don't have to be parsed every time, because I saw in the logs that a lot of time gets wasted parsing the statements, and they only change slightly (table name and link). Or is this a bad idea?
Creating a table and an index for each host seems inefficient. Also, it seems you could save an UPDATE if you marked the processed field true on insert. Another benefit of using one table for all hosts is that you could more easily use a PreparedStatement.
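A sketch of that single-table idea, with the host as an ordinary column so the statement text never changes and the PreparedStatement can be reused (the schema is an assumption, and ON CONFLICT requires PostgreSQL 9.5 or later):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the single-table approach suggested above. Table and column
// names are assumptions; ON CONFLICT needs PostgreSQL 9.5+.
public class SingleTable {

    // One table for every host; because host is just a column, this text is
    // constant and the server can cache the parsed plan.
    // ON CONFLICT folds the duplicate check and the insert into one round trip.
    static final String INSERT_SQL =
            "INSERT INTO links (host, url, processed) VALUES (?, ?, false) " +
            "ON CONFLICT (url) DO NOTHING";

    static void insertLink(Connection conn, String host, String url) throws SQLException {
        // In the real crawler, prepare this once outside the loop and reuse it
        // instead of re-preparing per link.
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, host);
            ps.setString(2, url);
            ps.executeUpdate();   // returns 0 when the url was already known
        }
    }
}
```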