JavaRanch » Java Forums » Java » Performance

jdbc - common problems/mistakes

olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
hi,

i wrote a crawler which runs fine (now). It uses a HashSet to store URLs like http://www.google.com, creates a table named after the host (www.google.com), and inserts each link together with a boolean field "isProcessed".
My problem is that the crawler gets pretty slow once I reach ~1 million links, so I started the profiler and found that the bottleneck is VisibleBufferedInputStream.readMore(int), which is part of the PostgreSQL JDBC driver.

I use some SELECTs to check whether the link I'm currently processing is already in the db, to prevent duplicates. Maybe someone can give me some hints like "don't use select * from..., use select field1, field2... from..."
I have no idea how I could tune the database, and it would be really nice if it ran as fast as it does right after I start my application.
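One cheap win before touching the database at all: make sure the in-memory HashSet really deduplicates before the db is hit. A minimal sketch (illustrative names, not the poster's actual code) that keys the set on a canonical form of the URL, so `http://WWW.Google.com/` and `http://www.google.com` count as the same link:

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: deduplicate links in memory on a canonical key,
// so the database only ever sees genuinely new URLs.
public class LinkDedup {
    private final Set<String> seen = new HashSet<>();

    // Canonical key: lower-cased host plus path, ignoring scheme and fragment.
    static String canonical(String url) {
        URI u = URI.create(url);
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        return host + path;
    }

    // Returns true the first time a URL is offered, false for duplicates.
    public boolean offer(String url) {
        return seen.add(canonical(url));
    }
}
```

This only helps within one run, of course; across restarts the database check is still needed.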
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42934
    
Do you have indexes in place that get used for those SELECTs? You may want to familiarize yourself with Postgres' EXPLAIN functionality that shows you the execution plan for the query in question.
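For example, assuming the duplicate check is something like `SELECT 1 FROM some_table WHERE link = ?` (names are guesses), the plan can be fetched straight through JDBC; each row of the result set is one line of the plan:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: prefix the query with EXPLAIN ANALYZE and print the plan.
// Table and column names here are assumptions, not the poster's schema.
public class ExplainPlan {
    static String explain(String sql) {
        return "EXPLAIN ANALYZE " + sql;
    }

    static void printPlan(Connection con, String sql) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(explain(sql))) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one plan line per row
            }
        }
    }
}
```

A sequential scan in the output on a million-row table is the classic sign of a missing (or unusable) index.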
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 31075
    

Olze,
It's hard to give tips on a hypothetical situation. Can you share the SQL if you want comments on it? And as Ulf noted, the explain plan will show you where the time is going. I blogged about how to do this in PostgreSQL.


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
Thanks for the hint about EXPLAIN.

First I start with a link (taken from the db or passed as a command line parameter) and check if the corresponding table exists:


If it doesn't, I create the table:
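A rough sketch of what these first two steps could look like over JDBC. The schema (one row per link, with busy/downloaded/processed flags), the table name, and the `IF NOT EXISTS` shortcut (PostgreSQL 9.1+; `to_regclass` needs 9.4+) are all assumptions, not the poster's code:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: existence check and table creation for one host's link table.
public class LinkTable {
    // True-returning query if the table already exists (PostgreSQL 9.4+).
    static String existsSql(String table) {
        return "SELECT to_regclass('" + table + "') IS NOT NULL";
    }

    // Guessed layout: the link plus the status flags the post describes.
    static String createSql(String table) {
        return "CREATE TABLE IF NOT EXISTS " + table + " ("
             + " link TEXT PRIMARY KEY,"
             + " busy BOOLEAN DEFAULT FALSE,"
             + " downloaded BOOLEAN DEFAULT FALSE,"
             + " processed BOOLEAN DEFAULT FALSE)";
    }

    static void ensureTable(Connection con, String table) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute(createSql(table)); // IF NOT EXISTS makes this idempotent
        }
    }
}
```

With `link` as the primary key, the separate duplicate-check SELECT becomes optional: the database enforces uniqueness anyway.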


Then I set the link as busy:


I download the link and mark it as downloaded:


Then I extract all links from the downloaded HTML and insert them:
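A hedged sketch of that insert step: batching all extracted links through one PreparedStatement cuts the round trips, and `ON CONFLICT DO NOTHING` (PostgreSQL 9.5+) makes duplicate links cheap without a SELECT first. Identifiers are illustrative:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch: insert all links from one page in a single batch.
public class InsertLinks {
    static String insertSql(String table) {
        return "INSERT INTO " + table + " (link) VALUES (?) ON CONFLICT DO NOTHING";
    }

    static void insertAll(Connection con, String table, List<String> links)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(insertSql(table))) {
            for (String link : links) {
                ps.setString(1, link);
                ps.addBatch();      // queue, don't execute yet
            }
            ps.executeBatch();      // one round trip for the whole page
        }
    }
}
```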


Last, I set the processed field to true so I can exclude those rows from my next query:
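Each of the status changes above (busy, downloaded, processed) can be one parameterized UPDATE; a sketch with column names guessed from the description:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: flip one status flag for one link. The column name is concatenated
// into the SQL, so it must come from a fixed set of constants, never from input.
public class LinkStatus {
    static String markSql(String table, String column) {
        return "UPDATE " + table + " SET " + column + " = TRUE WHERE link = ?";
    }

    static void mark(Connection con, String table, String column, String link)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(markSql(table, column))) {
            ps.setString(1, link);
            ps.executeUpdate();
        }
    }
}
```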


I just read that it's possible to enable time tracking in the pg logs, like a verbose mode.
I will try this; maybe I can find the bottleneck.


Edit: I forgot, I also create an index after creating the table:
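If that index mainly serves the "which links are not processed yet" query, a partial index is worth trying: it stays small because processed rows drop out of it. A sketch (index name, columns, and the `IF NOT EXISTS` clause, which needs PostgreSQL 9.5+, are assumptions):

```java
// Sketch: a partial index covering only the unprocessed rows, so the
// "next link to crawl" lookup stays fast as the table grows.
public class LinkIndex {
    static String createIndexSql(String table) {
        return "CREATE INDEX IF NOT EXISTS idx_" + table + "_unprocessed"
             + " ON " + table + " (link) WHERE processed = FALSE";
    }
}
```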

olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
Since I cannot use PreparedStatements, I decided to use PL/pgSQL procedures, which don't have to be parsed every time, because I saw in the logs that a lot of time is wasted parsing the statements, and they only change slightly (table name and link). Or is this a bad idea?
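One alternative worth noting: PreparedStatements can't take the table name as a bind parameter, but they can still be reused. Keeping one prepared statement per table in a map means each distinct statement text is parsed only once per connection. A sketch with illustrative names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Sketch: cache one PreparedStatement per host table and reuse it,
// so repeated parsing of near-identical statements goes away.
public class StatementCache {
    private final Connection con;
    private final Map<String, PreparedStatement> byTable = new HashMap<>();

    public StatementCache(Connection con) {
        this.con = con;
    }

    // The duplicate-check query; only the table name varies per host.
    static String selectSql(String table) {
        return "SELECT 1 FROM " + table + " WHERE link = ?";
    }

    // Prepare lazily on first use, then reuse for every later call.
    public PreparedStatement forTable(String table) throws SQLException {
        PreparedStatement ps = byTable.get(table);
        if (ps == null) {
            ps = con.prepareStatement(selectSql(table));
            byTable.put(table, ps);
        }
        return ps;
    }
}
```

The cached statements should be closed when the connection is retired; with many thousands of hosts per connection, a bounded cache would be safer than a plain HashMap.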
kri shan
Ranch Hand

Joined: Apr 08, 2004
Posts: 1382
What is a crawler? Is it an open source tool?
steve souza
Ranch Hand

Joined: Jun 26, 2002
Posts: 861
Creating a table and index for each host seems inefficient. Also, it seems you could save an update if you marked the processed field true on insert. Another benefit of using one table for all hosts is that you could more easily use a prepared statement.
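A sketch of that single-table layout (the schema is illustrative, not from the thread): with the host as an ordinary column, one PreparedStatement covers inserts for every site, and setting processed at insert time saves the later UPDATE:

```java
// Sketch: one table for all hosts instead of one table per host.
public class SingleTable {
    static final String CREATE =
        "CREATE TABLE IF NOT EXISTS links ("
      + " host TEXT NOT NULL,"
      + " link TEXT NOT NULL,"
      + " processed BOOLEAN DEFAULT FALSE,"
      + " PRIMARY KEY (host, link))";

    // One statement works for every host; no per-table SQL text.
    static final String INSERT =
        "INSERT INTO links (host, link, processed) VALUES (?, ?, TRUE)";
}
```

The composite primary key also gives the duplicate check and the per-host lookups an index for free.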


http://www.jamonapi.com/ - a fast, free open source performance tuning api.
JavaRanch Performance FAQ
olze oli
Ranch Hand

Joined: Jun 20, 2009
Posts: 148
The crawler is not open source; if it were, I would have posted the source ;)

The crawler has to handle a really large number of links and hosts; it's not possible to do it any other way.
A link counts as processed only after it has been downloaded and parsed, not before both of these steps are done.
 