I have the following scenario: I need to find all customers who bought at least two certain books. As an example I want to find all users who bought "Sql Book" and "Java Book". For that I've got the following query:
This works fine and delivers the expected results. But: it's too slowly. I have a quite huge amount of data, so that this query needs more than 200 seconds. Could anybody help me out, to rewrite this query so that it performs faster?
PS: I have no influence on the database itself, so I can't set any index or modify the database in any other way.
Joachim, What database are you using? In particular, is it one that supports the INTERSECTION keyword? If so, it would likely be faster to do two separate queries and use that operator.
Also, are you calling this query via JDBC. If so, you could just run the two queries (matching SQL book or matching Java book) separately and do the intersection yourself in Java.
The best way to find out where the time is going is really to run a database "explain" command on your query. If it is doing a nested loop, my suggestions will help. If it is spending the time doing a table scan (which sounds more likely from your description), the only thing that will speed things up is an index on boughtBooks. If you can't change the database, you'll need to either talk to the people who can change it or the people who gave you the performance requirement.
Requires a full table scan for each row (the inner query must return a result set that reads all rows in the table before the outer query's where clause can be evaluated)... unless boughtBooks is an indexed column... which is not very likely.
[ November 14, 2008: Message edited by: Paul Campbell ] [ November 14, 2008: Message edited by: Paul Campbell ]
Thanks for your answer Paul. Your query is ca. 10% faster. In the meanwhile I talked to the responsible person and I was told it should not be a problem to set an index. I've got a question though: what exactly does "select 1" means in your query?
Originally posted by Joachim Rohde: Thanks for your answer Paul. Your query is ca. 10% faster. In the meanwhile I talked to the responsible person and I was told it should not be a problem to set an index. I've got a question though: what exactly does "select 1" means in your query?
Well... there are two answers here. The short answer is the sub-query returns a value of 1 for each row that is true... you could really use anything you like... the general practice is to "Select 1" with an Exists clause.
The more important concept is that the EXISTS clause is a boolean True/False for the sub-query (you can also use NOT EXISTS in a sub-query). The sub-query will evaluate to TRUE if any rows are returned for the sub-query (and TRUE if no rows are returned for NOT EXISTS).
Using a correlated sub-query (a query that has reference to the outer query) is usually faster when you use it in this manner than a non-correlated sub-query. A non-correlated sub-query must also perform a sort when the comparison operator is IN, NOT IN, = ANY, <> ANY, = ALL, or <> ALL.
I can clarify if I rushed through this too quickly... its a nice diversion from what I'm working on right now. [ November 15, 2008: Message edited by: Paul Campbell ]