This week's book giveaway is in the OCPJP forum. We're giving away four copies of OCA/OCP Java SE 7 Programmer I & II Study Guide and have Kathy Sierra & Bert Bates on-line! See this thread for details.
I've looking at code that has about 9 million rows to process, and see no attempts at optimization (in the code). I say that, because I don't see any attempts at setting fetch size, or use cursors. Now: Perhaps that is because the defaults are already the most performant.
There seems to be two approaches: 1) retrieve the entire resultset to the client 2) use server-side cursors, and a fetch size
Is this statement true:
1) if the client has unlimited (ie: "enough") memory, does fetching "all at once" to the client outperform a solution where cursors are used, and multiple fetches are performed?
I would tend to think so, because you're hitting the db and network just once. Granted, the Resultset will be massive on the client, but assuming you have the memory and CPU to handle it....
I suppose when you really get *right* down to measuring seconds in an hours-long process, perhaps there's a performance benefit (or only a perceived one?) to doing fetches. That being: You can start to processing the resultset much faster (after the first fetch) rather than waiting for it all to traverse. But... it's not like the driver is doing a background fetch, right? So now it's a discussion between "wait a long while, then never again" vs "wait less time, but many times over".
My database is Sybase: from googling, I think this tends to matter.
So.. I'm thinking that the Oracle speed-boost is really only about " *IF* you are using cursors, then set your batch size correctly" But if I'm not using that, then I'm using a faster/fastest resultset possible already...
Do I have that sort of correct?
edit: changed title from: JDBC - Resultsets - What is the right 'size' for best performance? [ May 26, 2008: Message edited by: Mike Fourier ]
I'm stuck between replying "most unhelpful reply ever" and "ask a stupid question..." (but mostly that second thing). So all of this, is with a certain amount of chagrin. Perhaps there's just no "smart" way to ask this question...
The title of my post was totally misleading. I wasn't actually ever asking for what the batch size was supposed to be. For that, I already know the answer ("it depends"). Anyone that asks "what should my batch size be?" does deserve "depends". But even then, the *generic* advice is (seemingly) "between 1/4 and 1/2 the size of the expected result size." So clearly (to some people) there seems to be some general advice one could give. And, of course, all the usual caveats apply to general advice (that being: it is *general* advice, and one's mileage will vary).
Another example of general advice that could be given, even though everyone will have different exact experiences: Given an expected resultset in the millions, then someone saying "In general, a batch size of 100 will be more performant than a batch size of 10" would not be incorrect. They would, in general, be giving good advice.
Is there not some similar statement that one can make that answers the question: "do server side cursors generally underperform client-side resultsets?" (which is what I was, in my befuddled post, asking).
As for trying it out myself: Yes, I'll be doing that. And if I switch to a cursor and batchsize, I will be testing what is the 'best' size for my hardware / schema / network, etc, etc. But what I was looking for was an experienced person to say something like:
a) "know what? forget it. you're already using the best way" or b) "I've had mixed results, you'll really have to try both" or c) "batches are, in my experience, always faster for large resultsets"
Something to let me know if the time involved is even worth while. (How much time could it be??) Well... the current operation takes longer than my day at work. So testing several scenarios (batch sizes) would take several days. It also does a number of our test server, which other people share.
Mike, I choose "c". However, note that the default is not one big batch. It is database dependent. I think it is 20 or 100 rows at a time on one database.
You can test out what will help with way less than 9 million rows. Try to come up with an example of 10,000 rows and try some different scenarios. This is big enough to give you some basic data without taking all day to run.
Also, have you confirmed that everything else is optimized? For example, returning even one unused column results in a tremendous amount of network traffic.