I have a question regarding the processing of huge resultset in databases. I am using Oracle.
Say I want to do processing of all the rows of a particular table and the table contains 10s of mllions of rows and the result should be in sorted as per some column.
I think the way it happens (I may be wrong , please correct me) is that , read lock of all the tuples are acquired , the sorting is done and then the result is dumped in some space in the server's memory (+ may be some disk , as momory might not be sufficient) and then all the read locks are released for other sessions , then a batch of rows are returned to the user in a resultset , thereafter the caller would get batches of results from the resultset based on the fetch size set which extracting the result set.
This might led the server to run out of resources and the server might not take any more work from other sessions. What would be the preferred way of doing such activity involving huge number of rows from a table.
Oracle doesn't use read locks at all - only write locks, and write locks don't block reads (only writes). Also, Oracle never escalates locks - so a row lock will never become a page/table lock (it stores locks with the data to achieve this).
Oracle's concurrency mechanism is based on multiversioning - the database reconstructs the data as it looked like at the time of the start of the command (or the time of the start of the transaction in case of serializable/read only transactions) to ensure consistency. The past version of the data is reconstructed using undo. So basically you need to size your undo so that it can keep enough data to see all the way back to the oldest running query/transaction (the sizing depends on the rate of changes made to the database as well, of course). If Oracle is not able to reconstruct the old data, it reports the ORA-01555 Snapshot Too Old error code. There are other factors that play into sizing the undo tablespace, but if you need them, I'd suggest reading the documentation.
Sorting and fetching tens of millions of rows is certainly going to be resource intensive (in IO and CPU), but it should not exhaust memory - the sorting will be done on disk, and - as already mentioned - won't consume any precious locks.
Joined: Nov 29, 2005
Thanks a lot Martin for the response. I wasn't aware that Oracle uses MVCC.
I am still struck with my question , as if I have to process millions of rows , one after other , a processing of a single row takes some time , then it would not be possible (or not a good approach) to open a resultset and process it , as the result set might have to be open for days and undo space might run out. It also looks kind of bad design to have such a long running transaction. So how do we fix this problem.
One approach my friend did and it worked was to open the result set and write the details (details which are required to process further) from the rows to a file and then quickly close the result set and then read the file and do the processing without the need of db.It has worked.
I somehow feel that the approach of off loading data from one data source to another can be avoided.Do you have any advice on this ?
Having a resultset open for days really doesn't sound like a good idea. I'd go for a few hours, but anything that cannot be completed overnight is probably too impractical.
Your proposed solution can certainly work, although it obviously bears some performance penalty. Depending on your concrete requirements, some other approaches might be possible:
1) Move the processing to the database. You can use PL/SQL or Java stored procedures, and perhaps do some or even most of the processing in SQL itself. That would be the best, as something that can be done in SQL is very probably going to be much faster and less resource intensive than any procedural solution.
2) Increase the processing speed - assuming the processing takes significantly longer time than fetching, you might be able to cut processing time significantly by distributing fetched rows among several worker threads for processing. A producer/consumer pattern would be useful here.
3) Read the data from the database in several batches using pagination query. This might not work if the source data can change during the processing, as you wouldn't have a consistent view of all the rows. Also, if you really need to process rows in a defined order, the pagination queries will get very expensive towards the end of the processing. Some more involved pagination/partitioning strategies might help. Also, if you could process the rows in arbitrary order, you might use Oracle's mechanism to split the table into ranges based on rowids, making the cost of the queries constant over the whole processing range.
Joined: Nov 29, 2005
Thanks a lot for the response. As the processing of the rows involve calling a external web service and other network io related operation.We have implemented the producer consumer pattern to process the number of tuples in parallel , but again those are also bounded by the cup core, might not be sufficient for our case.
We cannot go for the third solution as the state of db is changing continuously and very fast rate.
May be offloading is good fit as of now. But I am now considering writing the tuples to JMS queue and then let a large number of consumer threads consume/process those messages and ack. It would not be bounded by the cup cores as we can use multiple consumer system.