I have to run a batch program which contains 100k records to process.
The java program has to fetch 100k records from the database and process each record. Once the process is complete the java program updates the record in the database with the process status.
We are using Hibernate O/R mapping framework for all database interactions.
I am trying to figure out the best approach to tackle this. First approach is I load all 100k records via Hibernate Criteria list, carry out the java processing and then update individual record in the database with the status. But the issue I encounter is it runs into an Out of Memory space error as each record is of considerable size comprising several data columns.
The other approach I tried is to load only those columns (using Hibernate Projections) for the 100k records required for processing in java. This approach solves the 'out of space memory ' problem I run into as mentioned in the first approach but there is downside to this. Since I am only grabbing the selected column of the record object I end up with partial Hibernate entity object not good enough for persisting back to the database. So after processing each record I have to reload back the full Hibernate entity object in order carry out the update. This works but is significant performance overhead considering there will be 100k database roundtrips.
Please suggest what would be a good approach for handling the aforementioned problem.
I'm not even sure if Hibernate is the right tool here. Typically batch processing can be done more effectively and efficiently by native database tools, ETL programs, and other options. Does it have to be an ORM mapping tool?
Do you really need all rows in memory at the same time? Can't you devide the data into some logically connected parts which are independent? How much memory do you have? How many fields does your table have ans how big are they? I think there shouldn't be a problem to hold 100K rows.
If you don't need everything in memory you can use scrollable resultsets. And don't forget to flush() and clear() session from time to time (100 to 1000 rows).
The batch functionality is an extension of our current on-demand java application. The java app contains the business processing needed for the mentioned records part of the batch op. So it's inevitable to use/invoke the java app in a batch.
No there is no need to have all the 100k records in memory at the same time.
Another approach to get around the memory size issue (approach1) is to divide the entire batch program into smaller chunks say 10k records at a time, load it and process. This works well but we limit the resultset to size of 10k [criteria.maxresults(10000)] in one run and needs executing the batch utility 10 or more times. In this case how can I ensure how to size the loop of 10 or more or less batch routines. Can you elaborate on the Scrollable results set suggestion.