I have a requirement where I have to select around 60 million plus records from database. Once I have all records in ResultSet then I have to formate some columns as per the client requirement(date format and number format) and then I have to write all records in a file(secondary memory).
Currently I am selecting records on day basis (7 selects for 7 days) from DB and putting them in a HashMap. Reading from HashMap and formating some columns and finally writing in a file(separate file for 7 days).
Finally I am merging all 7 files in a single file.
But this whole process is taking 6 hrs to complete. To improve this process I have created 7 threads for 7 days and all threads are writing separate files.
Finally I am merging all 7 files in a single file. This process is taking 2 hours. But my program is going to OutOfMemory after 1 hour and so.
Please suggest me the best design for this scenario, should I used some caching mechanism, if yes, then which one and how?
Note: Client doesn't want to do anything from Database side, like create indexes or stored procedures.
Thanks in advance.
[NK: Removed the blue font colour. It makes it difficult to read the post]
Why write them to a file in memory? Just stream them to a File on the file system. If you plan on pulling all those millions of records into memory first, then you better have some serious hardware and JVM configurations.
Paul Sturrock wrote:This is a big export so I have to ask, why are you using JDBC at all? Databases come with better bulk import/export tools that you are liable to write yourself why not just use them?
Unfortunately, these tools aren't always made freely available to us peons. I could only use SQL*Loader by logging into a remote machine for example. But certainly it's an option.
Another possibility is to use an ETL tool such as the Kettle app from Pentaho. It should be capable of doing the required extraction and transformations and you can use a GUI designer (Spoon) to construct the extraction and transformation rules. The Pentaho suite is written in Java, and can be extended via custom Java code, but there's nothing exotic enough for this particular project to need that. And it already has various tricks built in to enhance the performance of the process.
Definitely streaming is the way to go. I don't recommend dumping this much data into RAM unless you intend to do some heavy cross-referencing. Which is probably better done on the database backend in any case.
Customer surveys are for companies who didn't pay proper attention to begin with.
If core java is only option available to you then I would prefer so bring down the batch size down and free the resource as soon as processing is done.
For example, free result set once you build hash table. free hash table objects as soons as you write it to file.
Also why do you need hash table.. you can just read from resultset process the date columns and write to file..that way you would save some memory.
SCJP 5, SCDJWS<br /> <br />It's amazing how premature optimisation is both seductive and destructive; even when you know