I have a quick question about how to design a batch process to read data from files and insert into a database with Java.
In a nutshell, I am going to receive numerous CSV files every day from different organisations. All files are of the same format and contain people, cases and alerts. A person can be in zero to many cases. And an alert is always associated to a case. Each file can be in any order with people, cases and alerts spread throughout the file.
My knee jerk solution to this problem is to read each line of the file into memory sorting the people, cases and alerts into separate Hashmaps as I go. I can then process the people, cases and alerts in sequence rejecting any records that may be invalid.
However, with this solution if the file is big (i.e. 100,000 lines) I will have memory issues if I cache everything. Although given that I can't rely on the structure of the file I can't think of any better way to do this.
"100,000 lines" tells nothing if you don't specify a line length. Assuming 100 bytes per line, we get a modest 10 MB file. This does not seem to be a problem nowadays. Even if it is, installing another 1 GB DIMM into your production box might be cheaper than designing your program to work on 256 MB box.
Anyway, I can envision these alternatives:
1) Pass through the file several times, processing cases in the first pass, then people, then alerts (assuming such order can be made to work).
2) If the files can get really large, pass through it once, splitting it into separate cases, people and alerts files. Then either process the files as in the previous case, or, if some relations among the records must be processed, sort the files on the disc and use some sort of binary search or indexes to navigate through them. Sounds like an awful lot of work, actually.
3) If your database supports it, load the whole file into a special staging table and process it completely using SQL commands. You might load the file using specialized database tools (eg. SQL Loader for Oracle, though on Oracle another facility - external tables - might work great for this case), which are fast and require no programming on your part. Processing the file using SQL might be challenging if you don't know SQL really well, but if done right, it might perform better than all other solutions; depending on the database and details of your process (in Oracle you might be able to use insert with the append hint, for example).