File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
The moose likes Performance and the fly likes Performant Design Question Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Performance
Bookmark "Performant Design Question" Watch "Performant Design Question" New topic

Performant Design Question

Chris Bicnal
Ranch Hand

Joined: Aug 17, 2005
Posts: 80
Hey everyone,

I have a quick question about how to design a batch process to read data from files and insert into a database with Java.

In a nutshell, I am going to receive numerous CSV files every day from different organisations. All files are of the same format and contain people, cases and alerts. A person can be in zero to many cases. And an alert is always associated to a case. Each file can be in any order with people, cases and alerts spread throughout the file.

My knee jerk solution to this problem is to read each line of the file into memory sorting the people, cases and alerts into separate Hashmaps as I go. I can then process the people, cases and alerts in sequence rejecting any records that may be invalid.

However, with this solution if the file is big (i.e. 100,000 lines) I will have memory issues if I cache everything. Although given that I can't rely on the structure of the file I can't think of any better way to do this.

Any ideas?



Martin Vajsar

Joined: Aug 22, 2010
Posts: 3733

"100,000 lines" tells nothing if you don't specify a line length. Assuming 100 bytes per line, we get a modest 10 MB file. This does not seem to be a problem nowadays. Even if it is, installing another 1 GB DIMM into your production box might be cheaper than designing your program to work on 256 MB box.

Anyway, I can envision these alternatives:

1) Pass through the file several times, processing cases in the first pass, then people, then alerts (assuming such order can be made to work).

2) If the files can get really large, pass through it once, splitting it into separate cases, people and alerts files. Then either process the files as in the previous case, or, if some relations among the records must be processed, sort the files on the disc and use some sort of binary search or indexes to navigate through them. Sounds like an awful lot of work, actually.

3) If your database supports it, load the whole file into a special staging table and process it completely using SQL commands. You might load the file using specialized database tools (eg. SQL Loader for Oracle, though on Oracle another facility - external tables - might work great for this case), which are fast and require no programming on your part. Processing the file using SQL might be challenging if you don't know SQL really well, but if done right, it might perform better than all other solutions; depending on the database and details of your process (in Oracle you might be able to use insert with the append hint, for example).
Philip Thamaravelil
Ranch Hand

Joined: Feb 09, 2006
Posts: 99
If you are not required to use Java., Perl is excellent for this type of task.
Pat Farrell

Joined: Aug 11, 2007
Posts: 4659

Philip Thamaravelil wrote:If you are not required to use Java., Perl is excellent for this type of task.

Depending on the DBMS package you have, you might not even need Perl.

MySql can directly read CSV files and with a ten line insert command, it can read fixed format data and split it up.

Nearly anything that can be done directly by the DBMS will be faster than any program in any language, be it Java, Perl, or Fortran.
I agree. Here's the link:
subject: Performant Design Question
It's not a secret anymore!