Performant Design Question

 
Ranch Hand
Posts: 100
Hey everyone,

I have a quick question about how to design a batch process in Java that reads data from files and inserts it into a database.

In a nutshell, I am going to receive numerous CSV files every day from different organisations. All of the files have the same format and contain people, cases and alerts. A person can be in zero to many cases, and an alert is always associated with a case. Each file can be in any order, with people, cases and alerts spread throughout it.

My knee-jerk solution is to read each line of the file into memory, sorting the people, cases and alerts into separate HashMaps as I go. I can then process the people, cases and alerts in sequence, rejecting any records that are invalid.
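
Something along these lines (simplified, and assuming each row carries a record-type code in the first column and an id in the second; I haven't settled on the exact layout):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvSorter {
    public static void main(String[] args) throws IOException {
        // Raw fields keyed by each record's id column.
        Map<String, String[]> people = new HashMap<String, String[]>();
        Map<String, String[]> cases  = new HashMap<String, String[]>();
        Map<String, String[]> alerts = new HashMap<String, String[]>();

        BufferedReader in = new BufferedReader(new FileReader("daily.csv"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                if ("PERSON".equals(f[0])) {
                    people.put(f[1], f);
                } else if ("CASE".equals(f[0])) {
                    cases.put(f[1], f);
                } else if ("ALERT".equals(f[0])) {
                    alerts.put(f[1], f);
                }
            }
        } finally {
            in.close();
        }
        // Then validate and insert: people first, then cases, then alerts.
    }
}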

However, with this solution, if a file is big (e.g. 100,000 lines) I will have memory issues because I'm caching everything. Given that I can't rely on the structure of the file, though, I can't think of a better way to do this.

Any ideas?

Thanks,

Chris

 
Sheriff
Posts: 3837
"100,000 lines" tells nothing if you don't specify a line length. Assuming 100 bytes per line, we get a modest 10 MB file. This does not seem to be a problem nowadays. Even if it is, installing another 1 GB DIMM into your production box might be cheaper than designing your program to work on 256 MB box.

Anyway, I can envision these alternatives:

1) Pass through the file several times, processing cases in the first pass, then people, then alerts (assuming such an order can be made to work). There's a sketch of this approach after the list.

2) If the files can get really large, pass through each one once, splitting it into separate cases, people and alerts files. Then either process those files as in the previous option, or, if relations among the records must be processed, sort the files on disk and use some sort of binary search or index to navigate through them. That sounds like an awful lot of work, actually.

3) If your database supports it, load the whole file into a staging table and process it entirely with SQL. You might load the file using specialized database tools (e.g. SQL*Loader for Oracle, though on Oracle another facility, external tables, might work well for this case), which are fast and require no programming on your part. Processing the file in SQL can be challenging if you don't know SQL really well, but done right it might perform better than all the other solutions, depending on the database and the details of your process (on Oracle you might be able to use an insert with the APPEND hint, for example).
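
For illustration, the multiple-pass approach might look roughly like this. The record-type column and the pass order are assumptions about your file format, and insertRecord is just a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MultiPassLoader {
    public static void main(String[] args) throws IOException {
        // One full pass per record type; memory use stays flat no matter
        // how large the file gets.
        String[] passes = {"CASE", "PERSON", "ALERT"};
        for (String type : passes) {
            BufferedReader in = new BufferedReader(new FileReader("daily.csv"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    if (type.equals(f[0])) {
                        insertRecord(type, f); // e.g. add the row to a JDBC batch
                    }
                }
            } finally {
                in.close();
            }
        }
    }

    private static void insertRecord(String type, String[] fields) {
        // Placeholder: validate the row and write it to the database.
    }
}

The price is reading the file three times, but sequential reads of a 10 MB file are cheap.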
 
Ranch Hand
Posts: 99
If you are not required to use Java, Perl is excellent for this type of task.
 
Rancher
Posts: 4803

Philip Thamaravelil wrote: If you are not required to use Java, Perl is excellent for this type of task.

Depending on the DBMS package you have, you might not even need Perl.

MySQL can read CSV files directly, and with a short LOAD DATA statement it can even read fixed-format data and split it into columns.

Nearly anything that can be done directly by the DBMS will be faster than any program in any language, be it Java, Perl, or Fortran.
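
For example, with MySQL you can fill a staging table straight from JDBC using LOAD DATA. This is just a sketch: the connection URL, table and file names are made up, and LOCAL INFILE has to be enabled on both the client and the server:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingLoad {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        try {
            Statement st = con.createStatement();
            // Bulk-load the CSV into a staging table, then clean it up with SQL.
            st.execute(
                "LOAD DATA LOCAL INFILE 'daily.csv' " +
                "INTO TABLE staging " +
                "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n'");
            st.close();
        } finally {
            con.close();
        }
    }
}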
 