Suggestion Needed on tuning the performance.

Seenu ram

Joined: Sep 12, 2009
Posts: 8

I need some suggestions on the issue below in Java.


In the application, there is a screen where the user can upload data from a CSV file. Typically, the CSV file contains around 30000-50000 records. On upload, the Java program has to check whether there are any duplicate records in the CSV file. If a record is a duplicate, the program has to exclude it and continue with the next record. If it is not a duplicate, the code has to insert the data into the database.

Note: As per the application architecture, the application inserts the first 1000 records, then considers the next 1000 records, and so on.

Please suggest
Kees Jan Koster
JavaMonitor Support

Joined: Mar 31, 2009
Posts: 251
Dear Seenu,

I would not write this in Java, but script it in shell. Use sort(1) to sort the csv, then drop the duplicates with uniq(1) and finally use the database command line tool to load the data into the database.
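Where a shell is not available, the sort(1) and uniq(1) steps have a rough in-process equivalent. The following is only a sketch under stated assumptions: the class and method names are hypothetical, the whole file is assumed to fit in memory, and records count as duplicates only when the lines are exactly equal, which is how uniq(1) compares them by default. The original record order is lost, as it is with sort(1).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of any library.
class SortUniq {
    // Sorts a copy of the lines, then keeps only the first of each run of equal lines.
    static List<String> sortAndDropDuplicates(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        Collections.sort(sorted); // natural String order, O(n log n)

        List<String> result = new ArrayList<>();
        String previous = null;
        for (String line : sorted) {
            // After sorting, duplicates are adjacent, so one comparison per line is enough.
            if (!line.equals(previous)) {
                result.add(line);
            }
            previous = line;
        }
        return result;
    }
}
```

The surviving lines could then be handed to whatever loads them into the database.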

Kees Jan

Java-monitor, JVM monitoring made easy <- right here on Java Ranch
Seenu ram

Joined: Sep 12, 2009
Posts: 8
Thanks! It looks like there is a chance to sort the CSV file after upload but before the insert.

But is sorting the CSV file costly in Java? (I have no option of using a shell for this task.)
Pat Farrell

Joined: Aug 11, 2007
Posts: 4659

Depends a bit on what you consider a "duplicate" record. At a minimum, you want to call "trim()" on the input record. And probably toLowerCase() if you can. It also depends a lot on how long the CSV records are. Under 50 to 100 characters is probably OK, otherwise, you have serious issues with memory size.

The obvious approach is to create a HashSet, and store each record in it. When you get a new record, query the hashset to see if it has a duplicate, and if so, move to the next.
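A minimal sketch of that HashSet approach could look like the following. The class and method names are hypothetical, and it assumes that two records are duplicates when the whole line matches after trim() and toLowerCase(); a different definition of "duplicate" would only change normalize().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any library.
class CsvDeduplicator {
    // Assumption: duplicates are lines that are equal after trimming and lower-casing.
    static String normalize(String record) {
        return record.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the records in their original order, keeping the first occurrence of each.
    static List<String> dropDuplicates(List<String> records) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String record : records) {
            // Set.add returns false when the normalized key is already present.
            if (seen.add(normalize(record))) {
                unique.add(record);
            }
        }
        return unique;
    }
}
```

Each lookup is a hash probe, so the cost grows roughly linearly with the number of records, and the set holds one normalized copy of every distinct record.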

The definition of "duplicate" is critical. Consider as an example:

Kees, 1, yes

Is the record
a duplicate? How about

Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 17417

Sorting a CSV file using a plain text sort utility probably won't work, since the records frequently have variable-width fields.

When you're talking 30K records and upwards, however, I start thinking of things like databases, since that's likely to consume RAM by the megabyte.

In the Real World, most likely I'd run the CSV through something that could convert the data to fixed columns, sort the data using a sort/merge utility, and then filter for duplicates. In the old mainframe days of yore, we'd probably even add a sort exit that did the removal of duplicates, although Unix/Linux can also pipe through "uniq".

Of course, in the mainframe days of yore, we had to do things like that, since megabytes of RAM was probably more than the entire machine had, much less any one application. Sorting typically involved 3-5 work files for the utility to hold intermediate results.

The brute-force load to database is fairly simple, but rarely performant, since the indexes spend a lot of time being reorganized, and that's expensive. Database loads are best done with indexing disabled until after the entire data collection has been loaded.

An IDE is no substitute for an Intelligent Developer.
Anish Kuti

Joined: May 12, 2008
Posts: 29
If checking for duplicates and removing them is costly in your case, another alternative is to pass all the records to the database (insert them using addBatch) and then invoke a separate procedure in the database that does the duplicate check and removes the duplicates.
I believe the number of records to be handled will not create any performance problem on the database side.
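The addBatch part could be sketched roughly as below. This is only an illustration under assumptions: the table csv_staging and its three columns are hypothetical, every row is assumed to have three fields, and the batch size of 1000 is taken from the chunking described in the original question. The duplicate-removal procedure on the database side is not shown, since it depends on the schema and the database product.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper, not part of any library.
class BatchInserter {
    // Hypothetical staging table and column names; adjust to the real schema.
    static final String INSERT_SQL =
        "INSERT INTO csv_staging (col_a, col_b, col_c) VALUES (?, ?, ?)";
    static final int BATCH_SIZE = 1000; // assumption: mirrors the 1000-record chunks

    // Number of executeBatch() round trips needed for rowCount rows.
    static int batchCount(int rowCount) {
        return (rowCount + BATCH_SIZE - 1) / BATCH_SIZE;
    }

    // Inserts all rows in batches inside one transaction; returns the number of batches sent.
    static int insertAll(Connection conn, List<String[]> rows) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String[] row : rows) {
                for (int i = 0; i < 3; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                batches++;
            }
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback(); // leave nothing half-loaded
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return batches;
    }
}
```

For a 50000-record file, batchCount(50000) gives 50 round trips instead of 50000 single-row inserts.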
R van Vliet
Ranch Hand

Joined: Nov 10, 2007
Posts: 144
Can you give us some details on the size and properties of a record, and on what kind of hardware this is supposed to run on? 50k records is practically nothing. Short of every CSV record taking 5-10k and the uniqueness check being complicated for some reason, there is no reason this can't all be done within a single-digit number of seconds, and in memory, by the way.