
Suggestion Needed on tuning the performance.

 
Seenu ram
Greenhorn
Posts: 8
Hi,

I need some suggestions on the issue below in Java.

Scenario:

In the application, there is a screen where the user can upload data from a CSV file. Typically, the CSV file contains around 30,000-50,000 records. On uploading the file, the Java program has to verify whether there are any duplicate records in the CSV file. If a record is a duplicate, the program has to skip it and continue with the next record; otherwise, the code has to insert the data into the database.

Note: As per the application architecture, the application inserts the first 1,000 records, then considers the next 1,000 records, and so on...

Please suggest.
 
Kees Jan Koster
JavaMonitor Support
Rancher
Posts: 251
Dear Seenu,

I would not write this in Java, but script it in shell. Use sort(1) to sort the CSV, then drop the duplicates with uniq(1), and finally use the database command line tool to load the data into the database.

Kees Jan
 
Seenu ram
Greenhorn
Posts: 8
Thanks!... It seems there is a chance to sort the CSV file first, after the upload but before the insert.

But is sorting the CSV file costly in Java (as I have no option of using a shell for this task...)?
 
Pat Farrell
Rancher
Posts: 4660
Depends a bit on what you consider a "duplicate" record. At a minimum, you want to call trim() on the input record, and probably toLowerCase() if you can. It also depends a lot on how long the CSV records are. Under 50 to 100 characters is probably OK; otherwise, you have serious issues with memory size.

The obvious approach is to create a HashSet and store each record in it. When you get a new record, check the HashSet to see whether it already contains that record, and if so, move on to the next one (see the sketch at the end of this post).

The definition of "duplicate" is critical. Consider as an example:

Kees, 1, yes
Seenu,2,no
Pat,3,maybe

Is the record
Kees,1,yes
a duplicate? How about
Kees,001,yes
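
Something along these lines, as a rough sketch only. The normalize() method (trim, lowercase, strip leading zeros) is just one possible definition of "duplicate", and the class and method names are made up for the example, so adjust it to whatever your data actually requires:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CsvDeduper {

    // Build a comparison key so that "Kees, 1, yes" and "Kees,001,yes" compare equal.
    static String normalize(String line) {
        String[] fields = line.split(",", -1);
        StringBuilder key = new StringBuilder();
        for (String field : fields) {
            String value = field.trim().toLowerCase();
            // Strip leading zeros so "001" and "1" produce the same key
            value = value.replaceFirst("^0+(?=\\d)", "");
            key.append(value).append('|');
        }
        return key.toString();
    }

    // Return the unique records in their original order, skipping duplicates.
    static List<String> dedupe(String csvPath) throws IOException {
        Set<String> seen = new HashSet<String>();
        List<String> unique = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(csvPath));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (seen.add(normalize(line))) {   // add() returns false for a duplicate key
                    unique.add(line);
                }
            }
        } finally {
            in.close();
        }
        return unique;
    }
}

For 30,000-50,000 short records, the set of keys fits comfortably in memory, which is why the record length matters.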


 
Tim Holloway
Saloon Keeper
Posts: 17639
Sorting a CSV file using a plain text sort utility probably won't work, since the records frequently have variable-width fields.

When you're talking 30K records and upwards, however, I start thinking about things like databases, since holding it all in memory is likely to consume RAM by the megabyte.

In the Real World, most likely I'd run the CSV through something that could convert the data to fixed columns, sort the data using a sort/merge utility, and then filter for duplicates. In the old mainframe days of yore, we'd probably even add a sort exit that did the removal of duplicates, although Unix/Linux can also pipe through "uniq".

Of course, in the mainframe days of yore, we had to do things like that, since megabytes of RAM was probably more than the entire machine had, much less any one application. Sorting typically involved 3-5 work files for the utility to hold intermediate results.

The brute-force load to database is fairly simple, but rarely performant, since the indexes spend a lot of time being reorganized, and that's expensive. Database loads are best done with indexing disabled until after the entire data collection has been loaded.
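
For example, assuming a MySQL-style target and a hypothetical table called records (other databases need different mechanisms, such as dropping and recreating the indexes), the index maintenance can be deferred around the load, roughly like this:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class BulkLoadHelper {

    // Sketch only: defer index maintenance around a bulk load.
    // ALTER TABLE ... DISABLE KEYS is MySQL (MyISAM) syntax; adapt it for your database.
    public static void withIndexesDisabled(Connection conn, Runnable load) throws SQLException {
        Statement stmt = conn.createStatement();
        try {
            stmt.execute("ALTER TABLE records DISABLE KEYS");    // stop maintaining non-unique indexes
            try {
                load.run();                                      // do the batched inserts here
            } finally {
                stmt.execute("ALTER TABLE records ENABLE KEYS"); // rebuild the indexes once, afterwards
            }
        } finally {
            stmt.close();
        }
    }
}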
 
Anish Kuti
Greenhorn
Posts: 29
If checking for the duplicates and removing them is costly in your case, another alternative is to pass all the records to the database (insert the records using addBatch) and then invoke a procedure in the database which does this kind of duplicate check and removal.
I believe the number of records to be handled will not create any performance problem on the database front.
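
A rough sketch of that idea, with a made-up staging table, column layout, and procedure name, and batches of 1,000 to match the architecture described above:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class StagingLoader {

    private static final int BATCH_SIZE = 1000; // the "insert 1,000 records at a time" rule

    // Push every raw record into a staging table, then let the database remove duplicates.
    public static void loadAndDedupe(Connection conn, List<String[]> records) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO staging_records (col1, col2, col3) VALUES (?, ?, ?)");
        try {
            int count = 0;
            for (String[] r : records) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.setString(3, r[2]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();           // flush every 1,000 rows
                }
            }
            ps.executeBatch();                   // flush the final partial batch
        } finally {
            ps.close();
        }
        // Server-side cleanup: a stored procedure that drops duplicates and
        // moves the surviving rows into the real table.
        CallableStatement cs = conn.prepareCall("{call remove_duplicates_and_merge()}");
        try {
            cs.execute();
        } finally {
            cs.close();
        }
    }
}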
 
R van Vliet
Ranch Hand
Posts: 144
Can you give us some details on the size and properties of a record, and on what kind of hardware this is supposed to be running on? 50k records is practically nothing. Short of every CSV record taking 5-10 KB, or the uniqueness check being complicated for some reason, there is no reason this can't all be done in a single-digit number of seconds, and in memory, by the way.
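
As rough arithmetic: 50,000 records at about 100 bytes each is only around 5 MB of raw data, and even at 1 KB per record it is about 50 MB, so an in-memory duplicate check is well within reach on ordinary hardware.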