Suggestion needed on tuning the performance

 
Greenhorn
Posts: 8
Hi,

I need some suggestions on the issue below in Java.

Scenario:

In the application, there is a screen where the user can upload data from a CSV file. Typically, the CSV file contains around 30,000-50,000 records. On uploading the file, the Java program has to verify whether there are any duplicate records in the CSV file. If a record is a duplicate, the program has to exclude it and continue with the next record. If it is not a duplicate, the code has to insert the data into the database.

Note: As per the application architecture, the application inserts the first 1000 records, then considers the next 1000 records, and so on...

Please suggest
 
JavaMonitor Support
Posts: 251
Dear Seenu,

I would not write this in Java, but script it in shell. Use sort(1) to sort the CSV, then drop the duplicates with uniq(1), and finally use the database's command-line tool to load the data into the database.

Kees Jan
 
Seenu ram
Greenhorn
Posts: 8
Thanks! It seems there is a chance to sort the CSV file first, after upload but before the insert.

But is sorting the CSV file costly in Java (as I have no option of using a shell for this task...)?
 
Rancher
Posts: 4686
Depends a bit on what you consider a "duplicate" record. As a minimum, you want to call trim() on the input record, and probably toLowerCase() if you can. It also depends a lot on how long the CSV records are. Under 50 to 100 characters is probably OK; otherwise, you may have serious issues with memory size.

The obvious approach is to create a HashSet and store each record in it. When you get a new record, check the HashSet to see whether the record is already present, and if so, skip it and move on to the next.

The definition of "duplicate" is critical. Consider as an example:

Kees, 1, yes
Seenu,2,no
Pat,3,maybe

Is the record
Kees,1,yes
a duplicate? How about
Kees,001,yes
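
For reference, here is a minimal sketch of that HashSet approach, assuming a "duplicate" means the whole line matches after trimming and lower-casing, that the file is called upload.csv, and that a hypothetical insertIntoDatabase() method stands in for your existing insert logic:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class CsvDeduplicator {

    // Normalize a record so that near-identical lines compare equal.
    // Adjust this to whatever definition of "duplicate" you settle on
    // (for example, build the key from only certain columns).
    private static String normalize(String line) {
        return line.trim().toLowerCase();
    }

    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("upload.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;                     // skip blank lines
                }
                if (seen.add(normalize(line))) {  // add() returns false for duplicates
                    insertIntoDatabase(line);     // first occurrence: insert it
                }                                 // duplicate: skip it
            }
        }
    }

    // Placeholder for your existing insert logic (e.g. a JDBC batch insert).
    private static void insertIntoDatabase(String line) {
        // ...
    }
}

Memory-wise, 50,000 normalized strings of under 100 characters each come to very roughly 10-20 MB with object overhead, which is why the record length matters.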


 
Saloon Keeper
Posts: 22503
Sorting a CSV file using a plain text sort utility probably won't work, since the records frequently have variable-width fields.

When you're talking 30K records and upwards, however, I start thinking of things like databases, since that's likely to consume RAM by the megabyte.

In the Real World, most likely I'd run the CSV through something that could convert the data to fixed columns, sort the data using a sort/merge utility, and then filter for duplicates. In the old mainframe days of yore, we'd probably even add a sort exit that did the removal of duplicates, although Unix/Linux can also pipe through "uniq".

Of course, in the mainframe days of yore, we had to do things like that, since megabytes of RAM was probably more than the entire machine had, much less any one application. Sorting typically involved 3-5 work files for the utility to hold intermediate results.

The brute-force load to database is fairly simple, but rarely performant, since the indexes spend a lot of time being reorganized, and that's expensive. Database loads are best done with indexing disabled until after the entire data collection has been loaded.
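
As a sketch of that last point only: the exact statements are database-specific, and the MySQL/MyISAM DISABLE KEYS syntax, the connection details, and the "records" table below are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {

            // MySQL/MyISAM syntax; other databases need a different approach,
            // e.g. dropping the indexes and re-creating them afterwards.
            stmt.execute("ALTER TABLE records DISABLE KEYS");

            // ... run the batch inserts here ...

            // Rebuild the indexes once, after all rows are in.
            stmt.execute("ALTER TABLE records ENABLE KEYS");
        }
    }
}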
 
Greenhorn
Posts: 29
If checking for duplicates and removing them is costly in your case, another alternative is to pass all the records to the database (insert the records using addBatch) and then invoke a procedure in the database that does the duplicate check and removal. A rough sketch of that is shown below.
I believe the number of records to be handled will not create any performance problem on the database front.
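
Roughly, the batch-insert side could look like this. The connection details, the staging_records table, and the remove_duplicates() procedure are made-up placeholders; the procedure would hold the duplicate check on the database side.

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchUploader {

    private static final int BATCH_SIZE = 1000; // matches the 1000-records-at-a-time architecture

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "password"); // placeholder details
             BufferedReader reader = Files.newBufferedReader(Paths.get("upload.csv"));
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO staging_records (raw_line) VALUES (?)")) {

            conn.setAutoCommit(false);
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                insert.setString(1, line);
                insert.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    insert.executeBatch();   // send 1000 rows at a time
                }
            }
            insert.executeBatch();           // flush the remainder
            conn.commit();

            // Let the database remove duplicates and move the rows to the
            // real table; the procedure name is made up.
            try (CallableStatement dedupe = conn.prepareCall("{call remove_duplicates()}")) {
                dedupe.execute();
            }
            conn.commit();
        }
    }
}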
 
Ranch Hand
Posts: 144
Can you give us some details on the size and properties of a record, and on what kind of hardware this is supposed to be running on? 50k records is practically nothing. Unless every CSV record takes 5-10 KB or the uniqueness check is complicated for some reason, there is no reason this can't all be done in memory, within a single-digit number of seconds.
 