I need some suggestions on the following issue in Java.
In the application, there is a screen where the user can upload data from a CSV file. Typically, the CSV file contains around 30000-50000 records. On uploading the file, the Java program has to verify whether there are any duplicate records in the CSV file. If a record is a duplicate, the program has to exclude it and continue with the next record. Otherwise, the code has to insert the data into the database.
Note: As per the application architecture, the application inserts the first 1000 records, then considers the next 1000 records, and so on...
I would not write this in Java, but script it in shell. Use sort(1) to sort the csv, then drop the duplicates with uniq(1) and finally use the database command line tool to load the data into the database.
Depends a bit on what you consider a "duplicate" record. As a minimum, you want to call "trim()" on the input record. And probably toLowerCase() if you can. It also depends a lot on how long the CSV records are. Under 50 to 100 characters is probably OK; otherwise, you may have serious issues with memory size.
The obvious approach is to create a HashSet and store each record in it. When you get a new record, query the HashSet to see whether it already contains that record, and if so, skip it and move on to the next.
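A minimal sketch of that HashSet approach, assuming a "duplicate" means the whole line matches after trim() and toLowerCase(). The class name CsvDedup and the normalize helper are made up for illustration; a real definition of "duplicate" might compare only some columns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CsvDedup {
    // Assumption: two records are duplicates when they are equal after
    // trimming surrounding whitespace and lower-casing the whole line.
    static String normalize(String record) {
        return record.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the records whose normalized form has not been seen before,
    // keeping the first occurrence and preserving input order.
    static List<String> dedupe(List<String> records) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String record : records) {
            // Set.add returns false when the element is already present.
            if (seen.add(normalize(record))) {
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Kees, 1, yes", "  kees, 1, yes ", "Anna, 2, no");
        System.out.println(dedupe(input));
    }
}
```

Using the return value of add() avoids a separate contains() lookup. Note that this only compares whole lines; whitespace differences inside a record still count as different records.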
The definition of "duplicate" is critical. Consider as an example:
Kees, 1, yes
Is the record
a duplicate? How about
Sorting a CSV file using a plain text sort utility probably won't work, since the records frequently have variable-width fields.
When you're talking 30K records and upwards however, I start thinking things like databases, since that's likely to consume RAM by the megabyte.
In the Real World, most likely I'd run the CSV through something that could convert the data to fixed columns, sort the data using a sort/merge utility, and then filter for duplicates. In the old mainframe days of yore, we'd probably even add a sort exit that did the removal of duplicates, although Unix/Linux can also pipe through "uniq".
Of course, in the mainframe days of yore, we had to do things like that, since megabytes of RAM was probably more than the entire machine had, much less any one application. Sorting typically involved 3-5 work files for the utility to hold intermediate results.
The brute-force load to database is fairly simple, but rarely performant, since the indexes spend a lot of time being reorganized, and that's expensive. Database loads are best done with indexing disabled until after the entire data collection has been loaded.
If checking for duplicates and removing them is costly in your case, another alternative is to pass all the records to the database (insert them using addBatch) and then invoke a procedure in the database that checks for duplicates and removes them.
I believe the number of records to be handled will not create any performance problem on the database side.
Can you give us some details on the size and properties of a record and on what kind of hardware this is supposed to be running on? 50k records is practically nothing. Short of every CSV record taking 5-10k and the uniqueness check being complicated for some reason, there is no reason this can't all be done within a single-digit number of seconds, and in memory, by the way.
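To tie the in-memory check back to the 1000-records-at-a-time constraint from the original question, here is a rough sketch. The key point is that the set of seen records has to live across all batches, otherwise a duplicate in batch 7 of a record in batch 2 slips through. The class BatchedUniqueLoader and the sink parameter are hypothetical; the sink stands in for whatever actually inserts a batch into the database, and "duplicate" is assumed to mean equal after trim().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

class BatchedUniqueLoader {
    // Hands unique records to the sink in chunks of at most batchSize
    // (assumed to be positive) and returns how many duplicates were skipped.
    static int load(Iterable<String> records, int batchSize, Consumer<List<String>> sink) {
        Set<String> seen = new HashSet<>();            // shared across all batches
        List<String> batch = new ArrayList<>(batchSize);
        int skipped = 0;
        for (String record : records) {
            if (!seen.add(record.trim())) {            // already seen: exclude it
                skipped++;
                continue;
            }
            batch.add(record);
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch));   // pass a copy, then reuse the buffer
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));       // final partial batch
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a,1", "b,2", "a,1", "c,3");
        int skipped = load(input, 2, b -> System.out.println("insert batch: " + b));
        System.out.println("duplicates skipped: " + skipped);
    }
}
```

For 50k short records the set is small, so memory should not be a concern; if records were very long, storing a hash of each record instead of the record itself would be one option, at the cost of a small chance of false duplicates.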