Meghana Reddy
ejaz khan wrote: Before you start processing, I would suggest opening the file in EditPlus 3.41 and going to Edit --> Delete --> Delete Duplicate Lines.
It will help you remove all the duplicate records quickly.
ejaz khan wrote: Another point: do not let your Java program do the duplicate hunting. Instead, create unique/primary key rules in the DBMS and let the DBMS reject your duplicate records.
Secondly, if the records do not depend on one another, you can also use the UNIX split-file technique. That way you will have multiple smaller files to process.
You can then use multiple threads to read the split files in parallel, which can increase your read efficiency.
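To make the split-and-read-in-parallel suggestion concrete, here is a minimal sketch, assuming the big file has already been cut into chunk files (for example with `split -l 1000000 input.txt chunk_`). The class name `ParallelSplitReader`, the methods `processChunk` and `processAll`, and the pool size of 4 are made up for illustration; the per-chunk work is only a line count standing in for real record processing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: one worker task per chunk file.
class ParallelSplitReader {

    // Counts the lines in one chunk. Real code would parse and process each record here.
    static long processChunk(Path chunk) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Submits one task per chunk and sums the per-chunk record counts.
    static long processAll(List<Path> chunks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path chunk : chunks) {
                results.add(pool.submit(() -> processChunk(chunk)));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // rethrows a worker failure as ExecutionException
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Chunk file paths are passed on the command line.
        List<Path> chunks = new ArrayList<>();
        for (String arg : args) {
            chunks.add(Paths.get(arg));
        }
        System.out.println("records read: " + processAll(chunks, 4));
    }
}
```

How much this helps depends on the storage: on a single spinning disk the threads may just compete for the same device, so the gain tends to come from overlapping the parsing and rule evaluation rather than from the raw reads.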
Deepak Bala wrote: but this sounds more like a job for an ETL tool to me. Extract the contents of the text file -> Transform the values and eliminate duplicates -> Load to DB.
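For anyone who wants to see the Extract -> Transform -> Load flow in plain Java before reaching for a tool, here is a rough sketch. It is not what an ETL product does internally; the class `MiniEtl`, the method `run`, and the `Consumer<String>` loader are hypothetical names, and the loader is only a stand-in for a batched DB insert. Duplicates here mean exact duplicate lines after trimming.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical miniature ETL pipeline: extract lines, drop duplicates, load the rest.
class MiniEtl {

    // Streams the source line by line and returns how many unique records reached the loader.
    static int run(Reader source, Consumer<String> loader) throws IOException {
        Set<String> seen = new HashSet<>();
        int loaded = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {   // extract
                String record = line.trim();               // transform: strip surrounding whitespace
                if (record.isEmpty()) {
                    continue;                              // skip blank lines
                }
                if (seen.add(record)) {                    // add() returns false for a duplicate
                    loader.accept(record);                 // load: stand-in for a DB insert
                    loaded++;
                }
            }
        }
        return loaded;
    }

    public static void main(String[] args) throws IOException {
        List<String> sink = new ArrayList<>();
        int n = run(new StringReader("alice,30\nbob,41\nalice,30\n"), sink::add);
        System.out.println(n + " unique records: " + sink);
    }
}
```

Keeping every record in a `HashSet` costs heap in proportion to the number of unique records, so for a 4-5 GB file this needs a generous heap; storing a fixed-size hash of each record instead of the full line, or leaving the check to a unique key in the database, are the usual ways around that.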
Meghana Reddy
Meghana Reddy wrote: We have a requirement here where we have to process a huge demographic file (millions of records, possibly 4-5 GB in size).
This file could have duplicate records that should be eliminated. After that, we apply some business rules (developed in Java; we are now considering implementing a rules engine) before populating all those records in a DB.