Processing large files

 
Ranch Hand
Posts: 76
Hi,

We have a requirement here where we have to process a huge demographic file (millions of records, possibly 4-5 GB in size).

This file could contain duplicate records that should be eliminated, and after that we apply some business rules (developed in Java; we are now considering a rules engine) before populating all those records in a DB.

Right now we are doing everything sequentially, and we are able to process only 20 records per second, which does not meet our SLA, so I'm looking for opportunities/ideas to improve the speed of processing.

I'm thinking of separating out the tasks involved in processing this file and seeing which of them can be executed in parallel.
I've read about the Map/Reduce approach - is this use case a good candidate for Map/Reduce?
What is the best approach to eliminating the duplicates from such a large data set?
Any other thoughts?
 
Greenhorn
Posts: 5
I would suggest, before you start processing, opening the file in EditPlus 3.41 and going to Edit --> Delete --> Delete Duplicate Lines.
It will help you remove all the duplicate records quickly.
 
ejaz khan
Greenhorn
Posts: 5
Another point: do not let your Java program do the duplicate hunting; instead, create unique/primary key constraints on the DBMS and let the DBMS reject your duplicate records.
Secondly, if the records have no dependencies on each other, you can also use the UNIX split technique. That way you will have multiple smaller files to process.
You can then use multiple threads to read the split files in parallel, which will increase your read efficiency.
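For illustration, a minimal sketch of that split-and-parallelise idea, assuming the big file has already been divided with the UNIX split utility into pieces under a parts/ directory; the directory name and the per-record work shown here are placeholders, not anything from this thread:

import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SplitFileProcessor {

    public static void main(String[] args) throws Exception {
        // Directory containing the pieces produced by e.g. "split -l 1000000 demographics.txt parts/"
        Path partsDir = Paths.get("parts");

        List<Path> parts;
        try (Stream<Path> files = Files.list(partsDir)) {
            parts = files.collect(Collectors.toList());
        }

        // One worker thread per CPU core; each worker streams one split file.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<Future<Long>> results = parts.stream()
                .map(part -> pool.submit(() -> processOneFile(part)))
                .collect(Collectors.toList());

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();               // wait for each worker and sum the record counts
        }
        pool.shutdown();
        System.out.println("Processed " + total + " records");
    }

    // Placeholder per-file work: apply business rules / stage rows for the DB here.
    private static long processOneFile(Path part) throws IOException {
        try (Stream<String> lines = Files.lines(part)) {
            return lines.filter(line -> !line.isEmpty())
                        .count();
        }
    }
}

Whether this buys much depends on where the time actually goes: if the 20 records/sec is dominated by the business rules or by single-row DB inserts, parallelising those stages (and batching the inserts) will matter more than parallelising the raw reads.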
 
Bartender
Posts: 6663
You can use Map-Reduce to process each line of the file, yes, but this sounds more like a job for an ETL tool to me: Extract the contents of the text file -> Transform the values and eliminate duplicates -> Load into the DB.

From those millions of rows, how many duplicates can be expected?
 
Meghana Reddy
Ranch Hand
Posts: 76
Thank you for your responses:

ejaz khan wrote: I would suggest, before you start processing, opening the file in EditPlus 3.41 and going to Edit --> Delete --> Delete Duplicate Lines.
It will help you remove all the duplicate records quickly.



This is not an option, since this is not a one-time activity. This needs to be automated.

ejaz khan wrote: Another point: do not let your Java program do the duplicate hunting; instead, create unique/primary key constraints on the DBMS and let the DBMS reject your duplicate records.
Secondly, if the records have no dependencies on each other, you can also use the UNIX split technique. That way you will have multiple smaller files to process.
You can then use multiple threads to read the split files in parallel, which will increase your read efficiency.



This is another option I'm considering, but we don't import directly into the transactional database in the first shot.
We import this file into a temp table first and then start processing.
Since this is a demographic file, we luckily have the SSN in the file, which we can use as a unique key on the temp table. But the problem is that we need to know the exact values of each duplicated row, so that we can send them back in the rejected file to indicate which records were rejected and why!
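One hedged sketch of how that "which rows and why" requirement could be met with the temp-table approach: load everything into the staging table first, then query for SSNs that occur more than once and write those rows to the rejected file before enforcing the unique key. The staging_demographics table and its column names below are made-up placeholders:

import java.io.PrintWriter;
import java.sql.*;

public class DuplicateReporter {

    // Rows whose SSN occurs more than once in the staging table, with every copy listed.
    private static final String FIND_DUPLICATES =
        "SELECT s.ssn, s.first_name, s.last_name, s.dob " +
        "FROM staging_demographics s " +
        "WHERE s.ssn IN ( " +
        "    SELECT ssn FROM staging_demographics " +
        "    GROUP BY ssn HAVING COUNT(*) > 1 " +
        ") " +
        "ORDER BY s.ssn";

    public static void writeRejectFile(Connection conn, String rejectPath) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(FIND_DUPLICATES);
             ResultSet rs = ps.executeQuery();
             PrintWriter out = new PrintWriter(rejectPath)) {

            while (rs.next()) {
                // One reject line per offending row, with a reason code the caller can act on.
                out.printf("%s|%s|%s|%s|DUPLICATE_SSN%n",
                        rs.getString("ssn"),
                        rs.getString("first_name"),
                        rs.getString("last_name"),
                        rs.getString("dob"));
            }
        }
    }
}

The remaining rows (one per SSN, or none, depending on the business rule) can then be moved on to the real tables.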

Deepak Bala wrote: but this sounds more like a job for an ETL tool to me: Extract the contents of the text file -> Transform the values and eliminate duplicates -> Load into the DB.


We don't have any ETL tools or ETL expertise in our team, so we may need to get someone who knows ETL to help with this.

Other than this, would using some sort of sorting (say, a merge sort) help identify the duplicates? I'm trying to see whether someone has solved this problem in Java before (however tedious, I'm pretty sure someone has) and what their experiences have been.
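Sorting does help in principle: once the records are ordered by SSN, duplicates end up adjacent, so a single linear pass can flag them. A small sketch of that pass, assuming the file has already been sorted externally by its SSN field (for example with the UNIX sort utility, since 4-5 GB won't sort comfortably in the JVM heap); the pipe delimiter and field position are assumptions:

import java.io.BufferedReader;
import java.nio.file.*;

public class SortedDuplicateScan {

    public static void main(String[] args) throws Exception {
        // Input is assumed pre-sorted by SSN, e.g. "sort -t'|' -k1,1 demographics.txt > sorted.txt"
        Path sorted = Paths.get("sorted.txt");

        String previousSsn = null;
        long duplicates = 0;

        try (BufferedReader in = Files.newBufferedReader(sorted)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumes the SSN is the first pipe-delimited field of each record.
                String ssn = line.split("\\|", 2)[0];
                if (ssn.equals(previousSsn)) {
                    duplicates++;           // write 'line' to the reject file here
                } else {
                    previousSsn = ssn;      // first occurrence: pass it on to the business rules
                }
            }
        }
        System.out.println("Duplicate records found: " + duplicates);
    }
}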

Thanks again guys,
Meghana

 
Author and all-around good cowpoke
Posts: 13078
Yow, that's a big job for sure. How about this:

1. First pass: detect and make a list of the SSNs that have duplicates. If it will fit in memory, this could be a Set of the SSNs - which will tell you when a key is duplicated.

2. Second pass: detect the duplicate SSNs, process the non-duplicate records, and kick out the duplicates for review.
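A minimal sketch of this two-pass idea, assuming the distinct SSNs (not the full records) fit in the heap; the file name, record layout, and the two handler methods are placeholders:

import java.io.BufferedReader;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;

public class TwoPassDuplicateDetector {

    public static void main(String[] args) throws Exception {
        Path input = Paths.get("demographics.txt");

        // Pass 1: collect the SSNs that occur more than once.
        Set<String> seen = new HashSet<>();
        Set<String> duplicated = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                String ssn = extractSsn(line);
                if (!seen.add(ssn)) {        // add() returns false if the SSN was already present
                    duplicated.add(ssn);
                }
            }
        }

        // Pass 2: route records whose SSN is in the duplicate set to review,
        // and pass everything else on to the business rules / DB load.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (duplicated.contains(extractSsn(line))) {
                    rejectForReview(line);
                } else {
                    processRecord(line);
                }
            }
        }
    }

    // Assumes the SSN is the first pipe-delimited field of the record.
    private static String extractSsn(String line) {
        return line.split("\\|", 2)[0];
    }

    private static void processRecord(String line)   { /* apply business rules, stage for the DB */ }
    private static void rejectForReview(String line) { /* write to the rejected file */ }
}

As a rough estimate, a few million SSNs held as Strings in a HashSet costs on the order of a few hundred MB of heap, so the "if it will fit in memory" caveat is worth checking against the real record counts and the JVM's -Xmx setting.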

Interesting problem, let us know what you end up with.

Bill
 
Bartender
Posts: 10780

Meghana Reddy wrote: we have a requirement here where we have to process a huge demographic file (millions of records, possibly 4-5 GB in size).
This file could contain duplicate records that should be eliminated, and after that we apply some business rules (developed in Java; we are now considering a rules engine) before populating all those records in a DB.


At the risk of repeating other advice, my first thought is that you shouldn't worry about duplicates - let the DB handle that. Unless you want to pre-process with something like a shell sort, the chances are that you ain't gonna be able to do that outside the database.

My second is that insufficient thought has gone into the processes that create this data.

I used to work at the UK Census Office back in the days of magnetic tapes and, believe me, probably 50% of the IT man-hours were spent on getting that data correct. We managed to get most census runs done in around 18 hours (≈55 million records in those days) on a machine that had 1Mb (yes, ONE) of memory - and that's a LOT better than 20 records/sec.

And the paradigm was simple:
  • Gather
  • Rationalise
  • Process

In our day, there was inevitably some crossover between the first two steps, but these days I suspect there doesn't need to be.

If I was looking at this, my first thought would be something on the lines of:
  • Plough the data into the DB as it comes, and as fast as possible, eliminating duplicates as you can.
  • Rationalise the data for processing - which will probably include eliminating any 'leftover' duplicates, and may involve populating multiple tables.
  • Process it.

And to be honest, I wouldn't get Java involved until that third step - although I suppose it could also be involved in the "plough" phase.

That 2nd step is definitely a task for your database gurus though. This is precisely what databases were designed to do, and the more that can be done internally (ie, without any Java involvement), the better.
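As an illustration of keeping that step inside the database, here is one way the leftover duplicates could be removed in a single SQL statement, with Java doing nothing but issuing it. The staging_demographics table, the load_id surrogate key, the keep-the-lowest-id rule, and the connection details are all assumptions, and the exact DELETE syntax varies a little between databases:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InDatabaseDedupe {

    // Keeps one row per SSN (the one with the smallest surrogate id) and deletes the rest.
    // The database does all the heavy lifting; Java just fires the statement.
    private static final String DELETE_EXTRA_COPIES =
        "DELETE FROM staging_demographics " +
        "WHERE load_id NOT IN ( " +
        "    SELECT MIN(load_id) FROM staging_demographics GROUP BY ssn " +
        ")";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:yourdb://host/dbname", "user", "password");   // placeholder connection details
             Statement stmt = conn.createStatement()) {

            int removed = stmt.executeUpdate(DELETE_EXTRA_COPIES);
            System.out.println("Removed " + removed + " duplicate rows");
        }
    }
}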

I have a feeling that these days databases are simply regarded as an "extension" of Java, and that's not the case. They are enormous mills of processing power, designed specifically for arranging large volumes of data logically; so the more you can leave them alone to get on with it, the better your results are likely to be.

My two-penn'orth, for what it's worth.

Winston
     
Ranch Hand
Posts: 88
I'd recommend you look into Spring Batch. I believe Spring Batch aids in processing millions of records by using thread pools and the like. From your description of the problem, it looks to me like your use case would be a good match for Spring Batch.
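For anyone curious what that might look like, below is a rough sketch of a chunk-oriented Spring Batch job in the spirit of this suggestion, written against the Spring Batch 4.x Java-config style. The file name, the Record bean, the staging table, and the chunk size are all assumptions, and it only covers the read-and-load part, not the duplicate handling or business rules:

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class DemographicLoadJobConfig {

    // Hypothetical bean each line of the file is mapped into.
    public static class Record {
        private String ssn, firstName, lastName;
        public String getSsn() { return ssn; }
        public void setSsn(String ssn) { this.ssn = ssn; }
        public String getFirstName() { return firstName; }
        public void setFirstName(String firstName) { this.firstName = firstName; }
        public String getLastName() { return lastName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
    }

    @Bean
    public FlatFileItemReader<Record> reader() {
        BeanWrapperFieldSetMapper<Record> mapper = new BeanWrapperFieldSetMapper<>();
        mapper.setTargetType(Record.class);
        return new FlatFileItemReaderBuilder<Record>()
                .name("demographicReader")
                .resource(new FileSystemResource("demographics.txt"))   // placeholder path
                .delimited().delimiter("|")
                .names(new String[] {"ssn", "firstName", "lastName"})
                .fieldSetMapper(mapper)
                .build();
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps, DataSource dataSource) {
        return steps.get("loadStep")
                // Read, (optionally) process, and write 1000 records per transaction.
                .<Record, Record>chunk(1000)
                .reader(reader())
                .writer(new JdbcBatchItemWriterBuilder<Record>()
                        .dataSource(dataSource)
                        .sql("INSERT INTO staging_demographics (ssn, first_name, last_name) "
                           + "VALUES (:ssn, :firstName, :lastName)")
                        .beanMapped()
                        .build())
                // Multi-threaded step; note FlatFileItemReader is not thread-safe, so a real
                // job would wrap it (e.g. SynchronizedItemStreamReader) or use partitioning.
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }

    @Bean
    public Job loadJob(JobBuilderFactory jobs, Step loadStep) {
        return jobs.get("demographicLoadJob").start(loadStep).build();
    }
}

Duplicate elimination and the business rules would still plug in separately, for example as an ItemProcessor or as a follow-up database step; Spring Batch mostly supplies the chunking, restartability, and threading plumbing.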
     