
Processing large files

Meghana Reddy
Ranch Hand

Joined: Jan 29, 2002
Posts: 76
Hi

We have a requirement here where we have to process a huge demographic file (millions of records, possibly 4-5 GB in size).

This file could contain duplicate records that should be eliminated, and after that we apply some business rules (developed in Java, now considering implementing a rules engine) before populating all those records in a DB.

Right now we are doing everything sequentially and are able to process only 20 records per second, which does not meet our SLA, so
I'm looking for opportunities/ideas to improve the processing speed.

So I'm thinking of separating out the tasks involved in processing this file and seeing which of them can be executed in parallel.
I've read about the Map/Reduce approach - is this use case a good candidate for Map/Reduce?
What is the best approach to eliminate the duplicates from such a large data set?
Any other thoughts?


Meghana Reddy
ejaz khan
Greenhorn

Joined: Apr 08, 2013
Posts: 5
I would suggest that before you start processing, you open the file in EditPlus 3.41 and go to Edit --> Delete --> Delete Duplicate Lines.
It will help you remove all the duplicate records quickly.
ejaz khan
Greenhorn

Joined: Apr 08, 2013
Posts: 5
Another point: do not let your Java program do the duplicate hunting; instead, create unique/primary key constraints on the DBMS and let the DBMS reject your duplicate records.
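For example (just a sketch, not tested; the table, columns and connection string are made up), with a UNIQUE constraint on SSN the insert loop only has to catch the violation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class DbRejectsDuplicates {
    public static void main(String[] args) throws Exception {
        // Assumed schema: PERSON(SSN UNIQUE, NAME); connection details are placeholders.
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/yourdb", "user", "pass");
             PreparedStatement ps = con.prepareStatement("INSERT INTO PERSON (SSN, NAME) VALUES (?, ?)")) {
            String[][] records = { {"111-22-3333", "Alice"}, {"111-22-3333", "Alice again"} };
            for (String[] rec : records) {
                ps.setString(1, rec[0]);
                ps.setString(2, rec[1]);
                try {
                    ps.executeUpdate();
                } catch (SQLIntegrityConstraintViolationException dup) {
                    // Most drivers throw this subclass when the unique key is violated:
                    // the DBMS has rejected the duplicate, so just log it or write it to a reject file.
                    System.out.println("Rejected duplicate SSN: " + rec[0]);
                }
            }
        }
    }
}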
Secondly, if the records have no dependencies between them, you can also use the UNIX split technique. That way you will have multiple smaller files to process.
You can then use multiple threads to read the split files in parallel, which will increase your read efficiency.
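A rough sketch of the threading part, assuming the big file has already been split with the UNIX split command into files named chunk_* and that processLine() holds your own logic:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ParallelChunkReader {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        // One task per file produced by: split -l 1000000 demographics.txt chunk_
        try (DirectoryStream<Path> chunks = Files.newDirectoryStream(Paths.get("."), "chunk_*")) {
            for (Path chunk : chunks) {
                pool.submit(() -> {
                    try (Stream<String> lines = Files.lines(chunk)) {
                        lines.forEach(ParallelChunkReader::processLine);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void processLine(String line) {
        // Apply the business rules / DB insert for one record here.
    }
}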
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6661
    

You can use Map-Reduce to process each line of the file, yes, but this sounds more like a job for an ETL tool to me: extract the contents of the text file -> transform the values and eliminate duplicates -> load to DB.
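If you want to prototype that pipeline in plain Java before picking a tool, it could look something like this (only an illustration - the record layout and the dedupe key are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class TinyEtl {
    public static void main(String[] args) throws IOException {
        Set<String> seenKeys = new HashSet<>();                        // used by the Transform step
        try (Stream<String> lines = Files.lines(Paths.get("demographics.txt"))) {   // Extract
            lines.map(TinyEtl::transform)                              // Transform: clean up the record
                 .filter(record -> seenKeys.add(keyOf(record)))        // Transform: drop duplicate keys
                 .forEach(TinyEtl::load);                              // Load: hand off to the DB writer
        }
    }

    private static String transform(String rawLine) {
        return rawLine.trim();            // stand-in for the real business rules
    }

    private static String keyOf(String record) {
        return record.split(",")[0];      // assumption: the first field is the dedupe key (e.g. SSN)
    }

    private static void load(String record) {
        // Batch these up and insert into the DB in real code.
    }
}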

Of those millions of rows, how many duplicates can be expected?


SCJP 6 articles - SCJP 5/6 mock exams - More SCJP Mocks
Meghana Reddy
Ranch Hand

Joined: Jan 29, 2002
Posts: 76
Thank you for your responses:

ejaz khan wrote: I would suggest that before you start processing, you open the file in EditPlus 3.41 and go to Edit --> Delete --> Delete Duplicate Lines.
It will help you remove all the duplicate records quickly.


This is not an option since this is not a one-time activity. It needs to be automated.

ejaz khan wrote: Another point: do not let your Java program do the duplicate hunting; instead, create unique/primary key constraints on the DBMS and let the DBMS reject your duplicate records.
Secondly, if the records have no dependencies between them, you can also use the UNIX split technique. That way you will have multiple smaller files to process.
You can then use multiple threads to read the split files in parallel, which will increase your read efficiency.


This is another option I'm considering. But we don't import directly into the transactional database in the first shot.
We import this file into a temp table first and then start processing.
Since this is a demographic file, we luckily have the SSN in each record, which we can use as a unique key on the temp table. But the problem is that we need to know the exact values of each duplicated row, so that we can send them back in a rejected file to indicate which records were rejected and why!
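Roughly what I'm picturing for building that rejected file is something like this (a sketch only; I'm assuming the SSN is the first comma-separated field of each line):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class RejectDuplicates {
    public static void main(String[] args) throws IOException {
        Set<String> seenSsns = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("demographics.txt"));
             BufferedWriter rejects = Files.newBufferedWriter(Paths.get("rejected.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String ssn = line.split(",")[0];               // assumption: SSN is the first field
                if (!seenSsns.add(ssn)) {
                    // Keep the whole original row so the reject file shows exactly what was rejected and why.
                    rejects.write(line + ",REJECTED: duplicate SSN");
                    rejects.newLine();
                } else {
                    // pass the record on to the business rules / temp-table load
                }
            }
        }
    }
}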

Deepak Bala wrote: ...but this sounds more like a job for an ETL tool to me: extract the contents of the text file -> transform the values and eliminate duplicates -> load to DB.

We don't have any ETL tools or ETL expertise on our team, so we may need to get someone who knows ETL to help with this.

Other than this, would some sort of sorting (say, a merge sort) help identify the duplicates? I'm trying to see whether someone has solved this problem in Java before (however tedious, I'm pretty sure someone has) and what their experiences were.
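What I mean is something like this: if the file were first sorted by SSN (say with the UNIX sort command), the duplicates become adjacent and one extra pass finds them (a rough sketch, again assuming the SSN is the first field):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedDuplicateScan {
    public static void main(String[] args) throws IOException {
        // Assumes the file was pre-sorted by SSN, e.g.: sort -t, -k1,1 demographics.txt > sorted.txt
        try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted.txt"))) {
            String previousSsn = null;
            String line;
            while ((line = in.readLine()) != null) {
                String ssn = line.split(",")[0];
                if (ssn.equals(previousSsn)) {
                    System.out.println("Duplicate: " + line);   // route to the reject file instead
                } else {
                    // process / load this record
                }
                previousSsn = ssn;
            }
        }
    }
}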

Thanks again guys,
Meghana

William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12759
    
Yow, that's a big job for sure. How about this:

1. First pass: detect and make a list of the SSNs that have duplicates. If it will fit in memory, this could be a Set of the SSNs - which will tell you when a key is duplicated.

2. Second pass: detect the duplicate SSNs, process the non-duplicate records, and kick out the duplicates for review.
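In outline, something like this (just a sketch; I'm assuming one record per line with the SSN as the first comma-separated field):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class TwoPassDedupe {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("demographics.txt");
        Set<String> seen = new HashSet<>();
        Set<String> duplicated = new HashSet<>();

        // Pass 1: remember every SSN that occurs more than once.
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!seen.add(ssnOf(line))) {
                    duplicated.add(ssnOf(line));
                }
            }
        }

        // Pass 2: process the clean records, kick out every record with a duplicated SSN for review.
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (duplicated.contains(ssnOf(line))) {
                    // write to the review / reject file
                } else {
                    // apply the business rules and load
                }
            }
        }
    }

    private static String ssnOf(String line) {
        return line.split(",")[0];   // assumption: SSN is the first comma-separated field
    }
}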

Interesting problem, let us know what you end up with.

Bill
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7515
    

Meghana Reddy wrote: We have a requirement here where we have to process a huge demographic file (millions of records, possibly 4-5 GB in size).
This file could contain duplicate records that should be eliminated, and after that we apply some business rules (developed in Java, now considering implementing a rules engine) before populating all those records in a DB.

At the risk of repeating other advice, my first thought is that you shouldn't worry about duplicates - let the DB handle that. Unless you want to pre-process with something like a shell sort, the chances are that you ain't gonna be able to do that outside the database.

My second is that insufficient thought has gone into the processes that create this data.

I used to work at the UK Census Office back in the days of magnetic tapes and, believe me, probably 50% of the IT man-hours were spent on getting that data correct. We managed to get most census runs done in around 18 hours (≈55 million records in those days) on a machine that had 1Mb (yes, ONE) of memory - and that's a LOT better than 20 records/sec.

And the paradigm was simple:
  • Gather
  • Rationalise
  • Process

In our day, there was inevitably some crossover between the first two steps, but these days I suspect there doesn't need to be.

If I was looking at this, my first thought would be something on the lines of:
  • Plough the data into the DB as it comes, and as fast as possible, eliminating duplicates as you can (see the sketch below).
  • Rationalize the data for processing - which will probably include eliminating any 'leftover' duplicates, and may involve populating multiple tables.
  • Process it.

And to be honest, I wouldn't get Java involved until that third step - although I suppose it could also be involved in the "plough" phase.
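For that "plough" step, a plain JDBC batch insert into a staging table is probably all the Java you need - a sketch only, with the table, columns and connection details invented:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PloughIntoStaging {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/yourdb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO STAGING_PERSON (SSN, RAW_RECORD) VALUES (?, ?)");
             BufferedReader in = Files.newBufferedReader(Paths.get("demographics.txt"))) {
            con.setAutoCommit(false);
            String line;
            int batched = 0;
            while ((line = in.readLine()) != null) {
                ps.setString(1, line.split(",")[0]);   // assumption: SSN is the first field
                ps.setString(2, line);                 // keep the raw row for the rationalisation step
                ps.addBatch();
                if (++batched % 10_000 == 0) {         // send chunks rather than one row at a time
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch();
            con.commit();
        }
    }
}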

That 2nd step is definitely a task for your database gurus though. This is precisely what databases were designed to do, and the more that can be done internally (ie, without any Java involvement) the better.

I have a feeling that these days databases are simply regarded as an "extension" of Java, and that's not the case. They are enormous mills of processing power, designed specifically for arranging large volumes of data logically; so the more you can leave them alone to get on with it, the better your results are likely to be.

My two-penn'orth, for what it's worth.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Sunderam Goplalan
Ranch Hand

Joined: Oct 10, 2011
Posts: 73
I'd recommend you look into Spring Batch. I believe Spring Batch aids in processing millions of records by using thread pools and the like. From your description of the problem, it looks to me that your use case would be a good match for Spring Batch.
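For instance, the de-duplication could be a simple ItemProcessor in a Spring Batch step - a sketch only, with a made-up PersonRecord type; returning null from process() tells Spring Batch to filter the item out of the chunk:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.batch.item.ItemProcessor;

// Hypothetical record type that the step's reader would produce.
class PersonRecord {
    private String ssn;
    public String getSsn() { return ssn; }
    public void setSsn(String ssn) { this.ssn = ssn; }
}

public class DedupeProcessor implements ItemProcessor<PersonRecord, PersonRecord> {

    // Thread-safe so the step can run with multiple threads.
    private final Set<String> seenSsns = ConcurrentHashMap.newKeySet();

    @Override
    public PersonRecord process(PersonRecord item) {
        // Returning null filters the item out, so duplicates never reach the ItemWriter.
        return seenSsns.add(item.getSsn()) ? item : null;
    }
}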


SCJP 5.0, SCEA Java EE 5
     