I need an approach to achieve the below requirement.
I am having a file with 10 lakh lines. I need to compare each line with the remaining lines in the file and find out the lines that are similar.
In this case i need to iterate over the same file for 10 lakh times which will take lot of time.
I thought of putting all these 10 lakh lines in an array list and then iterate over this list but this will take lot of heap space and it may run out of memory.
Instead of throwing all the lines in a single Collection at once, you can group the lines by which character they start with. That will significantly reduce memory consumption.
If somehow it still isn't enough, you could group them by the first 2 characters.
Piet Verdriet wrote:
First try it and see what happens. You wouldn't be the first programmer who'd worry prematurely about a problem while a simple, and easy to implement solution would have done the trick.
Totally agree. 10 lakh is a million lines right? So if each line is 80 characters, then we are talking about 80 (or 160) megabytes of data right?
That said, create a data structure which represents a line with the following:
- A integer hash of the line (not necessarily the object's hashcode).
- Another int representing the line number.
Also, create two comparators:
- One which sorts by hash, followed by line number.
- One which sorts by line number (the list using this comparator should be much smaller!)
Once you have this, you can create a pair of lists and use the collection sort method.
There's still quite a bit more you need to do after you have your sorted lists, but I think you have enough to get started!
Joined: Oct 13, 2005
No longer a "beginner's" question. Moving.
Joined: Feb 25, 2006
chandra kambham wrote:Hi Paul,
The initial thought was to add all these lines to a database and then process them by using the DB commands.
But the problem was i have some 50-60 files each of size approximately 100 Mb and storing all these lines in DB will be a waste of DB space and also increases the load on DB server.
Well, don't you want to put your computer to work? Isn't that what a computer is supposed to do: doing a lot of things every second?
You also ignored, or not respond, to my first reply so let me ask you again: have you tried implementing your problem in a straight forward fashion? Like I said: you wouldn't be the first programmer who would look for an super-fancy-optimized solution while something less complicated may have done the trick. If you have done so and have concluded this straight forward solution was performing too poorly, I'd appreciate it if you say so: then I wouldn't be nagging you about this. ; )
Joined: Jun 09, 2008
Initially i have implemented this by throwing all the data into a single ArrayList but that was taking too much of memory and too high times of execution.
Then with the idea of having multiple arraylists based on the character (Which was given by Tom) i could implement this with less memory consumption and was able to execute the program with in 4-5 Seconds of time.