
Need an approach to reduce the number of iterations

 
chandra kambham
Ranch Hand
Posts: 74
Hi All,

I need an approach to achieve the requirement below.

I have a file with 10 lakh lines. I need to compare each line with the remaining lines in the file and find the lines that are similar.
That means iterating over the same file 10 lakh times, which will take a lot of time.
I thought of putting all 10 lakh lines in an ArrayList and then iterating over that list, but that will take a lot of heap space and it may run out of memory.

Is there any better approach to achieve this?

Please reply with your suggestions,

Thanks in Advance,
K.Chandra Sekhar
 
Piet Verdriet
Ranch Hand
Posts: 266
chandra kambham wrote:...
and it may run out of memory.
...


Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer to worry prematurely about a problem when a simple, easy-to-implement solution would have done the trick.
 
Campbell Ritchie
Sheriff
Posts: 48645
Do you mean similar or the same?
Consider other types of Collection, which can record mappings (a Map) or discard duplicates (a Set).
 
chandra kambham
Ranch Hand
Posts: 74
Hi Campbell,

I want the lines with the same content. I tried to implement this using an ArrayList, and when I ran the program the CPU utilization was at its peak and the system hung for some time...
 
Tom Tillinghast
Greenhorn
Posts: 4
chandra,
Instead of throwing all the lines into a single Collection at once, you can group the lines by the character they start with and process one group at a time. That will significantly reduce memory consumption.
If somehow that still isn't enough, you could group them by the first 2 characters.
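For example, something like this rough sketch, which makes one pass per starting character so that only one group is in memory at a time (the file name lines.txt is just a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DuplicatesByFirstChar {

    public static void main(String[] args) throws IOException {
        // First pass: collect the distinct starting characters.
        Set<Character> firstChars = new HashSet<Character>();
        BufferedReader in = new BufferedReader(new FileReader("lines.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() > 0) {
                firstChars.add(line.charAt(0));
            }
        }
        in.close();

        // One more pass per group: only the lines starting with c
        // are held in memory at any one time.
        for (char c : firstChars) {
            Set<String> group = new HashSet<String>();
            in = new BufferedReader(new FileReader("lines.txt"));
            while ((line = in.readLine()) != null) {
                if (line.length() > 0 && line.charAt(0) == c && !group.add(line)) {
                    System.out.println("Duplicate: " + line);
                }
            }
            in.close();
        }
    }
}

This trades extra file reads for a smaller working set; with a million 80-character lines, each group should stay comfortably small.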
 
Henry Wong
author
Marshal
Posts: 20991
Piet Verdriet wrote:
Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer to worry prematurely about a problem when a simple, easy-to-implement solution would have done the trick.


Totally agree. 10 lakh is a million lines, right? So if each line is 80 characters, we are talking about 80 megabytes of data on disk, or roughly 160 megabytes as Java Strings (each char takes two bytes), right?

Henry
 
Anu Pillai
Greenhorn
Posts: 28
Why don't you try adding each line you read into a Set? If the line is a duplicate entry, the add operation will return false, so you will get all the duplicate entries as well.
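Something like this minimal sketch (the file name lines.txt is just a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateFinder {

    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader("lines.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            if (!seen.add(line)) {          // add() returns false for a repeat
                System.out.println("Duplicate: " + line);
            }
        }
        in.close();
    }
}

Note that this keeps every distinct line in memory at once, so it only works if the whole file fits in the heap.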
 
Paul Clapham
Sheriff
Posts: 20945
I would just throw all of the lines into a database table with one column, then let the database do the heavy-duty work it is designed to do.
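Once the rows are loaded, a single GROUP BY query finds the duplicates. A rough sketch, assuming a one-column table LINES(LINE) has already been bulk-loaded, and with a placeholder MySQL URL and credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuplicatesViaDb {

    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password");
        Statement st = con.createStatement();
        // The database does the heavy lifting: group identical lines
        // and keep only the ones that occur more than once.
        ResultSet rs = st.executeQuery(
                "SELECT LINE, COUNT(*) AS N FROM LINES"
                + " GROUP BY LINE HAVING COUNT(*) > 1");
        while (rs.next()) {
            System.out.println(rs.getString("LINE")
                    + " occurs " + rs.getInt("N") + " times");
        }
        rs.close();
        st.close();
        con.close();
    }
}

Loading could be done with the database's bulk loader (e.g. LOAD DATA INFILE on MySQL) rather than a million individual INSERT statements.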
 
chandra kambham
Ranch Hand
Posts: 74
Yes Henry,

The file size is about 100 MB.
 
chandra kambham
Ranch Hand
Posts: 74
Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.
 
John Kimball
Ranch Hand
Posts: 96
First, you have to define "similar".

That said, create a data structure which represents a line with the following:
- An integer hash of the line (not necessarily the object's hashCode).
- Another int representing the line number.

Also, create two comparators:
- One which sorts by hash, followed by line number.
- One which sorts by line number (the list using this comparator should be much smaller!)

Once you have this, you can create a pair of lists and use the collection sort method.

There's still quite a bit more you need to do after you have your sorted lists, but I think you have enough to get started!
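A rough sketch of that line structure and the two comparators (the class and method names here are my own, and after sorting you still have to re-read the file to rule out hash collisions):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineKey {
    final int hash;        // hash of the line's content, e.g. line.hashCode()
    final int lineNumber;  // position of the line in the file

    LineKey(int hash, int lineNumber) {
        this.hash = hash;
        this.lineNumber = lineNumber;
    }

    // Sorts by hash first, then by line number.
    static final Comparator<LineKey> BY_HASH = new Comparator<LineKey>() {
        public int compare(LineKey a, LineKey b) {
            if (a.hash != b.hash) {
                return a.hash < b.hash ? -1 : 1;
            }
            return a.lineNumber - b.lineNumber;
        }
    };

    // Sorts by line number only; meant for the much smaller candidate list.
    static final Comparator<LineKey> BY_LINE = new Comparator<LineKey>() {
        public int compare(LineKey a, LineKey b) {
            return a.lineNumber - b.lineNumber;
        }
    };

    // Given a list sorted with BY_HASH, keep the entries whose hash also
    // appears on a neighbour: these are the duplicate candidates.
    static List<LineKey> candidates(List<LineKey> sortedByHash) {
        List<LineKey> result = new ArrayList<LineKey>();
        for (int i = 0; i < sortedByHash.size(); i++) {
            boolean sameAsPrev = i > 0
                    && sortedByHash.get(i - 1).hash == sortedByHash.get(i).hash;
            boolean sameAsNext = i < sortedByHash.size() - 1
                    && sortedByHash.get(i + 1).hash == sortedByHash.get(i).hash;
            if (sameAsPrev || sameAsNext) {
                result.add(sortedByHash.get(i));
            }
        }
        return result;
    }
}

You would fill a List<LineKey> while reading the file (new LineKey(line.hashCode(), lineNumber)), sort it with Collections.sort(keys, LineKey.BY_HASH), then re-sort the much smaller candidate list with BY_LINE to check and report the duplicates in file order.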
 
Campbell Ritchie
Sheriff
Posts: 48645
No longer a "beginner's" question. Moving.
 
Piet Verdriet
Ranch Hand
Posts: 266
chandra kambham wrote:Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.


Well, don't you want to put your computer to work? Isn't that what a computer is for: doing a lot of things every second?
You also ignored, or didn't respond to, my first reply, so let me ask you again: have you tried implementing your solution in a straightforward fashion? Like I said, you wouldn't be the first programmer to look for a super-fancy optimized solution when something less complicated might have done the trick. If you have tried it and concluded that the straightforward solution performed too poorly, I'd appreciate it if you said so: then I wouldn't be nagging you about this. ; )
 
chandra kambham
Ranch Hand
Posts: 74
Hi Piet,

Initially I implemented this by throwing all the data into a single ArrayList, but that took too much memory and too long to execute.
Then, with Tom's idea of using multiple lists grouped by the first character, I was able to implement it with less memory consumption and run the program within 4-5 seconds.

Thanks all for your valuable suggestions.
 