Need an approach to reduce the number of iterations

chandra kambham
Ranch Hand

Joined: Jun 09, 2008
Posts: 74
Hi All,

I need an approach to achieve the requirement below.

I have a file with 10 lakh (1 million) lines. I need to compare each line with the remaining lines in the file and find the lines that are similar.
That way I would need to iterate over the same file 10 lakh times, which will take a lot of time.
I thought of putting all 10 lakh lines into an ArrayList and then iterating over the list, but that will take a lot of heap space and may run out of memory.

Is there any better approach to achieve this?

Please reply with your suggestions.

Thanks in advance,
K.Chandra Sekhar
Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
chandra kambham wrote:...
and it may run out of memory.
...


Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer who'd worry prematurely about a problem while a simple, and easy to implement solution would have done the trick.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39393
Do you mean similar or the same?
Consider other types of Collections, which might be able to record mappings or remove duplicates.
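For instance, a Map can count how many times each line occurs. A minimal sketch, assuming exact-match duplicates and that the whole file fits in memory (the file name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DuplicateCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("lines.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            Integer seen = counts.get(line);
            counts.put(line, seen == null ? 1 : seen + 1);
        }
        in.close();
        // Any line that occurs more than once is a duplicate.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                System.out.println(e.getValue() + "x: " + e.getKey());
            }
        }
    }
}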
chandra kambham
Ranch Hand

Joined: Jun 09, 2008
Posts: 74
Hi Campbell,

I want the lines with the same content. I tried to implement this using an ArrayList, and when I ran the program the CPU utilization was at its peak and the system hung for a while...
Tom Tillinghast
Greenhorn

Joined: Apr 29, 2006
Posts: 4
chandra,
Instead of throwing all the lines into a single Collection at once, you can group the lines by the character they start with. That will significantly reduce memory consumption.
If that somehow still isn't enough, you could group them by the first two characters.
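A rough sketch of that idea: one pass over the file per starting character, so only one group's lines are held in memory at a time (the file name and the 'a'..'z' range are assumptions about the data):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class GroupedDuplicateFinder {
    public static void main(String[] args) throws IOException {
        for (char first = 'a'; first <= 'z'; first++) {
            // Only the lines starting with 'first' are kept in memory.
            Set<String> seen = new HashSet<String>();
            BufferedReader in = new BufferedReader(new FileReader("lines.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() > 0 && line.charAt(0) == first) {
                    if (!seen.add(line)) {
                        System.out.println("Duplicate: " + line);
                    }
                }
            }
            in.close();
        }
    }
}

Note this trades memory for extra passes; alternatively, one pass could split the input into a temporary file per starting character, and each temporary file could then be processed on its own.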
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18896

Piet Verdriet wrote:
Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer to worry prematurely about a problem when a simple, easy-to-implement solution would have done the trick.


Totally agree. 10 lakh is a million lines, right? So if each line is 80 characters, we are talking about 80 megabytes of data (or 160 MB once held in memory as two-byte Java chars), right?

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Anu Pillai
Greenhorn

Joined: Oct 09, 2006
Posts: 28
Why don't you try adding each line you read into a Set? If a line is a duplicate entry, the add operation will return false, so you will catch all the duplicate entries as well.
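That would look something like this (a minimal sketch, assuming "similar" means exactly equal; the file name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SetDuplicates {
    public static void main(String[] args) throws IOException {
        Set<String> unique = new HashSet<String>();
        Set<String> duplicates = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader("lines.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            // add() returns false when the set already contains the line.
            if (!unique.add(line)) {
                duplicates.add(line);
            }
        }
        in.close();
        System.out.println(duplicates.size() + " distinct lines occur more than once");
    }
}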
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18651

I would just throw all of the lines into a database table with one column, then let the database do the heavy-duty work it is designed to do.
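For example (a sketch only: the in-memory HSQLDB URL, driver class, and table layout are assumptions, and loading the rows is elided), SQL can group the lines and report the duplicates:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbDuplicates {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");
        Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:lines", "sa", "");
        Statement st = con.createStatement();
        st.execute("CREATE TABLE file_lines (content VARCHAR(1000))");
        // ... insert one row per line of each file here ...
        ResultSet rs = st.executeQuery(
                "SELECT content, COUNT(*) AS n FROM file_lines"
                + " GROUP BY content HAVING COUNT(*) > 1");
        while (rs.next()) {
            System.out.println(rs.getInt("n") + "x: " + rs.getString("content"));
        }
        con.close();
    }
}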
chandra kambham
Ranch Hand

Joined: Jun 09, 2008
Posts: 74
Yes Henry,

The file size is about 100 MB.
chandra kambham
Ranch Hand

Joined: Jun 09, 2008
Posts: 74
Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.
John Kimball
Ranch Hand

Joined: Apr 13, 2009
Posts: 96
First, you have to define "similar".

That said, create a data structure which represents a line with the following:
- An integer hash of the line (not necessarily the object's hashCode).
- Another int representing the line number.

Also, create two comparators:
- One which sorts by hash, then by line number.
- One which sorts by line number (the list using this comparator should be much smaller!).

Once you have this, you can create a pair of lists and use the Collections sort method, as sketched below.

There's still quite a bit more you need to do after you have your sorted lists, but I think you have enough to get started!
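One possible reading of that, as a sketch (the class and field names are made up, and the final content check after sorting is only outlined):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LineKey {
    final int hash;        // hash of the line's content, not Object.hashCode()
    final int lineNumber;

    LineKey(int hash, int lineNumber) {
        this.hash = hash;
        this.lineNumber = lineNumber;
    }

    // Sorts by hash, then by line number: equal hashes end up adjacent.
    static final Comparator<LineKey> BY_HASH = new Comparator<LineKey>() {
        public int compare(LineKey a, LineKey b) {
            if (a.hash != b.hash) return a.hash < b.hash ? -1 : 1;
            return a.lineNumber - b.lineNumber;
        }
    };

    // Sorts by line number alone, for the (much smaller) list of suspects.
    static final Comparator<LineKey> BY_LINE = new Comparator<LineKey>() {
        public int compare(LineKey a, LineKey b) {
            return a.lineNumber - b.lineNumber;
        }
    };

    public static void main(String[] args) {
        List<LineKey> keys = new ArrayList<LineKey>();
        keys.add(new LineKey("foo".hashCode(), 1));
        keys.add(new LineKey("bar".hashCode(), 2));
        keys.add(new LineKey("foo".hashCode(), 3));
        Collections.sort(keys, BY_HASH);
        // Adjacent entries with equal hashes are duplicate candidates;
        // re-read those line numbers to confirm the content really matches.
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i).hash == keys.get(i - 1).hash) {
                System.out.println("Candidate pair: lines "
                        + keys.get(i - 1).lineNumber + " and " + keys.get(i).lineNumber);
            }
        }
    }
}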
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39393
No longer a "beginner's" question. Moving.
Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
chandra kambham wrote:Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.


Well, don't you want to put your computer to work? Isn't that what a computer is for: doing a lot of things every second?
You also ignored, or didn't respond to, my first reply, so let me ask you again: have you tried implementing your problem in a straightforward fashion? As I said, you wouldn't be the first programmer to look for a super-fancy optimized solution when something less complicated might have done the trick. If you have tried it and concluded that the straightforward solution performs too poorly, I'd appreciate it if you said so; then I wouldn't keep nagging you about it. ; )
chandra kambham
Ranch Hand

Joined: Jun 09, 2008
Posts: 74
Hi Piet,

Initially I implemented this by throwing all the data into a single ArrayList, but that took too much memory and too long to run.
Then, using the idea of multiple lists grouped by starting character (which Tom suggested), I implemented it with less memory consumption, and the program now finishes within 4-5 seconds.

Thanks all for your valuable suggestions.