aspose file tools*
The moose likes Performance and the fly likes ArrayList without duplicates Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Spring in Action this week in the Spring forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "ArrayList without duplicates" Watch "ArrayList without duplicates" New topic
Author

ArrayList without duplicates

ryan headley
Ranch Hand

Joined: Jun 28, 2000
Posts: 156
What is the quickest way to cycle through and ArrayList to remove any duplicate entries.
I am modifying some existing code and can't really change the architecture at this point in time, so I need to simply cycle thru the list and check for dupes. If they exist, delete them.
I have done this with normal Arrays and for loops before but the ArrayList is throwing me a knot.
[ May 16, 2002: Message edited by: ryan headley ]

Ryan Headley<br /><a href="http://www.sudovi.com" target="_blank" rel="nofollow">http://www.sudovi.com</a>
Roy Ben Ami
Ranch Hand

Joined: Jan 13, 2002
Posts: 732
if you already know how to do it with arrays and have the code for that why not convery the ArrayList to an array???
use the toArray() method and you get your list as a regular array.
modify it then and return it back to a list .
sounds ok?
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by ryan headley:

I have done this with normal Arrays and for loops before but the ArrayList is throwing me a knot.

How did you do it with arrays? What problems do you encounter by translating your algorithm to a List?


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
The collections framework has some classes that make this very easy. Remember that a Set is a Collection that does not allow duplicates - so most of the owrk of detecting duplicates quickly and easily has already been done for you.

It's well worthwhile to study the collections framework classes carefully.
[ May 16, 2002: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
I haven't tried this but it seems like it should work:
HashSet h = new HashSet();
h.addAll(arrayList);
arrayList.clear();
arrayList.addAll(h);
What might prevent this from working is:
IllegalArgumentException - some aspect of an element of the specified collection prevents it from being added to this collection.
I can't try it now but I'll try it later unless someone else can try it.


Associate Instructor - Hofstra University
Amazon Top 750 reviewer - Blog - Unresolved References - Book Review Blog
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Yeah, that's another possibility. Simpler, probably faster, but it also destroys whatever order the list was in in the first place - which may or may not be a problem, depending on the application.
Note that you can combine the first two lines:
HashSet h = new HashSet(arrayList);
Looking back at my own code, I note that its performance may well be quadratic - that is, the time to execute can be proportional to the sqare of the number of elements in the array. This is because the Iterator remove() method has linear- time performance for an ArrayList, and it will probably be called a number of times proportional to the number of elements. This could lead to really ugly performance for large values of n. Better would be:

creating a second list may seem unnecessary overhead, but at least it's linear, not quadratic. (Other minor performance tweaks are available, but going from quadratic to linear is the main one I'm concerned with.)
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
HashSet h = new HashSet(arrayList);
arrayList.clear();
arrayList.addAll(h);
If the order doesn't matter then this works. The exception does not get thrown.
As a test, I created a ListArray containing exactly 1,000,000 duplicate objects. It took 235 milliseconds to run the 3 instructions above to reduce it to an ArrayList of 1 entry. It took 1828 milliseconds when there were 1,000,000 entries with no duplicates.
Roger Thornhill
Author
Greenhorn

Joined: May 15, 2002
Posts: 25
Just a note about order.
If you wanted to maintain order and use something like the HashSet approach, you could try TreeSet, which implements SortedSet and maintains order. You would then need to rebuild the original ArrayList using the Iterator returned by the TreeSet. The tradeoff, of course, is performance. My tests indicate that it is at least twice a slow as the HashSet method (nearly all of the time being consumed during instantiation of TreeSet).
But if you need order to be guaranteed, TreeSet might be a reasonably efficient compromise.
Roger Thornhill
Author
Greenhorn

Joined: May 15, 2002
Posts: 25
Did some more testing and found that the TreeSet appraoch can actually be faster than the HashSet approach - depending on to what extent the objects in the list are unique. This reminds us that the efficiency of TreeSet is understandably related to the breadth of the underlying tree.
For example, if a list of 100,000 objects contains a lot of duplicates (say there are only 100 unique objects - distributed evenly through the list), duplicate removal via TreeSet is about 30% faster. For a list of 100,000 objects, TreeSet gets slower when more than about 3500 are unique.
So, you can have it all -- order and good performance -- if your List contains a lot of duplicates.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Putting elements in a TreeSet doesn't maintain the original order - rather, it sorts the elements, which means they're put into a new order. Unless the elements were already sorted in the first place, in which case it would be even easier to remove duplicates. Depending on the nature of the application, it may well be acceptable or even desireable to sort the elements in addition to removing duplicates. But it's important to recognize that "ordered" and "sorted" are not the same thing.
Roger Thornhill
Author
Greenhorn

Joined: May 15, 2002
Posts: 25
Thanks, Jim, that is a very important point to make. I was confusing list order with sorted order in the original question. Interestingly, if you do need to sort and remove duplicates, it it turns out that TreeSet is still only worthwhile if there are many duplicates in the List. For example, if all objects are unique, it is actually faster to use a HashSet and Collections.sort() approach than it is to use TreeSet. The latter only outperforms the former in the scenario I mentioned above -- when there are many duplicates.
Roger Thornhill
Author
Greenhorn

Joined: May 15, 2002
Posts: 25
One last mildly interesting point, for those still following this thread.
It turns out that there are times when a (HashSet|TreeSet)+ArrayList approach is more efficient than a HashSet-only approach, and times when it is not. This is interesting because the former maintain order while the latter does not.
Here is some code that runs tests on most of the cases discussed above. It compares HashSet to TreeSet as well as both combined with ArrayList (to ensure list order).

The (crude) code test above takes, as arguments, the number of objects in the list and the number of unique objects (they are distributed randomly).
My sample runs looked like this:
(first arg is number of objects, second arg is number of unique objects -- distributed randomly)
% java NoDupsTest 500000 500000
% java NoDupsTest 500000 250000
etc.
and resulted in

In summary, to efficiently remove duplicate objects from an ArrayList, HashSet+ArrayList is a good general strategy and maintains list order. The main disadvantage is, however, more memory. Most interestingly (although it makes sense when you think about it some), both (HashSet|TreeSet)+AL can be superior to HashSet alone.
I'd be curious to hear if anyone finds a different result.
[ May 19, 2002: Message edited by: Greg Barish ]
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
What about a LinkedHashSet? It took about twice as long as the HashSet but it will maintain items in the order they are added.
Ritesh Agrawal
Ranch Hand

Joined: Jan 08, 2004
Posts: 74
Hi Friends,
The idea presented in this topic was really useful. Thanks for that. I was looking for something similar.
I was trying to remove duplicates (possibly triplicates too) from a ArrayList. The nunber of words is about 1 million. I got a OutOfMemory Error. Is there some implementation, which possibly be slower (speed is of not much concern), but is memory efficient.
My implementation is like this

I am getting this OutOfMemory error in Line 3 it seems.

Any help would be useful.

Thanks.


Ritesh<br /> <br />SCJP 1.4<br />IBM Test 340<br />IBM AIX V4.0 Certified Professional<br /> <br />Right actions for the future are the best apologies for wrong ones in the past.<br />- Tyron Edwards
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Ritesh,

what exactly are you trying to do? Do you need to know which ones were duplicates? Do you need to have the words in a list? Or do you just need a set of all words, without duplicates?
Ritesh Agrawal
Ranch Hand

Joined: Jan 08, 2004
Posts: 74
Hi Ilja,

Accept my apologies for late response.

I want to develop a routine which will read a file containing about a million words.Some of the words might be repeated more than 2 times. The routine is supposed to generate a output file which contain following informatioon

[word] , [number of occurences] , [line no. of first occurence , second occurence..and so on]

example
xyz,3,[3,6,7]

Thanks
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Ahh, that's a very different problem from removing duplicates from an ArrayList. To tackle this I would create a data structure that maintains this information, and update it one word at a time as you read from the file.

As a first cut, you can model an individual WordEntry (pick a better name though) asCreate a class (WordMap) that manages a Map of these objects with the word as the key and the WordEntry as the value. As you read each word, pass it to the WordMap to be added to the Map:This method would look in the Map to see if there's already an entry. If so, add lineNumber to its List. Otherwise create a new entry and put it into the Map.

If there will be a significant number of words that appear only once, you can save the cost of the List in those cases by adding an "int lineNumber" to WordEntry. New WordEntries put the line number into that variable and set the List to null. When the word is seen a second time, create a List and add the first and second line numbers to it (or just the second since you already have the first one stored).

In any case, once you finish running through the file, the WordMap is already in the correct form that you wanted. There's no need to post-process it.

[ Added: ]

Actually, you don't even need to have a count field in WordEntry asin the original design and in the second design. For a million words, that's a savings of two megabytes of RAM.
[ April 29, 2005: Message edited by: David Harkness ]
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
You possibly also don't need the word field, as that should always be the same as the key you used to look it up. So you basically need to put a List of the line numbers into a Map with the word as the key.
D Rog
Ranch Hand

Joined: Feb 07, 2004
Posts: 472

I use a similar approach for a search engine, however I stored a position in file instead of line number, because it's much simple to highlight words after.


Retire your iPod and start with HD Android music player Kamerton | Minimal J2EE container is here | Light weight full J2EE stack | and build tool | Co-author of "Windows programming in Turbo Pascal"
shriganesh kakade
Greenhorn

Joined: May 10, 2012
Posts: 23
ryan headley wrote:What is the quickest way to cycle through and ArrayList to remove any duplicate entries.
I am modifying some existing code and can't really change the architecture at this point in time, so I need to simply cycle thru the list and check for dupes. If they exist, delete them.
I have done this with normal Arrays and for loops before but the ArrayList is throwing me a knot.
[ May 16, 2002: Message edited by: ryan headley ]


i have one problem with arraylist
shriganesh kakade
Greenhorn

Joined: May 10, 2012
Posts: 23
shriganesh kakade wrote:
ryan headley wrote:What is the quickest way to cycle through and ArrayList to remove any duplicate entries.
I am modifying some existing code and can't really change the architecture at this point in time, so I need to simply cycle thru the list and check for dupes. If they exist, delete them.
I have done this with normal Arrays and for loops before but the ArrayList is throwing me a knot.
[ May 16, 2002: Message edited by: ryan headley ]


i have one problem with arraylist



i want to add employee class object to the array list
with employee education related fields
like degree name, year of passing, percentage
and all these fields should be added to arraylist
before adding first the degree name should be checked if it is already there in the list
and if not then only it should be added and if it is there then update that one with new percentage and year of passing
but degree should not be changed, there is combo box for degree
Thank you

Sudhir Ravindra
Greenhorn

Joined: May 31, 2010
Posts: 16
shriganesh kakade wrote:
shriganesh kakade wrote:
ryan headley wrote:What is the quickest way to cycle through and ArrayList to remove any duplicate entries.
I am modifying some existing code and can't really change the architecture at this point in time, so I need to simply cycle thru the list and check for dupes. If they exist, delete them.
I have done this with normal Arrays and for loops before but the ArrayList is throwing me a knot.
[ May 16, 2002: Message edited by: ryan headley ]


i have one problem with arraylist



i want to add employee class object to the array list
with employee education related fields
like degree name, year of passing, percentage
and all these fields should be added to arraylist
before adding first the degree name should be checked if it is already there in the list
and if not then only it should be added and if it is there then update that one with new percentage and year of passing
but degree should not be changed, there is combo box for degree
Thank you



Lost you totally. Can you explain the problem statement better. Posting some code on your domain and its relationships would help too.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Sudhir Ravindra wrote:Lost you totally. Can you explain the problem statement better.

@shriganesh: Can you also explain why you've resurrected a Thread that is more than 10 years old?
If you were looking to break a record, I'm afraid you haven't.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Sally william
Greenhorn

Joined: Mar 03, 2012
Posts: 15
Jim Yingst wrote:The collections framework has some classes that make this very easy. Remember that a Set is a Collection that does not allow duplicates - so most of the owrk of detecting duplicates quickly and easily has already been done for you.

It's well worthwhile to study the collections framework classes carefully.
[ May 16, 2002: Message edited by: Jim Yingst ]


This is what i was looking for...thanks !1
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: ArrayList without duplicates