File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Java in General and the fly likes Duplicate objects ending up in a TreeSet Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "Duplicate objects ending up in a TreeSet" Watch "Duplicate objects ending up in a TreeSet" New topic
Author

Duplicate objects ending up in a TreeSet

Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Hi,

I have a TreeSet into which I am adding objects.

The objects being added are my own class and that class implements Comparable and defines the compareTo(Object a) method.

Each object also defines it's own hashcode based on a couple of properties that it is initialized with.

In the compareTo method, the first thing checked is to see if the hashcode values are the same. If they are, return 0. This should prevent duplicate objects from getting into the TreeSet shouldn't it?

However, it does not.

In my test case, I have 10 objects. 2 x 5 unique objects. As each is initialized and added to the TreeSet, duplicates should be tossed and the final size of the TreeSet should be 5 objects. However, for some reason it is actually duplicating 2 of the objects and the final size is 7.

I set debug output at the time the object is being added and I can see the identical hashcodes so I am not sure why two objects with identical hashcodes would get added into the TreeSet.

Any idea why this would happen?

Thanks in advance.
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
additional info.. when each object is added, it does not appear to be comparing itself to every other object currently in the TreeSet.
Raymond Tong
Ranch Hand

Joined: Aug 15, 2010
Posts: 230
    
    2

Comparable or Comparator is for ordering.
Override equals() AND hashCode() is for uniqueness.

You need to override equals() method as well.
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Raymond Tong wrote:Comparable or Comparator is for ordering.
Override equals() AND hashCode() is for uniqueness.

You need to override equals() method as well.


I'll check that. I think I am overriding both already.. (I know I am overriding hashCode())
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
  10
Actually, in a TreeMap or TreeSet neither hashCode() or equals() are used at all. Equality is determined from compareTo() instead. Even if that result ends up inconsistent with what you would expect from a HashMap.

Rob, can you show us your compareTo() method? And for comparison, can you show how you've implemented hashCode() and equals()?
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693
    
  20

Rob MacKay wrote:In the compareTo method, the first thing checked is to see if the hashcode values are the same. If they are, return 0. This should prevent duplicate objects from getting into the TreeSet shouldn't it?

Equal hash codes does per se not mean equal objects. There are only 2^32 possible hash code values, so at some time two unequal objects may end up getting the same hash code.


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions How To Answer Questions
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
So I've been debugging this on and off for a few days and it seems that for some reason, no matter how many objects are currently in my TreeSet, when a new object is being added, it doesn't appear that compareTo is being called against every other object in the TreeSet. It's my understanding that when adding a new object in the TreeSet, it should be compared against every other object to ensure that it is not already present.

I am sending a basic "calling compareTo" message to the output window each time the compareTo method is called, it is also showing me the comparison data being used to determine if the objects are equal. If the TreeSet contains 10 objects and I am adding an 11th, I would expect to see 10 calls to compareTo. Correct?


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    
    8

Is this a TreeSet<E>? And if so does your class implement Comparable<E> by having a compareTo(E) method?
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
  10
Rob MacKay wrote:So I've been debugging this on and off for a few days and it seems that for some reason, no matter how many objects are currently in my TreeSet, when a new object is being added, it doesn't appear that compareTo is being called against every other object in the TreeSet. It's my understanding that when adding a new object in the TreeSet, it should be compared against every other object to ensure that it is not already present.

No, not at all. Since the data in the TreeSet is sorted already, every time you add a new object, it only calls the compareTo() enough times to find the appropriate spot in the TreeSet to add the new object, and to check if there is already an equal object at that particular spot. No need to compare against all the other object that are nowhere nearby.

As an example, if your TreeSet contains Strings for each letter of the alphabet, like "A", "B", "C", ... "Z", and you want to add the String "Rob", then first the TreeSet will compare "Rob" to something in the middle of the sorted set - let's say "M". It will decide that "Rob" comes after "M", and thus there is no reason to compare against any of the other elements before "M". Next it will compare against something in the middle of the remaining range - let's say "T" - and decide that "Rob" is before "T". So now there's no reason to compare to any other element after "T". The TreeSet has determined that "Rob" is somewhere after "M" and before "T". It will perform several more comparisons in this manner, narrowing the range each time, until it decides that "R" and "S" are the elements that "Rob" should be in between, and neither of those elements are actually equal to "Rob".
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Paul Clapham wrote:Is this a TreeSet<E>? And if so does your class implement Comparable<E> by having a compareTo(E) method?


Yes
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Mike,

This makes perfect sense but how do you prevent the entire set from having any duplicated objects? I assume it has to do with the way that the compareTo() method is being implemented.

In my case, the objects are equal if their hashCode() method returns equal values. (I've guaranteed that every object will have a unique hash code) The hash code values are evaluated first, before proceeding further in the compareTo method. However, if the objects are not equal, then a different variable in my object (an integer called priority) is used. The lower the number, the higher it's priority in the TreeSet. Several objects can have the same value for priority.

Mike Simmons wrote:
No, not at all. Since the data in the TreeSet is sorted already, every time you add a new object, it only calls the compareTo() enough times to find the appropriate spot in the TreeSet to add the new object, and to check if there is already an equal object at that particular spot. No need to compare against all the other object that are nowhere nearby.

As an example, if your TreeSet contains Strings for each letter of the alphabet, like "A", "B", "C", ... "Z", and you want to add the String "Rob", then first the TreeSet will compare "Rob" to something in the middle of the sorted set - let's say "M". It will decide that "Rob" comes after "M", and thus there is no reason to compare against any of the other elements before "M". Next it will compare against something in the middle of the remaining range - let's say "T" - and decide that "Rob" is before "T". So now there's no reason to compare to any other element after "T". The TreeSet has determined that "Rob" is somewhere after "M" and before "T". It will perform several more comparisons in this manner, narrowing the range each time, until it decides that "R" and "S" are the elements that "Rob" should be in between, and neither of those elements are actually equal to "Rob".
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
  10
Rob MacKay wrote:This makes perfect sense but how do you prevent the entire set from having any duplicated objects? I assume it has to do with the way that the compareTo() method is being implemented.

Yes, it's entirely determined by how you define compareTo() - that defines what is a "duplicate" as far as the TreeSet is concerned.

Rob MacKay wrote:In my case, the objects are equal if their hashCode() method returns equal values. (I've guaranteed that every object will have a unique hash code) The hash code values are evaluated first, before proceeding further in the compareTo method. However, if the objects are not equal, then a different variable in my object (an integer called priority) is used. The lower the number, the higher it's priority in the TreeSet. Several objects can have the same value for priority.

So, what happens if you have two objects with the same priority? What value is returned from the compareTo() method? If you return zero, then you are telling the TreeSet that those objects are considered duplicates of each other - and only one of those objects will be retained; the other will be dropped from the TreeSet. If you don't want them to be considered duplicates, you need to not return zero from compareTo(). If priorities are equal, compare using something else. Perhaps you might compare using the hashCode(), if there's nothing else you'd like to sort on.

By the way, you might possibly consider using a java.util.PriorityQueue instead. This allows you to sort objects according to a priority, but does not require that the priority be unique. Seems like it might be a quick shortcut to what you're trying to achieve.
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
I'll take a look at PriorityQueue.

For now,

In my compareTo method, it is basically like this:

if(this.hashCode() == obj.hashCode()) return 0;

then check the priority

if (this.priority < obj.priority) return -1;
else return 1;


So my assumption is that the hashCode is checked first but based on your previous description, that still won't guarantee that duplicates will get tossed.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
  10
I'm not sure if this is the cause, but it looks like your compareTo() would violate the following requirement from the API: "The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y." As a quick fix, try this for the last part:
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Mike Simmons wrote:I'm not sure if this is the cause, but it looks like your compareTo() would violate the following requirement from the API: "The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y." As a quick fix, try this for the last part:


I'll give that a try.. Thanks
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Appears to work with a small number of objects. I'll have to scale it up to a large number of objects.

The most common use case will be that nearly all objects have the same priority (order is irrelevant) but that some of them will be duplicates (same hash code).
Martin Vajsar
Sheriff

Joined: Aug 22, 2010
Posts: 3610
    
  60

Rob MacKay wrote:So my assumption is that the hashCode is checked first but based on your previous description, that still won't guarantee that duplicates will get tossed.

No, its the other way around. If you consider two objects equal based on hashcode (as you do in the last version of your code), two distinct objects can be evaluated equal though they are not. This will in general cause the TreeSet not to accept an object that has not been actually added yet. Anyway, get rid of the hashcode in your compareTo() method altogether. It does not belong there.

It seems that your problems are caused by the compareTo method, but just in case: are your objects immutable? If you change the value of an object after inserting it into the map or set (in a way that affects the compareTo() method, or in case of Hash... variants, the hashcode() and equals() methods), all sorts of strange behaviour may occur. Most pertinently to your case, the set might seem to be accepting items that it already contains.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38818
    
  23
Rob MacKay wrote: . . . I've guaranteed that every object will have a unique hash code . . .
How? Are you absolutely certain there will be no duplicates?
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
How? Are you absolutely certain there will be no duplicates?


Once the object is initially created, the values used to determine it's uniqueness never change. The object itself has many variables but only two of them, a String id and a String version, are used to create a unique hashcode. The hashcode is created using the code from the HashCodeUtil example in the Essential Java book.

At the time the objects are initialized, there certainly can be duplicates. I just need to prevent the duplicates from ending up in the TreeSet.
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Mike Simmons wrote:I'm not sure if this is the cause, but it looks like your compareTo() would violate the following requirement from the API: "The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y." As a quick fix, try this for the last part:


Mike, I scaled this up to a large number of objects with an equally large number of duplicates and it appears to have worked correctly.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
  10
Rob MacKay wrote:
How? Are you absolutely certain there will be no duplicates?


Once the object is initially created, the values used to determine it's uniqueness never change.


This is good.

Rob MacKay wrote:The object itself has many variables but only two of them, a String id and a String version, are used to create a unique hashcode. The hashcode is created using the code from the HashCodeUtil example in the Essential Java book.

This seems problematic. I have Effective Java right here, and can't find any HashCodeUtil - where does this come from? The hashing techniques in Effective Java do not in general guarantee uniqueness - they just make it statistically very likely. If you show me exactly what code is used to create the hashCode(), I can probably find two objects that would have the same hashCode().

I agree with Martin's comments about using hashCode() in an equals() or compareTo() method seeming like a bad idea. It's possible, if indeed you can guarantee it's unique, as you claimed. But I suspect that's not really the case here.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38818
    
  23
Since there are 2 to the 32nd different possible hash code (= number of different ints), and log(2) * 32 / log(26) = 6.8078737137076209154 (give or take the odd dozen), that means that the set of Strings with ≥ 7 lower-case letters and no other kind of characters will have a cardinality > 2 to the 32nd. This will absolutely guarantee duplication of hash codes if that set is used in full.
You cannot give any guarantee that hash codes will be unique, except possibly for classes like these two.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38818
    
  23
And the set of Strings with exactly 7 lower-case letters and no other characters has a cardinality greater than 2 to the 32nd, too.
Rob MacKay
Ranch Hand

Joined: Apr 06, 2007
Posts: 35
    
    1
Mike Simmons wrote:
This seems problematic. I have Effective Java right here, and can't find any HashCodeUtil - where does this come from? The hashing techniques in Effective Java do not in general guarantee uniqueness - they just make it statistically very likely.


Sorry - it's based on the techniques in Effective Java. An example is here:

http://www.javapractices.com/topic/TopicAction.do?Id=28

 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Duplicate objects ending up in a TreeSet