kri shan wrote:Can I avoid collisions with the default load factor of 0.75 by setting the right initial capacity (number of slots)?
No, you cannot. A collision is a collision is a collision. If there are more possible instances than possible hash code values, there MUST be collisions (if there are 367 people in one room, at least two of them must share a birthday). HashMap can handle them because it first checks hashCode() and, if there is a collision, then checks equals().
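For illustration, here is a minimal sketch (the CollidingKey class is hypothetical, written just for this example) of two distinct keys that deliberately share a hash code; the HashMap still keeps them apart because it falls back on equals():

import java.util.HashMap;
import java.util.Map;

// Hypothetical key class whose hashCode() deliberately collides for different
// values, to show that HashMap still keeps the entries separate via equals().
public final class CollidingKey {
    private final String name;

    CollidingKey(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 42;   // every instance collides on purpose
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CollidingKey && ((CollidingKey) o).name.equals(name);
    }

    public static void main(String[] args) {
        Map<CollidingKey, String> map = new HashMap<>();
        map.put(new CollidingKey("a"), "first");
        map.put(new CollidingKey("b"), "second");

        // Both keys land in the same bucket, but equals() keeps them apart.
        System.out.println(map.size());                       // prints 2
        System.out.println(map.get(new CollidingKey("a")));   // prints first
        System.out.println(map.get(new CollidingKey("b")));   // prints second
    }
}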
I think what Christian means to say is that in general it's not possible to completely avoid collisions. For any data type with more than four bytes of data, collisions will be possible. However, it's still possible to avoid them in practice, meaning minimize their chance of occurrence, by using a well-chosen hash function.
Stephan van Hulst wrote:Load factor and capacity have nothing to do with collisions. Collisions are purely affected by the quality of the hashCode() method.
I disagree. There are two types of collisions:
1. Two objects have the same hash code, and the HashMap needs to check the equals() method to see if they are the same.
2. Two objects have different hash codes, but end up in the same bucket anyway (because the current capacity is less than Integer.MAX_VALUE). In this case, the HashMap just needs to check the actual hashCode() of each object (which was saved when the object was put in the Map) to see if they are the same.
The first type of collision is determined only by hash code, sure. But the second certainly is affected by the capacity, which is affected by the load factor.
In general, setting a lower load factor will decrease collisions (of the second type, which is much more common) at the cost of using more memory. It's a trade-off.
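To make that concrete, here is a rough simulation (not HashMap's actual code) that assumes a power-of-two table indexed with hash & (capacity - 1): with the same set of distinct hash codes, a larger table, which is what a lower load factor effectively buys you, leaves fewer entries sharing a bucket.

import java.util.Random;

// A small simulation of the second type of collision: distinct hash codes
// that land in the same bucket. The bucket index is computed the way a
// power-of-two table would compute it: hash & (capacity - 1).
public class BucketCollisionSimulation {

    static long countSharedBucketEntries(int[] hashes, int capacity) {
        int[] bucketCounts = new int[capacity];
        long shared = 0;
        for (int h : hashes) {
            int index = h & (capacity - 1);
            if (bucketCounts[index]++ > 0) {
                shared++;   // this entry shares a bucket with an earlier one
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] hashes = new int[1_000];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = random.nextInt();   // effectively distinct hash codes
        }

        // A bigger table (lower effective load factor) means fewer shared
        // buckets, at the cost of a larger backing array.
        System.out.println("capacity  1024: " + countSharedBucketEntries(hashes, 1024));
        System.out.println("capacity  4096: " + countSharedBucketEntries(hashes, 4096));
        System.out.println("capacity 16384: " + countSharedBucketEntries(hashes, 16384));
    }
}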
kri shan wrote:if two objects have different hash codes, but end up in the same bucket anyway (because the current capacity is less than Integer.MAX_VALUE).
I do not think so.
If two objects have different hash codes, then they should be placed in different hash buckets.
Well, no. There are 2^32 different hash codes possible. If you require different hash codes to go into different buckets then you would have to have 2^32 buckets. Obviously that doesn't make any sense at all.
If you read the code, you find the original hash code is rehashed, and then AND-ed with capacity - 1. That means in a capacity 16 Map, you execute rehash & 0x000f; in an array-based implementation, you use that result to get the index in the array where the bucket is. That means any rehashes whose 4 least significant bits are the same will suffer a hash collision. When you get to 12 members (if the load factor is 0.75), the capacity is doubled to 32, and the entries are redistributed using rehash & 0x001f instead. Now a hash collision occurs only if the 5 least significant bits are the same.
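As a small sketch of that mask step, here is what the index calculation looks like. The spread() method below only mirrors the idea of mixing the high bits of the hash code into the low bits; the exact rehash function differs between JDK versions.

public class BucketIndexDemo {

    // Mix the high bits of the hash code into the low bits, so the
    // "& (capacity - 1)" step doesn't ignore them entirely. (The real
    // supplemental hash differs between JDK versions; this is just the idea.)
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // For a power-of-two capacity this is equivalent to (hash mod capacity),
    // but done with a cheap bitwise AND against the mask capacity - 1.
    static int bucketIndex(int hashCode, int capacity) {
        return spread(hashCode) & (capacity - 1);
    }

    public static void main(String[] args) {
        // 17 (binary 10001) and 33 (binary 100001) share their 4 least
        // significant bits, so at capacity 16 they land in the same bucket...
        System.out.println("capacity 16: " + bucketIndex(17, 16) + " and " + bucketIndex(33, 16));
        // ...but at capacity 32 the fifth bit separates them.
        System.out.println("capacity 32: " + bucketIndex(17, 32) + " and " + bucketIndex(33, 32));
    }
}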
You mustn't worry about hash collisions. They are not some sort of collision on the roads; they don't do any harm.
Well, they can do "harm" in the sense that certain use cases can have very poor performance. (This was particularly true before they introduced the rehash.) Having a few hash collisions is no big deal, but having a lot of them can be bad.
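As one rough illustration (the timings are illustrative, not a proper benchmark, and the exact degradation depends on the JDK version): a hypothetical key class with a constant hashCode() forces every entry into a single bucket, and inserts slow down noticeably compared with well-spread keys.

import java.util.HashMap;
import java.util.Map;

public class WorstCaseHashDemo {

    // Hypothetical key whose hashCode() is constant, so every instance
    // collides and every entry ends up in the same bucket.
    static final class BadKey {
        private final int id;
        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 1; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        int n = 20_000;

        long start = System.nanoTime();
        Map<Integer, Integer> good = new HashMap<>();
        for (int i = 0; i < n; i++) {
            good.put(i, i);            // Integer has a well-spread hashCode()
        }
        System.out.printf("well-spread keys: %d ms%n", (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        Map<BadKey, Integer> bad = new HashMap<>();
        for (int i = 0; i < n; i++) {
            bad.put(new BadKey(i), i); // every put has to search one overcrowded bucket
        }
        System.out.printf("colliding keys:   %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}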