Because the table uses power-of-two masking
Hash tables grow in powers of two. Imagine you have a map with both capacity and size 4, and you add another element. The table will then double in size to accommodate the new element. An easy way to turn the hash of an object into an index into the table is to mask the hash with the capacity (a power of two) minus one. For example, if you have an object whose hash has 1101 as its least significant four bits, and the capacity of the table is 8, you perform 1101 & (capacity - 1) to get 101 (5 in decimal), an index that fits inside the table.
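To make the masking step concrete, here is a minimal sketch of the example above. The class name MaskIndexDemo and the method name indexFor are made up for illustration and are not part of any library; the sketch also assumes the capacity is always a power of two.

```java
// Hypothetical demo class: indexFor is an illustrative name, not a library method.
final class MaskIndexDemo {

    // Maps a hash to a bucket index. Only valid when capacity is a power of two,
    // because only then is (capacity - 1) a run of consecutive low 1 bits.
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        int hash = 0b1101;   // least significant four bits from the example
        int capacity = 8;    // capacity - 1 is 0b0111
        int index = indexFor(hash, capacity);
        // Prints "101 = 5"
        System.out.println(Integer.toBinaryString(index) + " = " + index);
    }
}
```

The mask does the same job as hash % capacity for non-negative hashes, but with a single AND instruction, which is why the power-of-two capacity is convenient.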
sets of hashes that vary only in bits above the current mask will always collide
Imagine one object generates a hash code that ends in 00011101, and another object generates a hash code that ends in 01011101. If the capacity of the table is 8, both hashes will yield index 5, because only the least significant three bits (101) are used. That means it's very likely that objects will collide if the least significant bits are generated from a field that rarely differs in value between objects.
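The collision can be checked directly. This is a small sketch using the two bit patterns from the example; MaskCollisionDemo and indexFor are hypothetical names chosen for illustration, and the hashes are masked as-is with no extra spreading applied.

```java
// Hypothetical demo class showing that hashes differing only above the mask collide.
final class MaskCollisionDemo {

    // Same power-of-two masking as described above (illustrative helper).
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        int a = 0b00011101;  // 29
        int b = 0b01011101;  // 93, differs from a only in bit 6

        // Capacity 8 keeps bits 0..2, so both hashes land in bucket 5.
        System.out.println(indexFor(a, 8) + " " + indexFor(b, 8));     // prints "5 5"

        // Capacity 64 keeps bits 0..5, which is still below the differing bit.
        System.out.println(indexFor(a, 64) + " " + indexFor(b, 64));   // prints "29 29"

        // Only at capacity 128 does bit 6 enter the mask and separate them.
        System.out.println(indexFor(a, 128) + " " + indexFor(b, 128)); // prints "29 93"
    }
}
```

So the two objects keep colliding through several doublings of the table, until the capacity is large enough for the mask to reach the bit in which they differ.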
So we apply a transform that spreads the impact of higher bits downward.
This means that the most significant 16 bits are XORed into the least significant 16 bits of the hash code, to increase the chances that two similar objects end up with different bit sequences in the part that is used for the table index.
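Here is a rough sketch of that transform and its effect, assuming 32-bit int hash codes. The names SpreadDemo, spread and indexFor are made up for this example; the expression h ^ (h >>> 16) is simply the XOR-with-shifted-upper-half described above.

```java
// Hypothetical demo class: spread and indexFor are illustrative names.
final class SpreadDemo {

    // XORs the upper 16 bits of h into its lower 16 bits.
    // >>> is the unsigned shift, so the vacated upper bits are filled with zeros.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Power-of-two masking, as in the earlier examples.
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        int h1 = 0x00010000;  // these two hashes differ only in the upper 16 bits
        int h2 = 0x00020000;
        int capacity = 16;

        // Without spreading, both land in bucket 0.
        System.out.println(indexFor(h1, capacity) + " " + indexFor(h2, capacity)); // prints "0 0"

        // With spreading, the upper bits reach the index: buckets 1 and 2.
        System.out.println(indexFor(spread(h1), capacity) + " "
                + indexFor(spread(h2), capacity));                                 // prints "1 2"
    }
}
```

Note that the upper 16 bits of the result are left unchanged, so no information is lost; the transform only mixes extra information into the bits that the mask actually looks at.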
Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.
Usually the hash codes that objects generate are already distributed well enough. And even when many keys do collide in the same bin, the bin is stored as a tree, so inserts and lookups in it stay reasonably efficient. That means that an extra, more expensive transformation that guarantees uniformly distributed hash codes would cost more than it gains. XORing the least significant bits of a hash code with its most significant bits is just a really cheap optimization that prevents a very specific class of hash codes from resulting in really poor hash table performance.