This week's book giveaway is in the OO, Patterns, UML and Refactoring forum. We're giving away four copies of Refactoring for Software Design Smells: Managing Technical Debt and have Girish Suryanarayana, Ganesh Samarthyam & Tushar Sharma on-line! See this thread for details.

the following words and codes are quoted from HashMap

what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables? i don't think it's a reason, because even if it was not power-of-two, it would still encounter collisions for hashCodes that do not differ in lower bits.

thanks.

science belief, great bioscience!

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted

0

maybe it's just an extra explanation, not a strong cause?

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted

0

well, how could this function

h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);

ensure a bounded number of collisions? from discarding the lower bits first?

First of all, this is implementation detail ... so for other implementations, either a different JVM or different version of Java, this may be moot.

drac yang wrote:
what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables? i don't think it's a reason, because even if it was not power-of-two, it would still encounter collisions for hashCodes that do not differ in lower bits.

Basically, it is trying to say, although somewhat poorly, that the implementation only cares about the lowest bits of the hashcode. Initially, only the lowest four bits; then later, the lowest five bits; then later, the lowest six bits, as the collection is growing the number of buckets. This means that the user's implementation of hashcode, while may be very good at getting a good distribution, but it won't matter, if it doesn't get a good distribution of the lowest bits.

The reason the collection rehashes is to try to get the distribution from the highest bits, and from the middle bits, into the lowest bits. This way it doesn't matter where this good distribution is -- it will be in the lowest bits.

h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);

ensure a bounded number of collisions? from discarding the lower bits first?

Well... it can't. The goal of the method isn't to ensure that the hashcode won't collide. The goal isn't even to ensure that the number of collision is bounded. If you write a crappy hashcode method, such as returning a constant, it will collide regardless of this rehashing.

The goal here is ... if a user does a good job at providing a hashcode with a good distribution, the collection won't throw it all away, by simply using the lowest bits.

Henry

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted

0

Henry Wong wrote:

Well... it can't. The goal of the method isn't to ensure that the hashcode won't collide. The goal isn't even to ensure that the number of collision is bounded. If you write a crappy hashcode method, such as returning a constant, it will collide regardless of this rehashing.

The goal here is ... if a user does a good job at providing a hashcode with a good distribution, the collection won't throw it all away, by simply using the lowest bits.

Henry

hi henry, thanks very much, i basically got what you meant by getting distribution from high and middle bits into lower bits, but why does it want to discard lower bits? because lower bits would collide more? i dont know why lower bits would collide more
besides, in the comment, it did mention bounded collide

// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).

what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables?

Because you don't understand why collision happens.

HashMapmust use power-of-two length backing array in order to calculate index for hash value generated by hash() method using key's hashCode.
This is a place wheer collision happens - when index of backing array is generated from hash value in indexFor method.
Code snippet you provided is from hash method that takes key's hashCode to generate good value which should decrease amount of collisions due to badly overriden hashCode. String is bad HashMap key because it has weak hashCode calculation.
Collision is happening in indexFor method when calculating modulo of division of hash value by backing array length.
Thus index returned by indexFor is always within bounds of backing array and of course indexFor might return the same index.
indexFor generates index at which your mapping is stored in backing array and it can return the same index.
You should understand indexFor method to understand where collision happens.

Volodymyr, the term "rehash" is used to mean a couple different things - even within the HashMap source code. Look at the JavaDoc comments for HashMap, and you see that "rehashing" is used to refer to the process of copying entries into a new table - the thing you just called transferring. Yes, they use "rehash" to mean something different, internally, within the source. Which I would ignore for purposes of public discussion - the JavaDoc is what people are supposed to read, to learn how a class works.

This topic is actually about something else, which is the supplemental hashing performed by HashMap.hash(Object). This supplemental hashing is, in fact, always used, regardless of any system property. However, one part of the algorithm (new in JDK 7) is referred to as alternative hashing. That (alternative) is the thing that is (a) only for Strings, and (b) only enabled if the property jdk.map.althashing.threshold is set. But supplemental hashing is used regardless; alt hashing is just one aspect of it.

When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

Really you are right they call transferring a rehash.
But it is very important to understand that no hash value is computed during resizing of backing array but mappings are just copied.
Meanwhile there is special system property you pointed out that if changed really means that
new hash value will be calculated for every key when resizing.

This topic is about this :

what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables?

My reply is:
power-of-two length is used for method indexFor to work correctly and just indexFor can generate the same index of backing table for different mappings you give via put method. This is the reason we can have multiple Entries at the same bucket.

A few words not about topic:
Hey! Java Ranch ! why I do not receive email notifications when answer is posted after my answer is posted. This is already for a few days so. Please do smth because I set my Preferences correctly. In fact I did not change them but for a few days I do not receive emails when answer is posted Thanks JavaRanch!

Volodymyr Levytskyi wrote:
This topic is about this :

what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables?

My reply is:
power-of-two length is used for method indexFor to work correctly and just indexFor can generate the same index of backing table for different mappings you give via put method. This is the reason we can have multiple Entries at the same bucket.

Unfortunately, this is circular reasoning. The implementation of Hashmap simply uses a power of two as possible capacities. And the indexFor() method, which maps hashcodes to buckets, is designed around that -- meaning since it is a power of two, it can use bit shifting to do the mapping. It is not the indexFor() method that determine the possible capacities. The indexFor() method is simply written for the design.

If the hashmap didn't use a power of two, and had fine grain control of capacity, then the implementation of the indexFor() method probably would had to use an algorithm that included the Java remainder operator instead. And arguably, the implementation would had been just as simple.

And the reason there can be multiple entries in a single bucket is because of the link list design of the bucket. This has nothing to do with the number of buckets, or how the entries are hashed.

Power-of-two length backing table is used for method indexFor to work correctly

You need to see at least this method. Here it is :

You see remainder of division(%) is used as result of this method due to &.
Using & correctly would be impossible if length backing table was not power-of-two .

Please read this small part in wikipedia .
This explains that programmatic & or mathematical % return the same result only if power-of-two length bcking table is used.

The implementation of Hashmap simply uses a power of two as possible capacities.

Maybe I am wrong, sorry, but I believe that power-of-two length is deliberately used for method indexFor to work correctly.
What does indexFor do ?
It determines index of backing table for every new mapping. And just this method produces identical indices for different mappings.
In HashMap everything is rrotating around indexFor method. Meaning hash() method and power-of-two length are for indexFor method
to produce different indices for different mappings.
Of course I may be wrong but where...

@Henry Wong please tell java ranch that I do not receive email notifications when new answer is posted.
Thanks!

Volodymyr Levytskyi wrote:Maybe I am wrong, sorry, but I believe that power-of-two length is deliberately used for method indexFor to work correctly.

Actually, I think it's more likely to make it work fast - although there may be some aspects of "retention of randomness" involved.

Assuming that all the mangling ops that happen to the hash before it gets to this point have produced a reasonably random number, then 'hash % length' should work just as well; but it's much slower than a simple masking op. And speed is an essential part of why we use hashed collections.

However, it is also possible that an unlucky choice of length might highlight deficiencies in the "bit mangling" process that a simple mask doesn't, leading to skewed bucket distributions.

Volodymyr Levytskyi wrote:Maybe I am wrong, sorry, but I believe that power-of-two length is deliberately used for method indexFor to work correctly.

Actually, I think it's more likely to make it work fast - although there may be some aspects of "retention of randomness" involved.

Assuming that all the mangling ops that happen to the hash before it gets to this point have produced a reasonably random number, then 'hash % length' should work just as well; but it's much slower than a simple masking op. And speed is an essential part of why we use hashed collections.

However, it is also possible that an unlucky choice of length might highlight deficiencies in the "bit mangling" process that a simple mask doesn't, leading to skewed bucket distributions.

I am not sure if I even agree with the "fast" argument either. Okay, I do agree that it was probably the reason -- I just don't agree that it was a good decision (if it is true) ... Yes, it is faster. The AND operator probably maps to the native AND instruction, and hence, can be done in a single cycle. The remainder operator probably maps to the native DIV instruction (on Intel), and the last time I checked, it took about a dozen cycles. Of course, this was before Nehalem, so it could be much better now. Regardless, on a 3Ghz machine, one cycle is about 300 picoseconds, so we are talking about 3 nanoseconds slower. Anyway, considering the many other things that are done during a rehash, and the fact that a rehash is an operation that only takes place when the map needs to resize, I can't imagine that this would even be noticeable.

As a side note, why do I dislike the design? Basically, I got burned by it. The problem with the power of two algorithm, is that it is more of a memory hog as it gets bigger. The number of buckets double in size as it needs more buckets. On the other hand, as you need more buckets, you also have a need to be more efficient with memory (as you are using more). So, this inefficiency happens at a bad time.

Volodymyr Levytskyi wrote:
Please read this small part in wikipedia .
This explains that programmatic & or mathematical % return the same result only if power-of-two length bcking table is used.

Yes, with power of two, I can easily do a mapping with both operators. Without it, I can only easily do it with the remainder operator.

Volodymyr Levytskyi wrote:

The implementation of Hashmap simply uses a power of two as possible capacities.

Maybe I am wrong, sorry, but I believe that power-of-two length is deliberately used for method indexFor to work correctly.
What does indexFor do ?
It determines index of backing table for every new mapping. And just this method produces identical indices for different mappings.
In HashMap everything is rrotating around indexFor method. Meaning hash() method and power-of-two length are for indexFor method
to produce different indices for different mappings.
Of course I may be wrong but where...

Well, we don't have the hashmap designer here, so this part is speculation... but which is more likely...

(A) A power of two design was done, most likely on the belief that it is a noticeable faster design. The coding was done, and from that the developer wrote lots of private implementation methods, including the indexFor() method.

(B) The designer really liked that a bit mask with a single bit, when subtract by one, turns on all bits lower than that bit. And then proceed to design something that would use that idea.

I guess choice B is possible, but it is not reasonable. It is obviously a case of the "tail wagging the dog" argument here.

Volodymyr Levytskyi wrote:
@Henry Wong please tell java ranch that I do not receive email notifications when new answer is posted.

Henry Wong wrote:As a side note, why do I dislike the design? Basically, I got burned by it. The problem with the power of two algorithm, is that it is more of a memory hog as it gets bigger. The number of buckets double in size as it needs more buckets. On the other hand, as you need more buckets, you also have a need to be more efficient with memory (as you are using more). So, this inefficiency happens at a bad time.

Hey, you get no argument from me. It feels like an "optimization too far" to me as well.

However, I will say this:
1. I remember seeing docs about that 'pre-mangling' that say it's an attempt to "make poor hashes work better" (???).
2. My maths simply isn't good enough to know whether that 'pre-mangling' might skew bucket hits if used with "% length" (actually, "% capacity"); however, it seems unlikely to me that a simple mask would make it any worse.

What about this? Forget the mangling and use '% capacity', BUT make the initial bucket capacity equal to
int(requested_capacity / loading_factor) + 1 something that the code (at least up to v6) signally fails to do.

Volodymyr Levytskyi wrote:Maybe I am wrong, sorry, but I believe that power-of-two length is deliberately used for method indexFor to work correctly.
What does indexFor do ?
It determines index of backing table for every new mapping. And just this method produces identical indices for different mappings.

I got a bit lost in your reasoning. If two mappings provide identical indices for each and every hash, they are - by definition - identical. Different mappings provide different indices for at least some hashes, don't they?

The indexFor simply computes an index for given hash and size. It just needs to be deterministic (so that the hash can be later found in the bucket it was originally put to). When using a modulo any length would generally work. However, different divisors might lead to different average number of collisions. I believe that prime numbers as divisors in the modulo operation are on average best for avoiding collisions. I also guess that using powers of two ensures that the number of collisions are more or less consistent as the size and capacity grows (every doubling of capacity adds one more bit to the effective length of the key); using a different system of sizes (say, a sequence of primes) might require a lot of effort and some number theory to make sure that the average number of collisions is consistent across different sizes/capacities.

I also guess that using powers of two ensures that the number of collisions are more or less consistent as the size and capacity grows

This really might be the reason why power-of-two length backing array is used - to reduce the number of the same indices produced by indexFor method.
I agree. But the AND operation(&) used by indexFor method requires power of two length to search for modulo of division as described here

what is that binary value
1000
in decimal?
And what is the binary value
10000
?

I understand this 8 and 32.
What does effective length of the key mean.
Key you provided to store in map is not changed and no bits are added to your key.

I agree that capacity is doubled to have indexFor method generate different indices thus different buckets.

Volodymyr Levytskyi wrote:But the AND operation(&) used by indexFor method requires power of two length to search for modulo of division as described here

Yes, it does, but to me (and others in this thread) this alone doesn't seem a convincing enough reason to use powers of two.

If you want to use and, you have to use powers of two as a divisor.
If you merely happen to use powers of two divisors (for some reason), you may use and instead of modulo.

So there are several possible ways which might lead to HashMap using and instead of modulo, and it is not possible to reliably infer the reason just from the fact that it does so. Unless one is Sherlock Holmes; that chap did it all the time.

every doubling of capacity adds one more bit to the effective length of the key

What do you mean by this? Can you explain this?

It's simple actually. I'll try to illustrate it on an example. Consider two HashMaps, one with capacity of 16 and the other with capacity of 32:

The indexFor method will, in the first case, return just the lowest four bits of its parameter (15 in dec is 1111 in binary) - all other bits are masked away. In the second case, five lowest bits will be returned (31 dec is 11111 bin). So doubling the capacity means that one more bit of the h parameter is used when computing the index.

So, the number of bits just determines the range of possible values. In the powers-of-two scheme, the number of bits is always a whole number. If you use a scheme that is not a power of two, the resulting index will contain a fractional number of bits-worth of information from the original hash key. That doesn't make sense for individual calculation obviously, but can be important for statistical purposes.

The "why" questions are often tricky. In this case, as in many others, the answer really is: we don't know.

Given that this scheme clearly has some disadvantages (see Henry's response a few posts above), it is clear this is not the best of possible designs. To know why this particular one was chosen, we would have to contact whoever designed it that way and ask her/him. Anything else is just speculation.

However, I'm not sure I agree with the original post:

drac yang wrote:what i slightly don't understand is why it emphasized the reason that HashMap uses power-of-two length hash tables? i don't think it's a reason, because even if it was not power-of-two, it would still encounter collisions for hashCodes that do not differ in lower bits.

Modulo by a number other than power of two generally do not preserve value of the lower bits. There would still be collisions with multiples of the divisors used, but multiples of powers of two are, I believe, much more common in applications than, for example, multiples of primes. I think that the rehashing is there because of choosing the power of two as divisors, not the other way around.