File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Java in General and the fly likes hashCode of class String Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "hashCode of class String" Watch "hashCode of class String" New topic
Author

hashCode of class String

Wolfgang Troescher
Greenhorn

Joined: Feb 12, 2010
Posts: 14
Hello!

I hava a question about the implementation of the hashcode algorithm in class String.

As we know, the implementation is: hashCode = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Just to play around a little bit, I tried to implement this algorithm by myself:



But: This implementation provides correct values just for short Strings (same results as s.hashCode() ), whereas the following provides always correct results:



Where is the difference between these implementations? I think in theory both must provide same results. I´m sure I overlook something...

Wolfgang
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14157
    
  19

It's just a guess, but it could be this: Math.pow() works on floating-point numbers of type double. The data type double has limited precision. You might get rounding errors which will give you a different result than when you'd use integer computations.


Java Beginners FAQ - JavaRanch SCJP FAQ - The Java Tutorial - Java SE 7 API documentation
Scala Notes - My blog about Scala
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38902
    
  23
You don’t calculate powers of 31 like that at all. You do something like this:-Of course in an immutable class you would probably cache a hash (what a dreadful rhyme) rather than calculating it every time it is used.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38902
    
  23
Your use of left shift and - may be what the * 31 bit is optimised to.
Wolfgang Troescher
Greenhorn

Joined: Feb 12, 2010
Posts: 14
Jesper de Jong wrote:It's just a guess, but it could be this: Math.pow() works on floating-point numbers of type double. The data type double has limited precision. You might get rounding errors which will give you a different result than when you'd use integer computations.


I tried it with BigDecimal instead of int/floats, but the same results...

For example: with test-String "abcde" my method provides the same results like hashCode()-Method (92599395), but if I add a character more, let´s say an 'f' (abcdef), I get different results (my method: 2147483647, hashCode(): -1424385949).
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38902
    
  23
What happens if you keep going with Math.pow is that you will eventually exceed the maximum value of an int. You have probably missed out a cast to (int) somewhere. Casting such a double will produce 2147483647, so you will be repeatedly multiplying by that number and adding. That is almost certainly not how the hashCode method is implemented. It is probably implemented as I showed it, and you showed a similar algorithm with a premature optimisation combining << and -.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3015
    
  10
Campbell Ritchie wrote:Your use of left shift and - may be what the * 31 bit is optimised to.

Yes, definitely. (x << 5) is the same as x * 2^5 which is x * 32. Therefore ((x << 5) - x) is the same as (32 * x - x), which is 31 * x.

Campbell Ritchie wrote:and you showed a similar algorithm with a premature optimisation combining << and -.

Why assume it's premature? Could be it was well tested back when the optimization was made, and made sense for the hardware and software it was run on. At least, Josh Bloch thought so. This is the original algorithm used in String.java. It's not necessary in the modern JDK code, because as Bloch notes in Effective Java, "modern VMs do this sort of optimization automatically." But calling it premature seems uncharitable, considering Wolfgang is probably just reading his copy of Effective Java where this very code is discussed.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38902
    
  23
I thought Bloch says it is probably optimised to h << 5 - h. But you cannot be sure that is how it is optimised. Anyway, as Brian Goetz reminds us, “write dumb code”. If there is such an optimisation for * 31, it will be used anyway. If, however, somebody ever invents a better optimisation, then using h << 5 - h will miss that optimisation. So trying to micro‑optimise in code like that is unnecessary in my opinion.

Sorry if I was uncharitable or rude or anything.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

If I recall correctly, and if the state of the art hasn't changed in this respect since I learned it, the shift and add can be significantly more performant than the equivalent multiplication. I think this is due to pipeline stall introduced by multiplication. A smart compiler (as in hotspot, not javac) would probably recognize multiplications by 2^N +/- 1 and convert to the equivalent shift operation if it's appropriate to do so on the underlying hardware. So you're probably safe with X * 31, but in a case like hashCode(), where it could be called often, can be doing lots of multiplications, and specifically exists to improve performance, I don't think it's necessarily premature to explicitly write out X << 5 - X.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3015
    
  10
Campbell Ritchie wrote:I thought Bloch says it is probably optimised to h << 5 - h. But you cannot be sure that is how it is optimised. Anyway, as Brian Goetz reminds us, “write dumb code”. If there is such an optimisation for * 31, it will be used anyway. If, however, somebody ever invents a better optimisation, then using h << 5 - h will miss that optimisation. So trying to micro‑optimise in code like that is unnecessary in my opinion.

I don't see a "probably" in Bloch's entry on that topic, but I agree we don't know for sure. I think Bloch was quite confident it would be done using the then-current JDK from Sun, but he couldn't know what changes might be made in the future. Then too, the JIT may not optimize it at all until the code in question has been run a few times - that's the nature of just-in-time compilation. So yes, I agree with you on all points. Except to point out that the poster you're responding to probably wasn't optimizing the code at all - he was analyzing code that someone else had written.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: hashCode of class String