aspose file tools*
The moose likes Java in General and the fly likes two string with same hashcode Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "two string with same hashcode" Watch "two string with same hashcode" New topic
Author

two string with same hashcode

priyanaka jaiswal
Ranch Hand

Joined: Jun 03, 2011
Posts: 79

How to make two different string to have same hashcode?




Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 13875
    
  10

That's not so easy. You'd have to know how String.hashCode() calculates its return value exactly. You can look this up, the source code of class String is available in src.zip which is in your JDK installation directory.

Note, however, that it is made in such a way that different strings have different hash codes as much as possible - so it will still not be easy to find two strings that result in the same hash code.

(In Java, the class is named String, not string - the capital S is important).


Java Beginners FAQ - JavaRanch SCJP FAQ - The Java Tutorial - Java SE 7 API documentation
Scala Notes - My blog about Scala
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4240
    
    7

The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version. I suppose the obvious question is, why do you want to do this? There's probably a better way of doing what you want to do.
Junilu Lacar
Bartender

Joined: Feb 26, 2001
Posts: 4419
    
    5

priyanaka jaiswal wrote:
How to make two different string to have same hashcode?

Can you shed some light on your reason for wanting to do that? I'm always curious to learn why people want to go against the grain and set themselves up for potential problems. Kind of like, "Yes, you can read that book upside down but why would you want to do that?"

As it is implemented, it might be impossible for two String objects which are unequal according to equals() to have the same value for hashCode but the contract for hashCode() states that this is NOT a requirement. That is, the values returned by hashCode() do not have to be different if two objects are unequal according to equals but it helps improve performance of hashtables if they are.


Junilu - [How to Ask Questions] [How to Answer Questions]
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4240
    
    7

Junilu Lacar wrote:As it is implemented, it might be impossible for two String objects which are unequal according to equals() to have the same value for hashCode


It's definitely possible, because there are more possible Strings than possible hash code values. But that doesn't mean it's useful .
Junilu Lacar
Bartender

Joined: Feb 26, 2001
Posts: 4419
    
    5

Matthew Brown wrote:It's definitely possible, because there are more possible Strings than possible hash code values. .

Ah yes, makes sense. A quick search turns up "FB" and "Ea" as a typical example.
Yuriy Fuksenko
Ranch Hand

Joined: Feb 02, 2001
Posts: 413
It depends on the way you going to create/use those strings.

Look at the following code, for example:


Here is the output:

true
true
false
true
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
The char variable has a range of 65535 values

The hash value of a string is given by
h(i+1) = h(i)*31 + (int)str[i];

Since 31 is less than 65535, there are a multitude of strings that have the same hash value.

Example, for a four character string, h = 29791 * (int)str[0] + 961* (int)str[1] + 31* (int)str[2] + (int)str[3]

So, a matching hash is given by making trivial changes. For example, increase str[2] by one, and decrease str[3] by 31

hash("abyz") = 2987778
hash("abz]") = 2987778



If the hash multiplier, 31, was larger than 65535, then there would be no values that generate the same hash.

In general, since the ascii codes range from 33 to 175, or 142 values, you could shift the next to last character by up to 3 places up or down, and adjust the last character to generate the same hash. If you use non ascii codes, or the full 65535 values, you could adjust the last four characters and obtain different strings that generate the same hash code.
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
Below is an example code to generate strings with matching hash codes.

Junilu Lacar
Bartender

Joined: Feb 26, 2001
Posts: 4419
    
    5

priyanaka jaiswal wrote:How to make two different string to have same hashcode?

I guess the only thing left to resolve now is whether the OP meant to generate unequal strings that have the same hashCode or change the hashCodes of two unequal strings so that they become equal (without changing the strings).

That is, by "make" did the OP mean "to create" as in "I made a copy of the file" or "to cause" as in "The magician made the car disappear"?
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
Since the hash value of a string is given by
h(i+1) = h(i)*31 + (int)str[i];

This expands to
h = ((int)str[0])*31^(length(str)-1) + ((int)str[1])*31^(length(str)-2) + ((int)str[2])*31^(length(str)-3) + ((int)str[i])*31^(length(str)-(i+1)) ... ((int)str[length(str)-2])*31) + (int)str[length(str)-1]

One approach to generating a second string with the same hash value is to modify the characters so that the same value comes up. If we limit the range of characters to the ASCII set, then this means that the second to last character can be changed by up to three values (i) and the last character changed by -i*31.

A second way that two different strings can generate the same hash is to use the range of int which is from -2,147,483,648 and 2,147,483,647 inclusive.. Since 31^7 = 27,512,614,111, this suggests that the hash value of a string with more than 6 characters would exceed the range of an int, causing folding, and providing a second methodology for matching the hash values of two different strings. If we add one to the maximum integer, we obtain the minimum integer. This suggests that the hash value is given by

ht = h(i)*31 + (int)str[i];
while (ht > 2147483647) ht = ht - 4294967295
h(i+1) = ht

Note that 4294967295 = 4*31^6 + 26*31^5 + 19*31^3 + 29*31^2 + 24*31 + 3

So, that suggests that if we start with a string with more than six characters and shift the seventh character from the end by 4, and the sixth character from the end by 26 and the fourth character from the end by 19 and the third character from the end by 29 and the second character from the end by 24 and the last character by 3, we should obtain a different string with the same hash value. Actually, the last character has to be shifted by 4.

The code below generates both types of string modifications to obtain new strings with the same hash code. The integer folding could be repeated to create additional different strings. The integer folding approach was not checked for generating only ASCII characters, so the printed characters may be misleading.



Output

Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4240
    
    7

Mike Simmons wrote:
Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.

Good point. I was lazy, and didn't check the API docs to see if the algorithm was specified: I just guessed (wrongly) that it probably wouldn't be.

(Realistically, I suppose it's unlikely anyone would bother changing the implementation even if it wasn't specified, since it seems to be "good enough" and has been around a long time).
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
It may be worth pointing out that since the hash code is an int, and there are 4294967295 values for int, all possible hash codes can be generated with a seven character string.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Matthew Brown wrote:
Mike Simmons wrote:
Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.

Good point. I was lazy, and didn't check the API docs to see if the algorithm was specified: I just guessed (wrongly) that it probably wouldn't be.

(Realistically, I suppose it's unlikely anyone would bother changing the implementation even if it wasn't specified, since it seems to be "good enough" and has been around a long time).


I seem to remember, back around Java 1.1 or 1.2, String's hashCode() only looked at the first 8 characters. I assume this was because there was some truth to the "Java is slow" mantra back then. This led to some really awful HashMap performance.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12678
    
    5
Right you are Jeff, there was a change between early versions of Java.


Calculating a string hashcode is always a tradeoff between speed and hashcode "quality."


Bill

Java Resources at www.wbrogden.com
drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 60
John Vorwald wrote:It may be worth pointing out that since the hash code is an int, and there are 4294967295 values for int, all possible hash codes can be generated with a seven character string.


if it's only for alphabets, seven character string could represent totally 26^7 = 8031810176 which is bigger than all the int number 2^32 = 4294967296

it might be sufficient though, it's possible to find two different strings with the same hashcode.


science belief, great bioscience!
Steve Fahlbusch
Bartender

Joined: Sep 18, 2000
Posts: 556
    
    7

But that doesnt mean that each of those have a unique hash code - only the max amount that can be handled as primary hash values.
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
drac yang wrote:
John Vorwald wrote:It may be worth pointing out that since the hash code is an int, and there are 4294967295 values for int, all possible hash codes can be generated with a seven character string.


if it's only for alphabets, seven character string could represent totally 26^7 = 8031810176 which is bigger than all the int number 2^32 = 4294967296

it might be sufficient though, it's possible to find two different strings with the same hashcode.


The question refers to String hash codes, which are given by sum((int)(char[i]) * 31^(nChar-i)) with mod applied when the value exceeds Integer max value.

If we use the printable ascii char, the values range from 32 to 127

We see that 127 * 31^(6-1) = 3607273026, which is less than 4294967296. There needs to be a 7th char to get the required amount of hash values.

The number of hash codes doesn't depend on 26^nchar, but rather (int(char)*31^(nchar-1)). The 26^nchar is the number of possible arrangements of single case (lower or upper) words, but not the number unique hash values.

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36498
    
  16
Indeed we have seen that some Strings return the same hash code, so you might require more than 7 letters to exhaust all the possible hash codes.
Sounds like some complicated maths to solve a non‑problem.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Campbell Ritchie wrote:Indeed we have seen that some Strings return the same hash code, so you might require more than 7 letters to exhaust all the possible hash codes.

No, you don't. The range covered is continuous across integers, with no gaps. Except as noted below, for a 6-character string.

Using chars in the range 32-127, here are the ranges of possible hash values:


Only the last two lines are affected by overflow. After overflow, the line for length 6 splits into ranges -2147483648 to -27554432 and 1073741824 to 2147483647. The line for length 7 covers all possible ints.

Campbell Ritchie wrote:Sounds like some complicated maths to solve a non-problem.

The math is simple enough - in each step we effectively just multiply the min and max by 32. If the problem isn't interesting to someone, they can always skip it. The original question was answered some time ago.
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
Mike,
That's an interesting response. I'm going to see if I can take it a little further, but include some code due to the details... It looks like 7 characters are needed to cover the full range.

John



Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18127
    
    8

Or to put it more simply:

There is 1 string with 0 chars.

There are 2^16 strings with 1 char.

There are 2^32 strings with 2 chars.

There are 2^48 strings with 3 chars.

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36498
    
  16
How do you know there are 2¹⁶ different Strings with 1 character? There are not 2¹⁶ different Unicode characters between 0 and \uffff. Unicode has gaps in, and some of the characters which exist are supplemental characters. So the possible range of hashcodes is all full of gaps. Eventually the ranges will overlap and fill those gaps, however.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Campbell Ritchie wrote:How do you know there are 2¹⁶ different Strings with 1 character? There are not 2¹⁶ different Unicode characters between 0 and \uffff. Unicode has gaps in,


Does that stop us from constructing a String out of nonexistent characters? I'm not being snarky. I honestly don't know, and can't be arsed to find out for myself.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18127
    
    8

Jeff Verdegan wrote:
Campbell Ritchie wrote:How do you know there are 2¹⁶ different Strings with 1 character? There are not 2¹⁶ different Unicode characters between 0 and \uffff. Unicode has gaps in,


Does that stop us from constructing a String out of nonexistent characters? I'm not being snarky. I honestly don't know, and can't be arsed to find out for myself.


A char can contain any of 2^16 possibilities. It's true that some of those don't have meanings in Unicode, but nevertheless they can be created in Java.

And a String can be made up of any list of chars. It's true that some of those may not even be valid in Unicode because they misuse surrogate pairs, but nevertheless they can be created in Java.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36498
    
  16
I think Paul C is correct. You can create a String out of non‑Unicode characters, in which case there are 2¹⁶ possibilities per char, even if you cannot display such a String. I shall have to try it.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Yes, it's part of the Java definitions of char and String. There's no requirement that a char have any defined meaning in Unicode. Here's an example:

From the Unicode chart here, you can see that FF00 has no defined value, while FF01 is a full-width exclamation mark (!). When I try to print these, I get a "＀!" indicating the system doesn't know what to put for the first value. But it's still a valid String, with no exceptions thrown.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Paul Clapham wrote:Or to put it more simply:

[...]

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.

I think we all agree that there are multiple strings per hashcode - John and I are discussing whether there are any hashcodes that can not be achieved by a String of a given length, with a "printable ascii" restriction. I guess this actually excludes 127, since that's a delete char. Anyway we seem to agree that by the time you get to 7 chars, all possibilities are covered. Obviously, if more chars values are allowed, that goal can be achieved with fewer chars.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36498
    
  16
Try it and see
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Looks like another way to form a legal Java String out of characters that don't have meaning in Unicode (not when put together that way, at least). Which continues to support Paul's point about each char being one of 2^16 possibilities, regardless of their Unicode validity, no?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18127
    
    8

Mike Simmons wrote:John and I are discussing whether there are any hashcodes that can not be achieved by a String of a given length, with a "printable ascii" restriction.


Ah yes, that's a much less trivial problem. And therefore more interesting. I don't see any reason for restricting the contents to printable ASCII characters, though, although I suppose that makes the coding a bit simpler.
John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139
The original question was about different strings that had the same hash code. I sat in a facial recognition meeting last week, and part of the meeting dealt with patterns that were used to create binary scores to quickly locate the face and/or facial features in a digital photo. I wonder if that approach could be applied to the problem of finding different strings that have the same hash code. My simple understanding of the approach was to try a pattern on a number of test cases, and if the pattern had better than 51% chance of success (or failure in which the criteria was reversed), then the pattern could be used. Just a thought, the program isn't obvious to me, so maybe another day....
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36498
    
  16
Mike Simmons wrote:Looks like another way to form a legal Java String out of characters that don't have meaning in Unicode (not when put together that way, at least). . . .
Yes, I tried that in an attempt to confirm Paul C’s point and to see whether I was mistaken there.
drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 60
Paul Clapham wrote:Or to put it more simply:

There is 1 string with 0 chars.

There are 2^16 strings with 1 char.

There are 2^32 strings with 2 chars.

There are 2^48 strings with 3 chars.

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.


since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the hashcodes in like the long type instead of int?
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

drac yang wrote:
since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the hashcodes in like the long type instead of int?


It still wouldn't even be remotely sufficient for relatively short strings, by many orders of magnitude. And if you have an object with, say, two Strings and a Date (e.g. private String firstName ; private String lastName; private Date birthDate;), then what?

And hashCode is so commonly used and so firmly entrenched they're not going to break backward compatibility at this point.

But lets say we start fresh with a new Java. There's still little or no benefit to making hashCode a long. HashCode determines which bucket an object goes into. That means we can have up to 2^32 buckets. Currently, hash based data structures have to re-hash the 4-byte hashCode to a smaller number, because clearly we're not going to create an array of 4 billion entries for our buckets. When multi-terabyte RAM computers are common, and we're storing hundreds of billions of objects in HashMaps, then maybe an 8-byte hashCode would make sense.

I suppose there could even be specialized situations--"big data" type stuff--where it could be useful now. But those situations call for a different tool than Java, or at least writing your own hashing facility, not for breaking backward compatibility in today's Java.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7057
    
  16

drac yang wrote:since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the
hashcodes in like the long type instead of int?

No. You're missing the point. Hashcodes are probablistic - that is, a good one is supposed to provide a good chance of being different if the underlying object is different.

An int has 4 billion possible values. Given that the chances of you ever dealing with more than a few thousand objects at a time is small, an int is perfectly adequate for most situations.

Remember: the idea of a hashcode is not to never return the same value for different objects; just to make it unlikely. As far as Java hashed collections are concerned, the rest of the algorithm is dealt with by equals() (and hence the tie-in between the two methods).

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Artlicles by Winston can be found here
drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 60
Winston Gutkowski wrote:
No. You're missing the point. Hashcodes are probablistic - that is, a good one is supposed to provide a good chance of being different if the underlying object is different.

An int has 4 billion possible values. Given that the chances of you ever dealing with more than a few thousand objects at a time is small, an int is perfectly adequate for most situations.

Remember: the idea of a hashcode is not to never return the same value for different objects; just to make it unlikely. As far as Java hashed collections are concerned, the rest of the algorithm is dealt with by equals() (and hence the tie-in between the two methods).

Winston

yeah, i agree especially on for most situations, problem set scale will be far less than 4 billion.

like the indexFor method in HashMap shows:


only if when the problem set grows, say, close to or more than 4 billion, at that time,
for the sake of optimized loadfactor, people might begin to consider changing to long hashcode(if without the impact of backward compatibility like jeff said).

good reply, thanks.
drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 60
Jeff Verdegan wrote:
It still wouldn't even be remotely sufficient for relatively short strings, by many orders of magnitude. And if you have an object with, say, two Strings and a Date (e.g. private String firstName ; private String lastName; private Date birthDate;), then what?

And hashCode is so commonly used and so firmly entrenched they're not going to break backward compatibility at this point.

But lets say we start fresh with a new Java. There's still little or no benefit to making hashCode a long. HashCode determines which bucket an object goes into. That means we can have up to 2^32 buckets. Currently, hash based data structures have to re-hash the 4-byte hashCode to a smaller number, because clearly we're not going to create an array of 4 billion entries for our buckets. When multi-terabyte RAM computers are common, and we're storing hundreds of billions of objects in HashMaps, then maybe an 8-byte hashCode would make sense.

I suppose there could even be specialized situations--"big data" type stuff--where it could be useful now. But those situations call for a different tool than Java, or at least writing your own hashing facility, not for breaking backward compatibility in today's Java.


yeah, besides the backward compatibility issue(but why can't those legacy code use old jdk for maintenance?), another issue would be space waste, right? for most situations problem set is way under 4 billion.
good reply, thanks.
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

drac yang wrote:yeah, besides the backward compatibility issue(but why can't those legacy code use old jdk for maintenance?)


You're missing the point. I might get Java 8 as soon as it's available, and build my app with it from day 1. If Java 9 changes hashCode() to a long, then I cannot use it without changing all my code--and all the code in every library I use. It's not just old stuff that breaks. It's anything from before the change.

another issue would be space waste, right?


Not really. It's only 4 more bytes, which is at least an order or magnitude less than a lot of the objects it will be generated from anyway, and memory is cheap, and it's not like we typically keep large numbers of hashcode values around anyway. There are several compelling reasons not to change to an 8-byte hashCode(). Saving 4 bytes is not one of them.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: two string with same hashcode
 
Similar Threads
hashSet
How many objects are created?
Java HashCode Problem
How to convert String to int ?
why hashcode same for String not for Class Object?