wood burning stoves 2.0*
The moose likes Performance and the fly likes Hey, I think I created the Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Murach's Java Servlets and JSP this week in the Servlets forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Hey, I think I created the "almost" perfect hash function" Watch "Hey, I think I created the "almost" perfect hash function" New topic
Author

Hey, I think I created the "almost" perfect hash function

Jeronimo Backes
Greenhorn

Joined: Sep 20, 2004
Posts: 29
I was fiddling with methods of validating the contents of Strings using checksums. And ended up combining some stuff that already exist in the SDK. For those interested here is the code:

I would also like to hear from you if you tested and tried to find any issue with it.

}

I've run many tests with random strings of random length and could not find a collision yet. This is one of the outputs I saved:

After 50,576,157 loops, Checksum got 0 duplicates( 0.0%)
After 50,576,157 loops, Hash got 252,668 duplicates( 0.026834579244767346%)


The output using random strings of fixed length is found a few collisions:
After 50,000,000 loops, Checksum got 2 duplicates( 4.0E-6%)
After 50,000,000 loops, Hash got 289750 duplicates( 0.5795%)

My tests generated so many strings that I had to run them with -Xms12G -Xmx12G.

The art of being crazy is to NEVER commit the crazyness of being normal.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

What are your definitions of the criteria for "perfect"

If you are just trying to implement good values for hashCode(), using a CRC is serious overkill. An important criteria for most real world applications is that the calculation of the hash has to be fast. If you take too long, you could just sequentially search the list rather than hashing into it. Just because its Big Oh(1) doesn't mean its actually useful.
Jeronimo Backes
Greenhorn

Joined: Sep 20, 2004
Posts: 29
Pat Farrell wrote:What are your definitions of the criteria for "perfect"

If you are just trying to implement good values for hashCode(), using a CRC is serious overkill. An important criteria for most real world applications is that the calculation of the hash has to be fast. If you take too long, you could just sequentially search the list rather than hashing into it. Just because its Big Oh(1) doesn't mean its actually useful.


Perfect hash functions are the ones that won't map two or more inputs into the same value. It means there is no possibility of collisions. These functions need to know the possible inputs in advance (e.g. EnumMap and EnumSet). This is not viable when using strings.

The usage of CRC in the code I've posted is limited to very short strings.

Creating a hash for 30,000,000 strings with random length (between 1 and 30) and random characters took 5 seconds using my method whereas String.hashCode took 2.5 seconds.

String.hashCode is twice as fast, but produces much more collisions in a massive range of random inputs.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

Jeronimo Backes wrote:Perfect hash functions are the ones that won't map two or more inputs into the same value. It means there is no possibility of collisions.


What design space cares about this? A normal hashCode() implementation may have identical values, and all the usual hashSet/Map/Tree/.... handle it cleanly. If the number of collisions "small" relative to the number of entries in the hashset/map/tree, life is good.

What is the point of your attempt at a perfect hash function?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Jeronimo Backes wrote:I've run many tests with random strings of random length and could not find a collision yet.


Then that tells me that your test strings are not random, and neither are their lengths. The number of Strings is so much larger than the number of ints that it isn't possible to make a hash function for Strings which is anywhere near perfect.

The maximum length of a String is 2^31-1. So if you had chosen lengths randomly, then half of the lengths you chose should have been greater than half of that number -- which is 2^30-1. You didn't do that, did you? And likewise I'm willing to bet that the chars you chose randomly to put into your Strings weren't random chars at all -- remember that there are 2^16 possible values for a char.
Jeronimo Backes
Greenhorn

Joined: Sep 20, 2004
Posts: 29
What design space cares about this?

Dealing with huge sets/maps/hash tables where the number of collisions influence the performance of fetch operations.
Creating checksums to verify whether or not information has changed over time.

What is the point of your attempt at a perfect hash function?

It is not possible to create a perfect hash function for strings as the possible combinations is exceeds any reasonable numeric representation. The perfect hash of a string would be the entire contents of the String itself.
My intention here is to find a more reliable method of creating hashes for Strings where collisions are extremely unlikely on the execution space of a program that has to deal with billions of different Strings.

The maximum length of a String is 2^31-1. So if you had chosen lengths randomly, then half of the lengths you chose should have been greater than half of that number -- which is 2^30-1. (...) there are 2^16 possible values for a char.

Yep, there is no way to create a unique number to represent each combination if we are limited to Long.MAX_VALUE only. The intention here is to reduce the likelihood of producing the same number for different Strings during the execution of a program to almost 0.

The code I presented does that and in 50M strings it gets 2 collisions in average when using fixed-length strings, whereas hashcode alone produces almost 300K collisions. CRC32 alone was even worse than hashCode.
In many independent runs of 50M strings with length = Math.random() * 1000 I still couldn't get a single duplicate.

My method is not a "perfect" hash of course (it is impossible to have it as discussed), but it does a much better job in avoiding collisions compared to Java's hashCode and CRC32.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

Have you tried SHA1 as a hash? Its probably as slow as your CRC, but its a cryptographically strong hash, meaning that its proven to have good ransomness in its results. You can simply truncate the long result into whatever length you want to store.

I'm not seeing why anyone would want to use your hash function on 50 million strings. There are far more space and performance dense data structures, this is a well understood field. Many many PhD thesis topics cover this area.
Jeronimo Backes
Greenhorn

Joined: Sep 20, 2004
Posts: 29
Pat Farrell wrote:Have you tried SHA1 as a hash? Its probably as slow as your CRC, but its a cryptographically strong hash, meaning that its proven to have good ransomness in its results. You can simply truncate the long result into whatever length you want to store.


Yep, SHA1 is reliable and so far I could not get collisions, but it is much slower. It took 25 seconds to create a hash for 30M strings. But thank you for the suggestion, it can be very useful.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7492
    
  18

Jeronimo Backes wrote:My method is not a "perfect" hash of course (it is impossible to have it as discussed), but it does a much better job in avoiding collisions compared to Java's hashCode and CRC32.

Hate to poke another hole in your theory, but all hash collections actually mangle your hash again in order to return a 'bucket index' for the current size. Have you actually proved that it produces fewer bucket collisions? I suspect it may do, but maybe not as much as you think.

Personally, I'd prefer to see some metrics that actually time 50,000,000 Strings rammed into a HashSet.

Like the old mantra says: Good/Fast/Cheap - pick any two.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
 
 
subject: Hey, I think I created the "almost" perfect hash function
 
Similar Threads
CRC algorithm
Checksum crc 16 modbus
Info on UDP checksums
About Adler 32 checksum
about actualize the CRC witn java