I've just listened to a nice MIT lecture on algorithms, where they describe Red/Black balanced trees. This is the algorithm that the Sun Javadocs say is the implementation for the TreeSet/TreeMap code. It nice because its Big-Oh( ( ln N).
I generally use HashSet/HashMap, which is obviously Big-Oh( 1 ) but that ignores the constants. I expect that the constants are not tiny for most implementations of hashCode() and the management.
Most of the time (maybe nearly always) there is going to be no practical impact.
But my curiosity is raised. Has anyone looked at this enough to come up with some rules of thumb when its better to use TreeMap rather that HashMap?
Unless my wits are scrambled (again), a HashMap offers non-collated data. A TreeMap is an extension of that paradigm that permits element access in a fixed and pre-defined order. So the rule is that if you intend to iterate over a Hash and you want a specific sequence, use the TreeMap. If any old order will do, save the overhead that ordering the iteration results incurs and use the HashMap.
Customer surveys are for companies who didn't pay proper attention to begin with.
Yes, they define different access methods. And the TreeMap requires that you implement Comparable for the objects.
But I was asking in the abstract. Its clear that calculating and manipulating a Hash takes time. For example, on a String, it looks at every character. The Compare operation can look at as few as one character. So I was wondering for what numbers N does O(1) > O(ln N)
Although this may not be true for all JVMs, I think the hash value may be a builtin part of the String object in the Sun JVM. It's immutable, so if the storage-versus-CPU tradeoff is worth it, that's practical. In any event, I've never heard anyone complain about the performance. Since the JVM does things to optimize how it handles - and replicates - strings, the cost may very well be paid back long before you started working with collections.
A TreeMap is built on top of a HashMap, so it has the overhead of a HashMap plus the overhead of the ordering mechanism. A TreeSet according to the JavaDocs is built on top of a Tree[b]Map[/i], so it would likewise carry HashMap overhead.
Since collections are so ubiquitous and so criticial, considerable work has been done to make them efficient by Java's designers. Although as a general rule, you'll do best using the simplest mechanism required for the task, the ultimate test is how it performs under your actual production load. In programming, as in philosphy/politics, making major decisions on what "should" be true instead of determining what actually is true can lead to unfortunate results.
Tim Holloway wrote:Still, it's usually going to be less overhead to calculate a hash than to balance a tree (YMMV).
There is more to a hash structure than just calculating the hashCode, you have to handle confilct/collision, resize the map and perhaps change the mapping function as the hash set fills.
The beauty of a red/black tree is that its always balanced, always O (ln N) and fixups are really fast, localized and easy. Its really a very clever structure.
I started out with a bias similar to Tim's, but as I learn more about Red/Black trees, the more impressed I am.
The right tool for the job. In actuality, most Hashmap usage doesn't get that sophisticated. The hashtable size is generally fixed, as is the hash algorithm. When the hash value is cached with the object, the overhead becomes minimal, assuming the object gets referenced multiple times. Hash collisions are generally managed via singly-linked lists, which can be traversed very rapidly, since often there's no more than 3 or 4 elements in even the longest overflow list and don't require a whole lot of logic to maintain. For absolute maximum access, you can calculate a "perfect hash", which is simply the art of designing the hash to eliminate collisions. There are a number of good articles on how to do this.
This isn't to say that I always use hashes and never use balanced trees. Hashes are often a lot less memory-efficient, for one thing. I've seen a number of databases where the selection of hash versus tree was selectable on a per-index basis.
There's no "best" answer, so do your homework, measure performance, and above all, stay flexible.
And, if it's not a performance-critical part of the system, don't agonize too much over whether you made the "right" choice. Software offers enough agony without making up any more.
subject: Has anyone bothered to develop rules of thumb for HashSet vs TreeSet