File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Performance and the fly likes Has anyone bothered to develop rules of thumb for HashSet vs TreeSet Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Android Security Essentials Live Lessons this week in the Android forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Has anyone bothered to develop rules of thumb for HashSet vs TreeSet" Watch "Has anyone bothered to develop rules of thumb for HashSet vs TreeSet" New topic
Author

Has anyone bothered to develop rules of thumb for HashSet vs TreeSet

Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

I've just listened to a nice MIT lecture on algorithms, where they describe Red/Black balanced trees. This is the algorithm that the Sun Javadocs say is the implementation for the TreeSet/TreeMap code. It nice because its Big-Oh( ( ln N).

I generally use HashSet/HashMap, which is obviously Big-Oh( 1 ) but that ignores the constants. I expect that the constants are not tiny for most implementations of hashCode() and the management.

Most of the time (maybe nearly always) there is going to be no practical impact.

But my curiosity is raised. Has anyone looked at this enough to come up with some rules of thumb when its better to use TreeMap rather that HashMap?
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15962
    
  19

Unless my wits are scrambled (again), a HashMap offers non-collated data. A TreeMap is an extension of that paradigm that permits element access in a fixed and pre-defined order. So the rule is that if you intend to iterate over a Hash and you want a specific sequence, use the TreeMap. If any old order will do, save the overhead that ordering the iteration results incurs and use the HashMap.


Customer surveys are for companies who didn't pay proper attention to begin with.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

Yes, they define different access methods. And the TreeMap requires that you implement Comparable for the objects.

But I was asking in the abstract. Its clear that calculating and manipulating a Hash takes time. For example, on a String, it looks at every character. The Compare operation can look at as few as one character. So I was wondering for what numbers N does O(1) > O(ln N)
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15962
    
  19

Although this may not be true for all JVMs, I think the hash value may be a builtin part of the String object in the Sun JVM. It's immutable, so if the storage-versus-CPU tradeoff is worth it, that's practical. In any event, I've never heard anyone complain about the performance. Since the JVM does things to optimize how it handles - and replicates - strings, the cost may very well be paid back long before you started working with collections.

A TreeMap is built on top of a HashMap, so it has the overhead of a HashMap plus the overhead of the ordering mechanism. A TreeSet according to the JavaDocs is built on top of a Tree[b]Map[/i], so it would likewise carry HashMap overhead.

Since collections are so ubiquitous and so criticial, considerable work has been done to make them efficient by Java's designers. Although as a general rule, you'll do best using the simplest mechanism required for the task, the ultimate test is how it performs under your actual production load. In programming, as in philosphy/politics, making major decisions on what "should" be true instead of determining what actually is true can lead to unfortunate results.
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

Tim Holloway wrote:A TreeMap is built on top of a HashMap, so it has the overhead of a HashMap plus the overhead of the ordering mechanism.

As Pat said, TreeMap is implemented using a Red-black tree. It doesn't have any relation to HashMap, except that they both extend AbstractMap.

Note that the Sun implementation of String caches the hash code the first time it is calculated, so after the first time, the call to hashCode() takes constant time.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15962
    
  19

Hmmm. Time to see the optometrist, it seems. What I read in the JavaDocs last time and what I read this time aren't the same. But this time, what I read concurs with what you said.

Still, it's usually going to be less overhead to calculate a hash than to balance a tree (YMMV).
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

Tim Holloway wrote:Still, it's usually going to be less overhead to calculate a hash than to balance a tree (YMMV).


There is more to a hash structure than just calculating the hashCode, you have to handle confilct/collision, resize the map and perhaps change the mapping function as the hash set fills.

The beauty of a red/black tree is that its always balanced, always O (ln N) and fixups are really fast, localized and easy. Its really a very clever structure.

I started out with a bias similar to Tim's, but as I learn more about Red/Black trees, the more impressed I am.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15962
    
  19

Pat Farrell wrote:
Tim Holloway wrote:Still, it's usually going to be less overhead to calculate a hash than to balance a tree (YMMV).


There is more to a hash structure than just calculating the hashCode, you have to handle confilct/collision, resize the map and perhaps change the mapping function as the hash set fills.

The beauty of a red/black tree is that its always balanced, always O (ln N) and fixups are really fast, localized and easy. Its really a very clever structure.

I started out with a bias similar to Tim's, but as I learn more about Red/Black trees, the more impressed I am.


The right tool for the job. In actuality, most Hashmap usage doesn't get that sophisticated. The hashtable size is generally fixed, as is the hash algorithm. When the hash value is cached with the object, the overhead becomes minimal, assuming the object gets referenced multiple times. Hash collisions are generally managed via singly-linked lists, which can be traversed very rapidly, since often there's no more than 3 or 4 elements in even the longest overflow list and don't require a whole lot of logic to maintain. For absolute maximum access, you can calculate a "perfect hash", which is simply the art of designing the hash to eliminate collisions. There are a number of good articles on how to do this.

This isn't to say that I always use hashes and never use balanced trees. Hashes are often a lot less memory-efficient, for one thing. I've seen a number of databases where the selection of hash versus tree was selectable on a per-index basis.

There's no "best" answer, so do your homework, measure performance, and above all, stay flexible.

And, if it's not a performance-critical part of the system, don't agonize too much over whether you made the "right" choice. Software offers enough agony without making up any more.
 
Consider Paul's rocket mass heater.
 
subject: Has anyone bothered to develop rules of thumb for HashSet vs TreeSet
 
Similar Threads
Sorting question
What is better performance HashMap or a TreeSet
how can be retrieve from any collection into our own order?
If set is interface then how it is instantiated:
rule of thumb for sequential list searching