I have a very big unordered collection (10,000-50,000) and I need a very fast search method using their int id. Is a hashtable (with Integer key) the best method to use? Or there is a better alternative?
Thanks, Mac [ November 06, 2008: Message edited by: Siamak Saarmann ]
PhD Candidate: Distributed and Parallel Systems, Simulation and Modeling
is the collection going to grow at some point? will it eventually become a million or a billion or more elements? How often are elements added or deleted?
Also, how often do you need to do this? if you are going to search for an element millions of times, you may do better to take the performance hit and sort it once (high initial cost) so that all the later searches are quick (low subsequent cost).
However, if you are only going to search for a dozen or so items, it's not worth sorting.
How is the collection built? is it rebuilt on every run? can you build it once, sort it, and save that for subsequent use?
How you answer each of these questions can make a difference in what you do. There is seldom a 'best' way to do something in software design. You have to weigh all the variables and decide what is most important... and then go from there.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Joined: Aug 21, 2004
Actually the collection is the list of agents in a simulation software. In each time step each agent will perform its actions, however it needs to look at several other agents (at least 5) to be able to decide. So I need 50,000 * 5 searches for each time step.
Considering that I need to simulate hundred thousands of time steps I will need 100,000 * 50,000 * 5 searches in each simulation.
@Dmitri : I already use colt for random numbers, thank you for mentioning the capability of Colt. I never looked at other features of it.
@campbell and arulk: thanks for suggestions. I guess experimenting will give me the best choice.
@fred : Agents come and go (added and removed) and the size of the collection is not fixed (a very dynamic system is being simulated).
Joined: Aug 21, 2004
Just wanted to give a feedback on this.
For the simplicity I used HashTable with String keys (integer agent id's are converted to string and used as key) and even though this is not a professional work (I should rather use Integer keys) I was able to get from 8 to 80 times (depending on the number of agents in my collection) speedup.
It means a simulation which takes 80 seconds with linear search will only need 10 seconds or less now.
I was thinking if I work on the performance (in every section of the 20,000-30,000 lines of code software) I can make huge difference.
I wrote this on HashMap behaviour a while back and should contain the data you want. It uses String keys rather than Integers, but given immutable keys with cached hash values, the insert, rehash and read times are pretty darn good.