I've got a small list of keywords that I need to search through to see if the word I've just parsed is a keyword. Something like:
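(The original snippet didn't survive, but a minimal sketch of that kind of check might look like the following; the keyword list here is made up for illustration.)

```java
public class KeywordChecker {
    // Hypothetical keyword list; the real one isn't shown in the post.
    private static final String[] KEYWORDS = {
        "if", "else", "while", "for", "return", "break"
    };

    // Plain linear search through the array.
    static boolean isKeyword(String word) {
        for (String k : KEYWORDS) {
            if (k.equals(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isKeyword("while"));  // true
        System.out.println(isKeyword("banana")); // false
    }
}
```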
Clearly, if the number of entries (N) is small, and if the number of words to check (T) is small, then it's simpler and faster to just do a linear search through the list.
Seems to me (without doing any testing) that even for fairly large values of T, as long as N is very small (say 6), there is no point in using something fancier such as a HashSet. In Big-O terms, while a linear search is O(N) per lookup and the hash lookup is O(1), there are lots of constants to look at, and for the HashSet the constants are fairly large compared to nearly nothing in the boring linear search.
Does anyone have any rules of thumb for when it's worth thinking about fancier approaches?
Or is this yet another time when you should ignore performance until it's a problem?
Well, the absolute worst case for the HashSet is if every one of your constants hashed to the same bucket; then, to check for membership, the HashSet would get the target's hashCode(), do a mod operation to choose the bucket, then linear-search through the bucket. So the worst case is definitely worse than the naive linear search.
But the chance of that worst case happening is vanishingly small. In a very common case, you'd probably have one bucket with two entries, and the other ones just singletons. Then, to find the target, the HashSet finds the hashCode() of the target (Strings cache computed hashCode values, by the way), does the mod operation to find the bucket, and does one equals() to verify membership. Intuitively, that's three method calls, so it's roughly comparable cost-wise to a linear search of a three-element array. That's ignoring some subtleties, like the fact that mod is expensive, but also ignoring the loop overhead of the naive linear search. So still, I'd call it a wash.
It seems to me that as long as the number of targets is large enough to justify the expense of initializing the HashSet, then if there are more than three keywords, the HashSet is worth it. How many targets is that? Well, you have to allocate about one bucket for each key, and compute the hash codes of the keys... so maybe the number of targets has to be three times the size of the keyword array to justify the expense of setting up the HashSet. After that, it's all gravy.
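To make the two lookup strategies concrete, here's a side-by-side sketch (the keyword list is invented for illustration): the linear scan pays O(N) per check, while the HashSet pays a one-time setup cost and then one hashCode(), one bucket index, and typically one equals() per check.

```java
import java.util.HashSet;
import java.util.Set;

public class LookupComparison {
    private static final String[] KEYWORDS = {
        "if", "else", "while", "for", "return", "break"
    };

    // One-time setup cost: hash each key and place it in a bucket.
    private static final Set<String> KEYWORD_SET = new HashSet<>();
    static {
        for (String k : KEYWORDS) {
            KEYWORD_SET.add(k);
        }
    }

    // O(N) per lookup, but with tiny constants and no setup cost.
    static boolean linearContains(String word) {
        for (String k : KEYWORDS) {
            if (k.equals(word)) {
                return true;
            }
        }
        return false;
    }

    // O(1) expected per lookup: hashCode(), bucket choice, one equals().
    static boolean hashContains(String word) {
        return KEYWORD_SET.contains(word);
    }
}
```

Both return the same answers; the only question, as above, is whether enough lookups happen to amortize building the set.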
With a small set of words, you will almost certainly not see a noticeable difference in speed, but why are you trying to avoid using the "fancier" Set, exactly?
It doesn't cost more from a code perspective: one extra line to create the set, but then your checks are now a single method call, so you save a loop on each check.
The code is more readable, and more scalable in case your word list grows. It will use a few more bytes of memory, but again, with the numbers we are talking about, the difference in both speed and size is infinitesimal.
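A sketch of that "one extra line" version (again with an illustrative word list): the set is built once, and every check shrinks from a loop to a single method call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordSet {
    // One extra line to create the set...
    private static final Set<String> KEYWORDS = new HashSet<>(
        Arrays.asList("if", "else", "while", "for", "return", "break"));

    // ...and each check is now a single method call, not a loop.
    static boolean isKeyword(String word) {
        return KEYWORDS.contains(word);
    }
}
```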
A library of sorts and searches released long ago by PR1ME Computer (they made minicomputers used by NASA and Ford's design team) had a cut-off point. If the list was 10 items or less, they reverted to a linear scan.
Which brings up a good point. If the vendor supplies a search function, use it. It's possible that it was optimized for the target hardware, so you can have portable and optimal at the same time.
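In Java's case, one such library-supplied search is Arrays.binarySearch, which works on a sorted array; a sketch (keyword list invented, and note the array must be pre-sorted):

```java
import java.util.Arrays;

public class LibrarySearch {
    // Arrays.binarySearch requires the array to be sorted.
    private static final String[] KEYWORDS = {
        "break", "else", "for", "if", "return", "while"
    };

    // binarySearch returns a non-negative index on a hit,
    // and a negative insertion point on a miss.
    static boolean isKeyword(String word) {
        return Arrays.binarySearch(KEYWORDS, word) >= 0;
    }
}
```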