File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Performance and the fly likes rule of thumb for sequential list searching Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of EJB 3 in Action this week in the EJB and other Java EE Technologies forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "rule of thumb for sequential list searching" Watch "rule of thumb for sequential list searching" New topic
Author

rule of thumb for sequential list searching

Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4636
    
    5

I've got a small list of keywords that I need to seach through to see if the word I've just parsed is a keyword. Something like:


Clearly, is the number of entries (N) is small, and if the number of words to check (T) is small, then its simplier and faster to just do a linear search through the list.

Seems to me (without doing any testing) that even for fairly large values of T, as long as N is very small (say 6) then there is no point in using something fancier such as a HashSet. In Big-OH terms, while a linear search is O(N) each, and the hash lookup is O(1) there are lots of constants to look at, and for the HashSet, the constants are fairly large compared to nearly nothing in the boriing linear search.

Does anyone have any rules of thumb for when its worth thinking about fancier approaches?

Or is this yet another time when you should ignore performance until its a problem?
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24166
    
  30

Well, the absolute worst-case for the HashSet is if every one of your constants hashed to the same bucket; then to check for membership the HashSet would get the target's hashCode(), do a mod operation to choose the bucket, then linear search through the bucket. So the worst case is definitely worse than the naive linear search.

But the chance of that worst case happening is vanishingly small. In a very common case, you'd probably have one bucket with two entries, and the other ones just singletons. Then to find the target, the HashSet finds the hashCode() of the target (Strings cache computed hashCode values, by the way) does the mod operation to find the bucket, and does one equals() to verify membership. Intuitively, that's three method calls, so it's roughly comparable cost-wise to a linear search of a three-element array. That's ignoring some subtleties, like the fact that mod is expensive, but also ignoring the loop overhead of the naive linear search. So still, I'd call it a wash.

It seems to be that as long as the number of targets is large enough to justify the expense of initializing the HashSet, then if there are more than three keywords, the HashSet is worth it. How many target is that? Well, you have to allocate about one bucket for each key, and compute the hash codes of the keys... so maybe the number of targets has to be three times the size of the keyword array to justify the expense of setting up the HashSet. After that, it's all gravy.


[Jess in Action][AskingGoodQuestions]
Dmitri Bichko
Greenhorn

Joined: Jun 16, 2007
Posts: 15
With a small set of words, you will almost certainly not see a noticeable difference in speed, but why are you trying to avoid using the "fancier" Set, exactly?

It doesn't cost more from a code perspective: one extra line to create the set, but then your checks are now a single method call, so you save a loop on each check.

The code is more readable, and more scalable in case your word list grows. It will use a few more bytes of memory, but again, with the numbers we are talking about, the difference in both speed and size is infinitesimal.

This just doesn't seem that onerous to me:
Anirvan Majumdar
Ranch Hand

Joined: Feb 22, 2005
Posts: 261
Dmitri Bichko wrote:

Just a word of caution about that usage - Arrays.asList will return a fixed-sized list, so you won't be able to add, remove or clear that list's value[s].
Dmitri Bichko
Greenhorn

Joined: Jun 16, 2007
Posts: 15
Anirvan Majumdar wrote:
Dmitri Bichko wrote:

Just a word of caution about that usage - Arrays.asList will return a fixed-sized list, so you won't be able to add, remove or clear that list's value[s].


Well sure, but it's just used for initialization here, you obviously can't manipulate the list.

Actually, I didn't realize that asList() takes a dynamic arg list, so the array is unnecessary:
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Dmitri wrote:Actually, I didn't realize that asList() takes a dynamic arg list, so the array is unnecessary:

Yup. You can streamline this even more using Google Collections:

They have many factory methods to eliminate redundant generic params and unnecessary Arrays.asList() calls.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4636
    
    5

Thanks for the pointer to Google's Java Collections. I'm using it in the code I'm writing, because you pointed it out to me.

I'm not that big a fan of using one line instead of three or four, but I really like their ImmutableFoo collections, for Foo == Set/Map/List

And they have some nice idioms for initializing Map structures that are well suited for exactly my use case here. i.e.

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
    
    9
Another convert, yay!

I also just came across the final list of Project Coin changes accepted for JDK 7. (Assuming no implementation problems arise.) From this, in the future we should be able to do something like this:

The resulting map will be immutable; whether it's typed as an ImmutableMap will depend on how much Google Collections gets integrated into the JDK.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15632
    
  15

A library of sorts and searches released long ago by PR1ME Computer (they made minicomputers used by NASA and Ford's design team) had a cut-off point. If the list was 10 items or less, they reverted to a linear scan.

Which brings up a good point. If the vendor supplies a search function, use it. It's possible that it was optimized for the target hardware, so you can have portable and optimal at the same time.


Customer surveys are for companies who didn't pay proper attention to begin with.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: rule of thumb for sequential list searching
 
Similar Threads
New sort algorthim
Array problem
big O notation
ArrayList vs LinkedList?
Vector operation implementation why JDK preferred arrays over linkedlist