[Tutor] Binary search question

Sat Apr 24 01:19:53 CEST 2010

On Sat, 24 Apr 2010 07:21:13 am Alan Gauld wrote:
> "Emile van Sebille" <emile at fenx.com> wrote
>
> > It's expensive enough that for a list this size I'd convert it to a
> > dict and use in on that.  eg,
> >
> > a = range(100000)
> > d = dict(zip(a,a))
>
> Surely that would depend on how often you do the search?
> If its a one off occurence I'd expect the overhead of zipping
> and converting to a dict would outweight the savings?

It absolutely has to, because zipping it has to iterate over the entire 
list, then calling dict has to iterate over the entire zipped version. 
That's iterating over the entire list *twice* before you even START 
doing a search!

In Python 3.x, zip is a lazy iterator so that will reduce the excess 
iterations from twice to once, but for a one-off test it entirely 
negates the point of converting.

> If the search was inside a loop however then I'd definitely
> agree. Although I'd opt for a set rather than a dict...

Yes, there's no point in making a dict {a:a} just for membership testing 
when you can just use a set.

> Another option would be to use the bisect module on a
> sorted version of the list.

But keep in mind that unless the list is HUGE, or your Python version 
includes the C version of bisect, a linear search using in may end up 
being faster than a slow pure-Python version of bisect.

Also, bisect on its own doesn't do membership testing:

>>> data = range(0,100,2)
>>> 7 in data
False
>>> bisect.bisect(data, 7)
4

So you have to weigh up the extra complication needed versus the 
optimization.

Another strategy is to move items closer to the front of the list each 
time they are accessed, so that commonly used items eventually bubble 
up to the front of the list and searches for them are fast.

And finally, if the list is very large, and your searches tend to be 
clustered, it becomes wasteful to do a fresh binary search each time 
you look something up. (E.g. consider looking up these three words in 
the dictionary, one after the other: "caravan", "cat", "cap".) In this 
case, a good strategy is sometimes called "hunt and search". The 
algorithm is something like this:

Each search takes a second argument, i, the place to start searching 
from. This will usually be the place the previous search ended at.

First you *hunt* for the item: try to bracket the item you want between 
i and i+1, then i and i+2, then i and i+4, i+8, and so forth, doubling 
the size of the bracket each time. (This assumes that the value at 
index i was too small. If it were too large, you hunt in the opposite 
direction, with i-1 to i, i-2 to i, etc.)

Once you have bracketed the item you are searching for, you *search* for 
it within those limits, using binary search or even a linear search. If 
your searches are clustered, most of the hunt phases will be short and 
even linear search will be fast, rather than doing a binary search over 
the full list each time.

-- 
Steven D'Aprano