[Tutor] Binary search question

Sat Apr 24 02:37:15 CEST 2010

On Sat, 24 Apr 2010 09:11:21 am Robert Berman wrote:
> Wow. I feel I have just been b…h slapped across the face.

There's no need to be so sensitive.

> I think 
> Hugo’s test results pretty much confirmed ‘in’ is not the way to go

If that is the *only* thing you have learned from this email thread, 
then you have failed to understand what folks are trying to say. 
Whether that is your failure or our failure is another question :)

The most important thing is that the right solution depends on what your 
data is and what you intend doing with it.

If your data is unordered, and all you do is membership testing, then 
using the `in` operator with a set is the right solution. Membership 
testing of sets is (approximately) O(1), which is much better than 
O(log N) for binary search and MUCH better than O(N) for linear search.

If you need a list, and you have the opportunity to control the format 
of your data, then you can keep the data in a list *and* a set, 
carefully keeping the two in sync. A class is probably good for that. 
Use the list for list-like things, and the set for membership testing, 
and that gives you the best of both worlds.

If your data is sorted, then binary search may be the right approach. 
But if your list is small enough, then the extra overhead of binary 
search will actually make it *slower* than a linear search.

If your data needs to be unsorted (say, you have to store it in the 
order it is delivered to you) then binary search is *not* suitable. In 
the worst case, you may have no choice but a linear search: copying the 
list into a fresh set each time is wasteful, or the list might include 
objects which are mutable and can't go into sets or dicts. Fortunately, 
Python lists are written in C and so even a linear search using `in` is 
pretty fast.

But probably the best advice given is, don't worry about optimizing your 
code until you *know* it needs optimizing. Good advice for searching a 
list with ten million items is *at best* pointless for searching a ten 
item list, and at worst actually harmful. It's hard to predict where 
bottlenecks are at the best of times, and particularly hard for people 
used to languages like C or Java. Text books that give optimization 
advice are often counter-productive when it comes to Python code. For 
instance, they will often assume that comparisons are cheap and moving 
data is relatively expensive, which is true for low-level languages 
like C. But for Python it's the other way around, comparisons are 
relatively expensive and moving data relatively cheap.

I've actually seen somebody do something like this in the name 
of "efficiency":

def issorted(alist):
    """Helper function returning true if alist is already sorted"""
    if len(alist) == 0:
        return True
    x = alist[0]
    flag = True
    for item in alist:
        if item < x:
            flag = False
            break
        x = item
    return flag

def bin_search(alist, what):
    if not issorted(alist):
        # make a copy of the list and work with that
        alist = alist[:]  # Careful! Don't change the original!
        alist.sort()
    i = bisect.bisect(alist, what)
    if i == 0:
        return False
    else:
        return alist[i-1] == what

If you think carefully about what this is doing, it is hilarious! For 
beginners who don't quite see, consider what happens if you give it a 
sorted list. First it walks the entire list, to determine whether it is 
sorted, which takes N steps. Then it does a binary search, which is 
O(log N). Since the time is dominated by the O(N) test for sortedness, 
it would be quicker to just do a linear search.

If the list is unsorted, first it walks some fraction of the list to 
discover the fact, then it does a O(N) copy, plus an O(N*log N) sort, 
before the binary search. And it does this *every single time*. Total, 
utter fail.

True confession time: the person who wrote this ridiculous piece of code 
was... me. My reasoning was that binary search was much faster than 
linear search, so it was worth sorting the list to get that advantage. 
Then I realised that I couldn't just sort the list, because other parts 
of the code might rely on it being unsorted, hence I had to make a copy 
first. Then I reasoned that I didn't need to copy it if it were already 
sorted.

For the record, here's some timing benchmarks to see how badly it turned 
out:

>>> from random import shuffle
>>> data1 = range(10000)
>>> data2 = range(10000)
>>> shuffle(data2)
>>>
>>> setup = """from __main__ import data1, data2, bin_search"""
>>> from timeit import Timer
>>> t1 = Timer("457 in data1", setup)
>>> t2 = Timer("457 in data2", setup)
>>> t3 = Timer("bin_search(data1, 457)", setup)
>>> t4 = Timer("bin_search(data2, 457)", setup)
>>>
>>> for t in (t1, t2, t3, t4):
...     print min(t.repeat(number=1000))
...
0.0346097946167
0.461233854294
3.04955101013
5.70547604561

For comparison, here's the timing on a plain binary search:

There, I have now exposed my secret shame to the whole world. Please 
don't hate me.

*wink*

-- 
Steven D'Aprano