[Tutor] Binary search question
Steven D'Aprano
steve at pearwood.info
Sat Apr 24 02:37:15 CEST 2010
On Sat, 24 Apr 2010 09:11:21 am Robert Berman wrote:
> Wow. I feel I have just been b…h slapped across the face.
There's no need to be so sensitive.
> I think
> Hugo’s test results pretty much confirmed ‘in’ is not the way to go
If that is the *only* thing you have learned from this email thread,
then you have failed to understand what folks are trying to say.
Whether that is your failure or our failure is another question :)
The most important thing is that the right solution depends on what your
data is and what you intend doing with it.
If your data is unordered, and all you do is membership testing, then
using the `in` operator with a set is the right solution. Membership
testing of sets is (approximately) O(1), which is much better than
O(log N) for binary search and MUCH better than O(N) for linear search.
If you need a list, and you have the opportunity to control the format
of your data, then you can keep the data in a list *and* a set,
carefully keeping the two in sync. A class is probably good for that.
Use the list for list-like things, and the set for membership testing,
and that gives you the best of both worlds.
If your data is sorted, then binary search may be the right approach.
But if your list is small enough, then the extra overhead of binary
search will actually make it *slower* than a linear search.
If your data needs to be unsorted (say, you have to store it in the
order it is delivered to you) then binary search is *not* suitable. In
the worst case, you may have no choice but a linear search: copying the
list into a fresh set each time is wasteful, or the list might include
objects which are mutable and can't go into sets or dicts. Fortunately,
Python lists are written in C and so even a linear search using `in` is
pretty fast.
But probably the best advice given is, don't worry about optimizing your
code until you *know* it needs optimizing. Good advice for searching a
list with ten million items is *at best* pointless for searching a ten
item list, and at worst actually harmful. It's hard to predict where
bottlenecks are at the best of times, and particularly hard for people
used to languages like C or Java. Text books that give optimization
advice are often counter-productive when it comes to Python code. For
instance, they will often assume that comparisons are cheap and moving
data is relatively expensive, which is true for low-level languages
like C. But for Python it's the other way around, comparisons are
relatively expensive and moving data relatively cheap.
I've actually seen somebody do something like this in the name
of "efficiency":
def issorted(alist):
"""Helper function returning true if alist is already sorted"""
if len(alist) == 0:
return True
x = alist[0]
flag = True
for item in alist:
if item < x:
flag = False
break
x = item
return flag
def bin_search(alist, what):
if not issorted(alist):
# make a copy of the list and work with that
alist = alist[:] # Careful! Don't change the original!
alist.sort()
i = bisect.bisect(alist, what)
if i == 0:
return False
else:
return alist[i-1] == what
If you think carefully about what this is doing, it is hilarious! For
beginners who don't quite see, consider what happens if you give it a
sorted list. First it walks the entire list, to determine whether it is
sorted, which takes N steps. Then it does a binary search, which is
O(log N). Since the time is dominated by the O(N) test for sortedness,
it would be quicker to just do a linear search.
If the list is unsorted, first it walks some fraction of the list to
discover the fact, then it does a O(N) copy, plus an O(N*log N) sort,
before the binary search. And it does this *every single time*. Total,
utter fail.
True confession time: the person who wrote this ridiculous piece of code
was... me. My reasoning was that binary search was much faster than
linear search, so it was worth sorting the list to get that advantage.
Then I realised that I couldn't just sort the list, because other parts
of the code might rely on it being unsorted, hence I had to make a copy
first. Then I reasoned that I didn't need to copy it if it were already
sorted.
For the record, here's some timing benchmarks to see how badly it turned
out:
>>> from random import shuffle
>>> data1 = range(10000)
>>> data2 = range(10000)
>>> shuffle(data2)
>>>
>>> setup = """from __main__ import data1, data2, bin_search"""
>>> from timeit import Timer
>>> t1 = Timer("457 in data1", setup)
>>> t2 = Timer("457 in data2", setup)
>>> t3 = Timer("bin_search(data1, 457)", setup)
>>> t4 = Timer("bin_search(data2, 457)", setup)
>>>
>>> for t in (t1, t2, t3, t4):
... print min(t.repeat(number=1000))
...
0.0346097946167
0.461233854294
3.04955101013
5.70547604561
For comparison, here's the timing on a plain binary search:
There, I have now exposed my secret shame to the whole world. Please
don't hate me.
*wink*
--
Steven D'Aprano
More information about the Tutor
mailing list