[Tutor] Binary search question

Robert Berman bermanrl at cfl.rr.com
Sat Apr 24 01:11:21 CEST 2010


From: tutor-bounces+bermanrl=cfl.rr.com at python.org
[mailto:tutor-bounces+bermanrl=cfl.rr.com at python.org] On Behalf Of Ricardo
Aráoz
Sent: Friday, April 23, 2010 6:33 PM
To: Hugo Arts
Cc: tutor at python.org; Emile van Sebille
Subject: Re: [Tutor] Binary search question

Hugo Arts wrote: 
On Fri, Apr 23, 2010 at 11:33 PM, Emile van Sebille <emile at fenx.com> wrote:
  
On 4/23/2010 2:21 PM Alan Gauld said...
    
"Emile van Sebille" <emile at fenx.com> wrote
      
It's expensive enough that for a list this size I'd convert it to a
dict and use 'in' on that, e.g.

a = range(100000)
d = dict(zip(a,a))
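
(For illustration of what that buys you: once the dict is built, a membership
test is a hash lookup rather than a linear scan of the list. The value 99999
below is just an arbitrary example.)

print(99999 in d)  # ~O(1) average-case hash lookup instead of an O(n) list scan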

        
Surely that would depend on how often you do the search? If it's a one-off
occurrence I'd expect the overhead of zipping and converting to a dict to
outweigh the savings.
      
Oh sure, but in practical terms, if it's a one-off situation, who cares which
you use? For a one-off I'd just use 'in' and not be concerned.

    

I guess Alan missed my second e-mail with the micro-benchmarks, which
nuances my first answer quite a bit and makes all the points he is
making. That does not make him any less correct, of course. A set (I used
a frozenset, but either works) is faster than a dict, and in a one-off
you're best off using bisect.
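
(For reference, a membership test with bisect on a sorted list looks roughly
like the sketch below; this is an illustration, not Hugo's actual benchmark
code, and the name 'contains' is made up for the example.)

import bisect

def contains(sorted_list, value):
    # O(log n) membership test on an already-sorted sequence.
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

a = range(100000)           # already sorted
print(contains(a, 99999))   # True
print(contains(a, 100000))  # False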

For completeness' sake: on a 10,000-item list, using the in operator
takes *in the worst case* around 7 seconds; bisect, again in the worst
case, takes only around 0.01 seconds (that's on a Core 2 Duo T5300 @
1.7 GHz, 2 GB RAM). That's quite a saving. Those 7 seconds can easily
be 99% of the execution time of a typical script, so for a sufficiently
large data set it can definitely pay off to be concerned, even with a
one-off.
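
(One rough way to run that kind of comparison yourself is with timeit; the
list size, target value, and repeat count below are just illustrative, not
Hugo's exact setup.)

import timeit

setup = "import bisect; data = list(range(10000)); target = 9999"
# Linear scan with 'in' versus a binary search with bisect, 10000 lookups each.
print(timeit.timeit("target in data", setup=setup, number=10000))
print(timeit.timeit("bisect.bisect_left(data, target)", setup=setup, number=10000))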

Of course, profiling will immediately catch that kind of performance
bottleneck. So even if you care about performance, you can start off
using 'in' and easily optimize later with bisect or a set; the whole
"don't optimize it until you need to" thing.

Hugo

In the same vein of completeness, and since this is the *tutor* list, we
should stress that the 'right' approach is to use 'in' whatever the case, and
only if the system is too slow and profiling shows that 'in' is the culprit
should we look for alternatives.
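
(For anyone following along, one way to check whether 'in' really is the
culprit is Python's built-in profiler; a minimal sketch, where 'find_all' is
a hypothetical function standing in for your own code.)

import cProfile

def find_all(haystack, needles):
    # Linear 'in' scans of a list; this is what would dominate the profile.
    return [n for n in needles if n in haystack]

data = list(range(10000))
cProfile.runctx("find_all(data, range(0, 10000, 7))", globals(), locals())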

Wow. I feel I have just been b***h slapped across the face. I think Hugo's
test results pretty much confirmed that 'in' is not the way to go, although it
is one of the methods I am trying, even though my tendency, since the number
of elements is always less than 1,000,000, is to use the dictionary approach.

But even though my years of experience with Python number fewer than four, I
would be reluctant to use 'in', just based on what I have been reading from
those who took the time to answer my post. Just my $0.02 worth.

Robert Berman

What you don't see with your eyes, don't invent with your mouth.




