[Python-Dev] Re: heaps

Tim Peters tim.one@comcast.net
Tue, 06 May 2003 22:42:29 -0400


[David Eppstein]
>>> For fairness, it might be interesting to try another run of your test
>>> in which the input sequence is sorted in increasing order rather
>>> than random.

[Tim]
>> Comparing the worst case of one against the best case of the
>> other isn't my idea of fairness <wink>, but sure.

[David]
> Well, it doesn't seem any fairer to use random data to compare an
> algorithm with an average time bound that depends on an assumption of
> randomness in the data...anyway, the point was more to understand the
> limiting cases.  If one algorithm is usually 3x faster than the other,
> and is never more than 10x slower, that's better than being usually 3x
> faster but sometimes 1000x slower, for instance.

Sure.  In practice you need to know time distributions when using an
algorithm -- best, expected, worst, and how likely each is under a variety
of expected conditions.

> My Java KBest code was written to make data subsets for a half-dozen web
> pages (same data selected according to different criteria).  Of these
> six instances, one is presented the data in roughly ascending order, one
> in descending order, and the other four are less clear but probably not
> random.
>
> Robustness in the face of this sort of variation is why I prefer any
> average-case assumptions in my code's performance to depend only on
> randomness from a random number generator, and not arbitrariness in the
> actual input.  But I'm not sure I'd usually be willing to pay a 3x
> penalty for that robustness.

Most people aren't, until they hit a bad case <0.5 wink>.  So "pure"
algorithms rarely survive in the face of a large variety of large problem
instances.  The monumental complications Python's list.sort() endures to
work well under many conditions (both friendly and hostile) are a good
example of that.  In industrial real life, I expect an all-purpose N-Best
queue would need to take a hybrid approach, monitoring its fast-path gimmick
in some cheap way in order to fall back to a more defensive algorithm when
the fast-path gimmick isn't paying off.
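
Something along these lines, maybe -- a hypothetical sketch (the names and
thresholds are made up, and it sorts its buffer where a real KBest would
partition, just to keep the sketch short):

import heapq

def n_best_hybrid(seq, N):
    # Keep the N largest items of seq.  Run a cutoff-and-buffer gimmick,
    # but count how often the buffer has to be culled; if the data keeps
    # defeating the cutoff, fall back to plain heapq bookkeeping.
    it = iter(seq)
    buf = []
    cutoff = -1e200
    culls = seen = 0
    for x in it:
        seen += 1
        if x > cutoff:
            buf.append(x)
            if len(buf) >= 3 * N:   # buffer full:  keep only the N largest
                buf.sort()
                del buf[:-N]
                cutoff = buf[0]
                culls += 1
        if seen >= 10000 and culls * 20 * N > seen:
            break                   # the gimmick isn't paying; bail out
    # Defensive path:  a min-heap of the N best so far never degrades
    # badly, no matter how the input is ordered.
    buf.sort()
    heap = buf[-N:]
    heapq.heapify(heap)
    for x in it:
        if len(heap) < N:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    heap.sort()
    return heap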

>> Here's a surprise:  I coded a variant of the quicksort-like
>> partitioning method, at the bottom of this mail.  On the largest-1000
>> of a million random-float case, times were remarkably steady across
>> trials (i.e., using a different set of a million random floats each
>> time):
>>
>> heapq                    0.96 seconds
>> sort (micro-optimized)   3.4  seconds
>> KBest (below)            2.6  seconds

> Huh.  You're almost convincing me that asymptotic analysis works even in
> the presence of Python's compiled-vs-interpreted anomalies.

Indeed, you can't fight the math!  It often takes a large problem for better
O() behavior to overcome a smaller constant in a worse O() approach, and
especially in Python.  For example, I once wrote and tuned and timed an O(N)
worst-case rank algorithm in Python ("find the k'th smallest item in a
sequence"), using the median-of-medians-of-5 business.  I didn't have enough
RAM at the time to create a list big enough for it to beat "seq.sort();
return seq[k]".  By playing lots of tricks, and boosting it to
median-of-medians-of-11, IIRC I eventually got it to run faster than sorting
on lists with "just" a few hundred thousand elements.
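
For the record, the bare idea looks something like this -- a sketch of
textbook median-of-medians-of-5 selection, nothing like the tuned code
described above:

def select(seq, k):
    # Return the k'th smallest item of seq (0-based), O(len(seq)) worst case.
    a = list(seq)
    while True:
        if len(a) <= 5:
            a.sort()
            return a[k]
        # Median of each group of 5, then the median of those medians.
        medians = []
        for i in range(0, len(a), 5):
            group = sorted(a[i:i+5])
            medians.append(group[len(group) // 2])
        pivot = select(medians, len(medians) // 2)
        # 3-way partition around the pivot.
        lt = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        gt = [x for x in a if x > pivot]
        if k < len(lt):
            a = lt
        elif k < len(lt) + len(eq):
            return pivot
        else:
            k -= len(lt) + len(eq)
            a = gt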

But in *this* case I'm not sure that the only thing we're really measuring
isn't:

1. Whether an algorithm has an early-out gimmick.
2. How effective that early-out gimmick is.
and
3. How expensive it is to *try* the early-out gimmick.

The heapq method Rulz on random data because its answers then are "yes,
very, dirt cheap".  I wrote the KBest test like so:

def three(seq, N):
    NBest = KBest(N, -1e200)    # keep the N largest; -1e200 is the initial cutoff
    for x in seq:
        NBest.put(x)
    L = NBest.get()
    L.sort()
    return L

(the sort at the end is just so the results can be compared against the
other methods, to ensure they all get the same answer).  If I break into the
abstraction and change the test like so:

def three(seq, N):
    NBest = KBest(N, -1e200)
    cutoff = -1e200             # local mirror of NBest's cutoff
    for x in seq:
        if x > cutoff:          # inline early-out:  skip the put() call entirely
            NBest.put(x)
            cutoff = NBest.cutoff
    L = NBest.get()
    L.sort()
    return L

then KBest is about 20% *faster* than heapq on random data.  Doing the
comparison inline avoids a method call when early-out pays, early-out pays
more and more as the sequence nears its end, and simply avoiding the method
call then makes the overall algorithm 3X faster.  So O() analysis may
triumph when equivalent low-level speed tricks are played (the heapq method
did its early-out test inline too), but it can get swamped by constant-factor
differences before that point.
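
For concreteness, the heapq flavor looks roughly like this (a sketch, not
the exact code behind the timings above):

import heapq
from itertools import islice

def n_best_heapq(seq, N):
    # Min-heap of the N largest seen so far; heap[0] is the current cutoff,
    # and the "x > heap[0]" comparison is the inline early-out test.
    it = iter(seq)
    heap = list(islice(it, N))
    heapq.heapify(heap)
    for x in it:
        if x > heap[0]:
            heapq.heapreplace(heap, x)
    heap.sort()
    return heap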

> The other surprise is that (unlike, say, the sort or heapq versions)
> your KBest doesn't look significantly more concise than my earlier Java
> implementation.

The only thing I was trying to minimize was my time in whipping up something
correct to measure.  Still, I count 107 non-blank, non-comment lines of
Java, and 59 of Python.  Java gets unduly penalized for curly braces, Python
for tedious tricks like

        buf = self.buf
        k = self.k

to make locals for speed, and for the fact that I never put dependent code
on the same line as an "if" or "while" test (while you do).  Note that it's
not quite
the same algorithm:  the Python version isn't restricted to ints, and in
particular doesn't assume it can do arithmetic on a key to get "the next
larger" key.  Instead it does 3-way partitioning to find the items equal to
the pivot.  The greater generality may make the Python a little windier.
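
That 3-way split amounts to no more than this (a sketch with made-up names,
not the KBest code itself):

def split3(buf, pivot):
    # Partition around the pivot *value*, using only comparisons -- no
    # arithmetic on keys is assumed.
    less    = [x for x in buf if x < pivot]
    equal   = [x for x in buf if x == pivot]
    greater = [x for x in buf if x > pivot]
    return less, equal, greater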

BTW, the heapq code isn't really more compact than C, if you count the
implementation code in heapq.py too:  it's all low-level small-int
arithmetic and array indexing.  The only real advantage Python has over
assembler for code like that is that we can grow the list/heap dynamically
without any coding effort.
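
The flavor of those inner loops is roughly this (paraphrased from memory,
not copied out of heapq.py):

def siftup_toward_root(heap, startpos, pos):
    # Move heap[pos] toward the root until it's no smaller than its parent.
    # Nothing here but integer index arithmetic and list indexing.
    newitem = heap[pos]
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if newitem < parent:
            heap[pos] = parent
            pos = parentpos
        else:
            break
    heap[pos] = newitem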