[Python-Dev] Re: heaps

Tim Peters tim.one@comcast.net
Sun, 04 May 2003 22:20:09 -0400


[David Eppstein]
> Ok, I think you're right, for this sort of thing heapq is better.
> One could extend my priorityDictionary code to limit memory like
> this but it would be unnecessary work when the extra features it
> has over heapq are not used for this sort of algorithm.

I don't believe memory usage was an issue here.  Take a look at the code
again (comments removed):

"""
q = priorityDictionary()

for dummy in xrange(N):
    q[Person(-1)] = -1

for person in people:
    if person.income > q.smallest().income:
        del q[q.smallest()]
        q[person] = person.income
"""

q starts with N entries.  Each trip around the loop either leaves the q
contents alone, or both removes and adds an entry.  So the size of the dict
is a loop invariant, len(q) == N.  In the cases where it does remove an
entry, it always removes the smallest entry, and the entry being added is
strictly larger than that, so calling q.smallest() at the start of the next
loop trip finds the just-deleted smallest entry still in self.__heap[0], and
removes it.  So the internal list may grow to N+1 entries immediately
following

        del q[q.smallest()]

but by the time we get to that line again it should be back to N entries.
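
To make the invariant concrete, here's a small self-contained sketch.
PriorityDict is a hypothetical stand-in for David's priorityDictionary (a
plain dict with a linear-scan smallest() and no lazy deletion), and floats
stand in for Person incomes; the asserts check that len(q) == N holds on
every trip around the loop:

"""
from random import random

class PriorityDict(dict):
    # Toy stand-in:  smallest() returns the key with the smallest value
    # via a linear scan (the real class keeps a heap internally).
    def smallest(self):
        best = None
        for key in self:
            if best is None or self[key] < self[best]:
                best = key
        return best

N = 5
q = PriorityDict()
for i in xrange(N):
    q[("dummy", i)] = -1                # N sentinel entries, like Person(-1)

serial = 0
for income in [random() for dummy in xrange(1000)]:
    assert len(q) == N                  # the loop invariant
    if income > q[q.smallest()]:
        del q[q.smallest()]             # briefly down to N-1 entries ...
        serial += 1
        q[("person", serial)] = income  # ... and back up to N

assert len(q) == N
"""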

The reasons I found heapq easier to live with in this specific app had more
to do with the subtleties involved in sidestepping potential problems with
__hash__, __cmp__, and the speed of tuple comparison when the first tuple
elements tie.  heapq also supplies a "remove current smallest and replace
with a new value" primitive, which happens to be just right for this app
(that isn't an accident <wink>):

"""
dummy = Person(-1)
q = [dummy] * N

for person in people:
    if person > q[0]:
        heapq.heapreplace(q, person)
"""

> On the other hand, if you really want to find the n best items in a data
> stream large enough that you care about using only space O(n), it might
> also be preferable to take constant amortized time per item rather than
> the O(log n) that heapq would use,

In practice, it's usually much faster than that.  Over time, it gets rarer
and rarer for

    person > q[0]

to be true (the new person has to be larger than the N-th largest seen so
far, and that bar gets raised whenever a new person manages to hurdle it),
and the vast majority of sequence elements are disposed of via that single
Python statement (the ">" test fails, and we move on to the next element
with no heap operations).  In the simplest case, if N==1, the incoming data
is randomly ordered, and the incoming sequence has M elements, the if-test
is true (on average) only about ln(M) times (the expected number of left-to-right
maxima).  The order statistics get more complicated as N increases, of
course, but in practice it remains very fast, and doing a heapreplace() on
every incoming item is the worst case (achieved if the items come in sorted
order; the best case is when they come in reverse-sorted order, in which
case min(M, N) heapreplace() operations are done).
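
That's easy to check empirically.  Here's a small sketch (not part of the
original post) that counts how often the guard fires, assuming a million
random floats as input; exact counts vary from run to run:

"""
from heapq import heapreplace
from math import log
from random import random

def count_replaces(seq, N):
    # Count how many incoming items actually beat the current N-th best.
    q = [-1] * N
    hits = 0
    for x in seq:
        if x > q[0]:
            heapreplace(q, x)
            hits += 1
    return hits

M = 1000000
seq = [random() for dummy in xrange(M)]

print "N=1:    ", count_replaces(seq, 1), "replaces; ln(M) ~", log(M)
print "N=1000: ", count_replaces(seq, 1000), "replaces out of", M

asc = seq[:]
asc.sort()                  # sorted input:  every item triggers a replace
desc = asc[:]
desc.reverse()              # reverse-sorted input:  only min(M, N) replaces
print "sorted: ", count_replaces(asc, 1000)
print "reverse:", count_replaces(desc, 1000)
"""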

> and it's not very difficult nor does it require any fancy data
> structures.  Some time back I needed some Java code for this,
> haven't had an excuse to port it to Python.  In case anyone's
> interested, it's online at
> <http://www.ics.uci.edu/~eppstein/161/KBest.java>.
> Looking at it now, it seems more complicated than it needs to be, but
> maybe that's just the effect of writing in Java instead of Python
> (I've seen an example of a  three-page Java implementation of an
> algorithm in a textbook that could  easily be done in a dozen Python
> lines).

Cool!  I understood the thrust but not the details -- and I agree Java must
be making it harder than it should be <wink>.

> In python, this could be done without randomization as simply as
>
> def addToNBest(L,x,N):
>    L.append(x)
>    if len(L) > 2*N:
>        L.sort()
>        del L[N:]
>
> It's not constant amortized time due to the sort, but that's probably
> more than made up for due to the speed of compiled sort versus
> interpreted randomized pivot.

I'll attach a little timing script.  addToNBest is done inline there; some
low-level tricks were played to speed it up, and it was changed to be a max
N-best instead of a min N-best.  Note that the list sort in 2.3 has a real
advantage over Pythons before 2.3 here, because it recognizes (in linear
time) that the first half of the list is already in sorted order (on the
second & subsequent sorts), and leaves it alone until a final merge step
with the other half of the array.
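
Under 2.3 the effect is easy to see in isolation; here's a rough sketch (not
from the original post, and absolute timings depend on the box and on clock()
resolution) that sorts a list whose first half is already in order, then
sorts a fully shuffled copy for comparison:

"""
from random import random, shuffle
from time import clock as now

M = 1000000
prefix = [random() for dummy in xrange(M // 2)]
prefix.sort()                        # already-sorted first half
suffix = [random() for dummy in xrange(M - M // 2)]

partly_sorted = prefix + suffix      # like L just before a periodic sort
scrambled = partly_sorted[:]
shuffle(scrambled)

s = now()
partly_sorted.sort()
print "sort with sorted prefix:", now() - s

s = now()
scrambled.sort()
print "sort of shuffled copy:  ", now() - s
"""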

The relative speed (compared to the heapq code) varies under 2.3, seeming to
depend mostly on M/N.  The test case is set up to find the 1000 largest of a
million random floats.  In that case the sorting method takes about 3.4x
longer than the heapq approach.  As N gets closer to M, the sorting method
eventually wins; when M and N are both a million, the sorting method is 10x
faster.  For most N-best apps, N is much smaller than M, and the heapq code
should be quicker unless the data is already in order.

Attachment: timeq.py

def one(seq, N):
    # Heap method:  keep the N largest items seen so far in a min-heap,
    # replacing the current smallest whenever a larger item shows up.
    from heapq import heapreplace

    L = [-1] * N
    for x in seq:
        if x > L[0]:
            heapreplace(L, x)

    L.sort()
    return L

def two(seq, N):
    # Sort method:  append items freely; whenever the list grows past
    # 2*N entries, sort it and keep only the largest N.
    L = []
    push = L.append
    twoN = 2*N
    for x in seq:
        push(x)
        if len(L) > twoN:
            L.sort()
            del L[:-N]

    L.sort()
    del L[:-N]
    return L

def timeit(seq, N):
    # Time both methods on the same data and check that they agree.
    from time import clock as now

    s = now()
    r1 = one(seq, N)
    t = now()
    e1 = t - s

    s = now()
    r2 = two(seq, N)
    t = now()
    e2 = t - s

    print len(seq), N, e1, e2
    assert r1 == r2

def tryone(M, N):
    # Build M random floats and find the N largest with each method.
    from random import random
    seq = [random() for dummy in xrange(M)]
    timeit(seq, N)

for i in range(10):
    tryone(1000000, 1000)
