[David Eppstein, on the bar-raising behavior of person > q]
Good point. If any permutation of the input sequence is equally likely, and you're selecting the best k out of n items, the expected number of times you have to hit the data structure in your heapq solution is roughly k ln n, so the total expected time is O(n + k log k log n), with a really small constant factor on the O(n) term. The sorting solution I suggested has total time O(n log k), and even though sorting is built-in and fast it can't compete when k is small. Random pivoting is O(n + k), but with a larger constant factor, so your heapq solution looks like a winner.
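For concreteness, the bar-raising heap method under discussion looks roughly like this (a Python 3 sketch of the idea, not the actual test code from this thread; later Pythons ship the same trick as heapq.nlargest):

```python
import heapq
import random

def k_largest(iterable, k):
    # Min-heap of the k largest items seen so far; heap[0] is the "bar".
    heap = []
    for x in iterable:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # Only items beating the current k-th largest touch the heap;
            # on randomly ordered input this happens about k*ln(n) times,
            # so the common case is a single comparison and no heap work.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

random.seed(42)
data = [random.random() for _ in range(100000)]
assert k_largest(data, 1000) == sorted(data, reverse=True)[:1000]
```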
In real Python Life, it's the fastest way I know (depending ...).
[David Eppstein]
For fairness, it might be interesting to try another run of your test in which the input sequence is sorted in increasing order rather than random.
Comparing the worst case of one against the best case of the other isn't my idea of fairness <wink>, but sure. On the best-1000 of a million floats test, and sorting the floats first, the heap method ran about 30x slower than on random data, and the sort method ran significantly faster than on random data (a factor of 1.3x faster). OTOH, if I undo my speed tricks and call a function in the sort method (instead of doing it all micro-optimized inline), that slows the sort method by a bit over a factor of 2.
[David Eppstein]
I.e., replace the random generation of seq by seq = range(M). I'd try it myself, but I'm still running Python 2.2 and haven't installed heapq. I'd have to know more about your application to have an idea whether the sorted or randomly-permuted case is more representative.
Of course -- so would I <wink>.
Here's a surprise: I coded a variant of the quicksort-like partitioning method, at the bottom of this mail. On the largest-1000 of a million random-float case, times were remarkably steady across trials (i.e., using a different set of a million random floats each time):
    heapq                   0.96 seconds
    sort (micro-optimized)  3.4  seconds
    KBest (below)           2.6  seconds
The KBest code creates new lists with wild abandon. I expect it does better than the sort method anyway because it gets to exploit its own form of "raise the bar" behavior as more elements come in. For example, on the first run, len(buf) exceeded 3000 only 14 times, and the final pivot value each time is used by put() as an "ignore the input unless it's bigger than that" cutoff:
    pivoted w/ 0.247497558554
    pivoted w/ 0.611006884768
    pivoted w/ 0.633565558936
    pivoted w/ 0.80516673256
    pivoted w/ 0.814304890889
    pivoted w/ 0.884660572175
    pivoted w/ 0.89986744075
    pivoted w/ 0.946575251872
    pivoted w/ 0.980386533221
    pivoted w/ 0.983743795382
    pivoted w/ 0.992381911217
    pivoted w/ 0.994243625292
    pivoted w/ 0.99481443021
    pivoted w/ 0.997044443344
The already-sorted case is also a bad case for this method, because then the pivot is never big enough to trigger the early exit in put().
def split(seq, pivot):
    lt, eq, gt = [], [], []
    lta, eqa, gta = lt.append, eq.append, gt.append
    for x in seq:
        c = cmp(x, pivot)
        if c < 0:
            lta(x)
        elif c:
            gta(x)
        else:
            eqa(x)
    return lt, eq, gt

# KBest(k, minusinf) remembers the largest k objects
# from a sequence of objects passed one at a time to
# put().  minusinf must be smaller than any object
# passed to put().  After feeding in all the objects,
# call get() to retrieve a list of the k largest (or
# as many as were passed to put(), if put() was called
# fewer than k times).

class KBest(object):
    __slots__ = 'k', 'buflim', 'buf', 'cutoff'

    def __init__(self, k, minusinf):
        self.k = k
        self.buflim = 3*k
        self.buf = []
        self.cutoff = minusinf

    def put(self, obj):
        if obj <= self.cutoff:
            return

        buf = self.buf
        buf.append(obj)
        if len(buf) <= self.buflim:
            return

        # Reduce len(buf) by at least one, by retaining
        # at least k, and at most len(buf)-1, of the
        # largest objects in buf.
        from random import choice
        sofar = []
        k = self.k
        while len(sofar) < k:
            pivot = choice(buf)
            buf, eq, gt = split(buf, pivot)
            sofar.extend(gt)
            if len(sofar) < k:
                sofar.extend(eq[:k - len(sofar)])

        self.buf = sofar
        self.cutoff = pivot

    def get(self):
        from random import choice
        buf = self.buf
        k = self.k
        if len(buf) <= k:
            return buf

        # Retain only the k largest.
        sofar = []
        needed = k
        while needed:
            pivot = choice(buf)
            lt, eq, gt = split(buf, pivot)
            if len(gt) <= needed:
                sofar.extend(gt)
                needed -= len(gt)
                if needed:
                    takefromeq = min(len(eq), needed)
                    sofar.extend(eq[:takefromeq])
                    needed -= takefromeq
                # If we still need more, they have to
                # come out of things < pivot.
                buf = lt
            else:
                # gt alone is too large.
                buf = gt

        assert len(sofar) == k
        self.buf = sofar
        return sofar
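For anyone who wants to exercise KBest on a current interpreter: here's a condensed Python 3 port with a sanity check against plain sorting. The port and the check are mine, not from this thread; Python 3 has no cmp(), so split() compares directly.

```python
import random

def split(seq, pivot):
    # Three-way partition, as in the original split().
    lt, eq, gt = [], [], []
    for x in seq:
        if x < pivot:
            lt.append(x)
        elif x > pivot:
            gt.append(x)
        else:
            eq.append(x)
    return lt, eq, gt

class KBest(object):
    def __init__(self, k, minusinf):
        self.k = k
        self.buflim = 3 * k
        self.buf = []
        self.cutoff = minusinf

    def put(self, obj):
        if obj <= self.cutoff:        # the "raise the bar" early exit
            return
        self.buf.append(obj)
        if len(self.buf) <= self.buflim:
            return
        buf, sofar, k = self.buf, [], self.k
        while len(sofar) < k:
            pivot = random.choice(buf)
            buf, eq, gt = split(buf, pivot)
            sofar.extend(gt)
            if len(sofar) < k:
                sofar.extend(eq[:k - len(sofar)])
        self.buf = sofar
        self.cutoff = pivot           # new, higher bar

    def get(self):
        buf, k = self.buf, self.k
        if len(buf) <= k:
            return buf
        sofar, needed = [], k
        while needed:
            pivot = random.choice(buf)
            lt, eq, gt = split(buf, pivot)
            if len(gt) <= needed:
                sofar.extend(gt)
                needed -= len(gt)
                take = min(len(eq), needed)
                sofar.extend(eq[:take])
                needed -= take
                buf = lt              # any stragglers are < pivot
            else:
                buf = gt              # gt alone is too large
        self.buf = sofar
        return sofar

random.seed(123)
seq = [random.random() for _ in range(100000)]
kb = KBest(1000, -1.0)
for x in seq:
    kb.put(x)
assert sorted(kb.get(), reverse=True) == sorted(seq, reverse=True)[:1000]
```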