[David Eppstein]
Ok, I think you're right, for this sort of thing heapq is better. One could extend my priorityDictionary code to limit memory like this, but it would be unnecessary work when the extra features it has over heapq are not used for this sort of algorithm.
I don't believe memory usage was an issue here. Take a look at the code again (comments removed):

"""
q = priorityDictionary()
for dummy in xrange(N):
    q[Person(-1)] = -1
for person in people:
    if person.income > q.smallest().income:
        del q[q.smallest()]
        q[person] = person.income
"""

q starts with N entries. Each trip around the loop either leaves the q contents alone, or both removes and adds an entry, so the size of the dict is a loop invariant: len(q) == N. In the cases where it does remove an entry, it always removes the smallest entry, and the entry being added is strictly larger than that, so calling q.smallest() at the start of the next loop trip finds the just-deleted smallest entry still in self.__heap[0], and removes it. So the internal list may grow to N+1 entries immediately following del q[q.smallest()], but by the time we get to that line again it should be back to N entries.

The reasons I found heapq easier to live with in this specific app had more to do with the subtleties involved in sidestepping potential problems with __hash__, __cmp__, and the speed of tuple comparison when the first tuple elements tie. heapq also supplies a "remove current smallest and replace with a new value" primitive, which happens to be just right for this app (that isn't an accident <wink>):

"""
dummy = Person(-1)
q = [dummy] * N
for person in people:
    if person > q[0]:
        heapq.heapreplace(q, person)
"""
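For concreteness, here's a self-contained sketch of that heapreplace pattern in modern Python, with plain numbers standing in for the Person objects and float("-inf") in place of the Person(-1) sentinel:

```python
import heapq

def n_largest(items, n):
    # Seed the heap with n sentinels smaller than any real item,
    # mirroring the q = [dummy] * N trick above.
    q = [float("-inf")] * n
    for item in items:
        if item > q[0]:
            # q[0] is the smallest of the current best n; replace it
            # and restore the heap invariant in one O(log n) step.
            heapq.heapreplace(q, item)
    return sorted(q, reverse=True)
```

If fewer than n real items arrive, some sentinels survive in the result; a production version would want to handle that, but it doesn't matter for the M >> N streams discussed here.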
On the other hand, if you really want to find the n best items in a data stream large enough that you care about using only space O(n), it might also be preferable to take constant amortized time per item rather than the O(log n) that heapq would use,
In practice, it's usually much faster than that. Over time, it gets rarer and rarer for person > q[0] to be true (the new person has to be larger than the N-th largest seen so far, and that bar gets raised whenever a new person manages to hurdle it), and the vast majority of sequence elements are disposed of via that single Python statement (the ">" test fails, and we move on to the next element with no heap operations). In the simplest case, if N == 1, the incoming data is randomly ordered, and the incoming sequence has M elements, the if-test is true (on average) only ln(M) times (the expected number of left-to-right maxima). The order statistics get more complicated as N increases, of course, but in practice it remains very fast, and doing a heapreplace() on every incoming item is the worst case (achieved if the items come in sorted order; the best case is when they come in reverse-sorted order, in which case min(M, N) heapreplace() operations are done).
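The ln(M) claim for N == 1 is easy to check empirically: the number of times the if-test fires is exactly the number of left-to-right maxima. A quick simulation (the parameters here are my own, purely illustrative):

```python
import math
import random

def left_to_right_maxima(seq):
    # How often a new element beats everything seen so far -- i.e. how
    # often the N == 1 version of the loop would touch the heap.
    count = 0
    best = float("-inf")
    for x in seq:
        if x > best:
            best, count = x, count + 1
    return count

random.seed(2003)
M, trials = 100000, 50
avg = sum(left_to_right_maxima(random.sample(range(10**9), M))
          for _ in range(trials)) / trials
# The exact expectation is the harmonic number H(M) ~ ln(M) + 0.5772...
expected = math.log(M) + 0.5772
```

For M of a hundred thousand that's an average of only about a dozen heap operations over the whole stream.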
and it's not very difficult, nor does it require any fancy data structures. Some time back I needed some Java code for this, but haven't had an excuse to port it to Python. In case anyone's interested, it's online at http://www.ics.uci.edu/~eppstein/161/KBest.java. Looking at it now, it seems more complicated than it needs to be, but maybe that's just the effect of writing in Java instead of Python (I've seen an example of a three-page Java implementation of an algorithm in a textbook that could easily be done in a dozen Python lines).
Cool! I understood the thrust but not the details -- and I agree Java must be making it harder than it should be <wink>.
In Python, this could be done without randomization as simply as:
def addToNBest(L, x, N):
    L.append(x)
    if len(L) > 2*N:
        L.sort()
        del L[N:]
It's not constant amortized time due to the sort, but that's probably more than made up for by the speed of a compiled sort versus an interpreted randomized pivot.
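A quick sanity check of addToNBest (the test data and the final truncation step are my own additions; since the buffer can hold up to 2*N items, the caller trims it at the end). The key invariant is that anything among the true N smallest is never discarded, because it is among the N smallest of every prefix:

```python
import random

def addToNBest(L, x, N):
    # Buffer up to 2*N candidates; when full, sort and keep the N smallest.
    # Appends are O(1), and the O(N log N) sort happens only once per N
    # appends, so the amortized cost per item is O(log N) at worst --
    # but in a fast compiled sort rather than interpreted code.
    L.append(x)
    if len(L) > 2 * N:
        L.sort()
        del L[N:]

random.seed(0)
data = random.sample(range(10**6), 10000)
best = []
for x in data:
    addToNBest(best, x, 5)
best.sort()
del best[5:]  # final trim: the buffer may still hold up to 2*N items
```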
I'll attach a little timing script. addToNBest is done inline there, some low-level tricks were played to speed it, and it was changed to be a max N-best instead of a min N-best.

Note that the list sort in 2.3 has a real advantage over Pythons before 2.3 here, because it recognizes (in linear time) that the first half of the list is already in sorted order (on the second & subsequent sorts), and leaves it alone until a final merge step with the other half of the array.

The relative speed (compared to the heapq code) varies under 2.3, seeming to depend mostly on M/N. The test case is set up to find the 1000 largest of a million random floats. In that case the sorting method takes about 3.4x longer than the heapq approach. As N gets closer to M, the sorting method eventually wins; when M and N are both a million, the sorting method is 10x faster. For most N-best apps, N is much smaller than M, and the heapq code should be quicker unless the data is already in order.
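The attached timing script isn't reproduced here, but a minimal modern-Python version of the comparison might look like this (max N-best in both cases, as in the script; the sizes are scaled down from the million-float test and the function names are mine):

```python
import heapq
import random
import time

def nbest_heap(data, n):
    # Max N-best via a min-heap of the current best n.
    q = [float("-inf")] * n
    for x in data:
        if x > q[0]:
            heapq.heapreplace(q, x)
    return sorted(q, reverse=True)

def nbest_sort(data, n):
    # Max N-best via periodic sort-and-truncate (addToNBest inline,
    # flipped to keep the largest items instead of the smallest).
    L = []
    for x in data:
        L.append(x)
        if len(L) > 2 * n:
            L.sort(reverse=True)
            del L[n:]
    L.sort(reverse=True)
    return L[:n]

random.seed(1)
data = [random.random() for _ in range(100000)]
for f in (nbest_heap, nbest_sort):
    t0 = time.perf_counter()
    result = f(data, 1000)
    print(f.__name__, round(time.perf_counter() - t0, 3))
```

The actual ratios will depend on the Python version and on M/N, as discussed above; the point of running both is just that they agree on the answer while doing very different amounts of heap work versus sort work.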