Keeping track of the N largest values

Sat Dec 25 14:16:27 EST 2010

In article <Xns9E59A44D7CC49duncanbooth at 127.0.0.1>,
 Duncan Booth <duncan.booth at invalid.invalid> wrote:

> Roy Smith <roy at panix.com> wrote:
> 
> > 
> > I'm processing a stream of N numbers and want to keep track of the K 
> > largest.  There's too many numbers in the stream (i.e. N is too large)
> > to keep in memory at once.  K is small (100 would be typical).
> > ... 
> > Is there a better way to do this, either from a theoretical running
> > time point of view, or just a nicer way to code this in Python?
> 
> How about:
> 
>   from heapq import nlargest
>   top = nlargest(K, input())
> 
> It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's 
another twist.  In addition to finding the K largest values, I *also* 
need to do some other processing on all the values (which I didn't show 
in the original example, incorrectly thinking it not germane to my 
question).  The problem with nlargest() is that it doesn't give me a 
hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory 
proportional to n, and not to the iterator length.  That's the only 
reasonable conclusion, but the docs don't actually come out and say it.