Keeping track of the N largest values

Sat Dec 25 11:04:07 EST 2010

Roy Smith wrote:

> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large)
> to keep in memory at once.  K is small (100 would be typical).
> 
> From a theoretical point of view, I should be able to do this in N log K
> time.  What I'm doing now is essentially:
> 
> top = [-1]    # Assume all x are >= 0
> for x in input():
>     if x <= top[0]:
>         continue
>     top.append(x)
>     if len(top) > K:
>         top.sort()
>         top.pop(0)
> 
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and random
> distribution of values), this should be pretty close to N.
> 
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

http://docs.python.org/library/heapq.html#heapq.nlargest