Keeping track of the N largest values
Peter Otten
__peter__ at web.de
Sun Dec 26 05:36:40 EST 2010
Roy Smith wrote:
> In article <Xns9E59A44D7CC49duncanbooth at 127.0.0.1>,
> Duncan Booth <duncan.booth at invalid.invalid> wrote:
>
>> Roy Smith <roy at panix.com> wrote:
>>
>> >
>> > I'm processing a stream of N numbers and want to keep track of the K
>> > largest. There's too many numbers in the stream (i.e. N is too large)
>> > to keep in memory at once. K is small (100 would be typical).
>> > ...
>> > Is there a better way to do this, either from a theoretical running
>> > time point of view, or just a nicer way to code this in Python?
>>
>> How about:
>>
>> from heapq import nlargest
>> top = nlargest(K, input())
>>
>> It uses a heap so avoids completely resorting each time.
>
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist. In addition to finding the K largest values, I *also*
> need to do some other processing on all the values (which I didn't show
> in the original example, incorrectly thinking it not germane to my
> question). The problem with nlargest() is that it doesn't give me a
> hook to do that.
If Paul's solution doesn't suffice -- the heapq module has the building
blocks for a custom solution, e. g.:
import heapq
from functools import partial
class NLargest(object):
def __init__(self, n):
self.n = n
self.heap = []
def tally(self, item):
heap = self.heap
if len(heap) >= self.n:
heapq.heappushpop(heap, item)
self.tally = partial(heapq.heappushpop, heap)
else:
heapq.heappush(heap, item)
def __str__(self):
return str(sorted(self.heap, reverse=True))
if __name__ == "__main__":
import random
items = range(100)
random.shuffle(items)
accu = NLargest(10)
for item in items:
accu.tally(item)
print item, accu
> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length. That's the only
> reasonable conclusion, but the docs don't actually come out and say it.
I would hope so.
More information about the Python-list
mailing list