space-efficient top-N algorithm

Christos TZOTZIOY Georgiou DLNXPEGFQVEB at spammotel.com
Sun Feb 9 08:43:28 EST 2003


On Sun, 09 Feb 2003 18:56:06, rumours say that David Garamond
<lists at zara.6.isreserved.com> might have written:

>Is there a more space-efficient algorithm to produce a list of top N 
>recurring items from a large collection? I've thought about sorting the 
>large log files externally, but reluctant to do this because of the 
>large size of the files and the need to frequently produce reports while 
>new logs arrive.

A solution using heapq (for ordering while reading) + sets (for
uniqueness) might be appropriate.  You just have to del heap[N:] once in
a while to avoid excess memory consumption.
-- 
TZOTZIOY, I speak England very best,
Real email address: 'dHpvdEBzaWwtdGVjLmdy\n'.decode('base64')




More information about the Python-list mailing list