space-efficient top-N algorithm

Mon Feb 10 02:00:58 EST 2003

David Garamond <lists at zara.6.isreserved.com> wrote in message news:<mailman.1044791875.2092.python-list at python.org>...
> I am processing referrer log files with Python. The log files are in 
> ASCII text and line-based, similar to a webserver log but produced by 
> our banner management (web bug, more precisely) software. Each line 
> contains, among other, a URL, a timestamp, and an IP address. One of the 
> desired report is the list of top 50 recurring URLs within a certain 
> period (e.g. monthly or weekly).
> 
> What I currently do is just read the log files line by line, parse it 
> into its elements, and store the URL string as the key of a dictionary 
> and the number of occurence as the value. In the end I would just get 
> back to the dictionary and print the keys in descending order of the values.
>
> However, the number of URLs are large and some of the URLs are long 
> (>100 characters). My process grows into more than 100MB in size. I 
> already cut the URLs to a max of 80 characters before entering them into 
>   the dictionary, but it doesn't help much.
> 
> Is there a more space-efficient algorithm to produce a list of top N 
> recurring items from a large collection? I've thought about sorting the 
> large log files externally, but reluctant to do this because of the 
> large size of the files and the need to frequently produce reports while 
> new logs arrive.

I take it by the references to sorting and descending order that you're
getting the top 50 elements by sorting the dict?  ie. something like:

l = [(d[url],url) for url in d]
l.sort()
top50 = l[:50]
for count,url in top50:
    print url, count

In which case you can avoid the overhead of duplicating a list, as well
as the O(N*lg(N)) sorting step by just iterating over the dictionary
and maintaining a 50-element list of the top 50.

Sample code: (Warning: Not tested much, needs python2.2):

import bisect

def get_top50(filename):
    d={}
    file = open(filename)
    for line in file:
        url=line.strip()  # strip \n, spaces
        d[url] = d.get(url,0) + 1
    file.close()

    top50 = []
    for url,count in d.iteritems():
        if (len(top50) < 50) or count > top50[0][0]:
            bisect.insort(top50,(count,url))
            if len(top50) > 50:
                top50.pop(0)   # Remove lowest element

    top50.reverse()  # Order highest -> lowest
    return top50

if __name__=='__main__':
    for count,url in get_top50('urls.txt'):
        print url, ":",count

You still have the dictionary in memory, but since you iterate over
the dictionary rather than creating any duplicate lists, you have much
less memory overhead.  Also reduced CPU overhead, since the algorithm
is O(N) wrt the number of urls, as opposed to O(N*lg(N)).

Hope this helps.

Brian.