space-efficient top-N algorithm

David Garamond lists at zara.6.isreserved.com
Sun Feb 9 18:56:06 CET 2003


I am processing referrer log files with Python. The log files are in 
ASCII text and line-based, similar to a webserver log, but produced by 
our banner management (web bug, more precisely) software. Each line 
contains, among other things, a URL, a timestamp, and an IP address. One 
of the desired reports is the list of the top 50 recurring URLs within a 
certain period (e.g. monthly or weekly).

What I currently do is just read the log files line by line, parse each 
line into its elements, and store the URL string as a dictionary key 
with the number of occurrences as its value. At the end I go back over 
the dictionary and print the keys in descending order of their values.
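A minimal sketch of that approach (the field positions in the split are 
an assumption about the log format, and heapq.nlargest stands in for the 
final sort, since only the top N are needed):

```python
import heapq

def top_urls(logfile, n=50, maxlen=80):
    """Count URL occurrences line by line, then report the top n."""
    counts = {}
    with open(logfile) as f:
        for line in f:
            fields = line.split()  # assumed: whitespace-separated fields
            if not fields:
                continue
            url = fields[0][:maxlen]  # assumed: URL first; truncate to 80
            counts[url] = counts.get(url, 0) + 1
    # heapq.nlargest avoids fully sorting the whole dictionary
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
```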

However, the number of URLs is large and some of the URLs are long 
(>100 characters). My process grows to more than 100MB in size. I 
already truncate the URLs to a maximum of 80 characters before entering 
them into the dictionary, but that doesn't help much.

Is there a more space-efficient algorithm for producing a list of the 
top N recurring items from a large collection? I've thought about 
sorting the large log files externally, but I am reluctant to do this 
because of the size of the files and the need to produce reports 
frequently while new logs keep arriving.
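For comparison, one classical bounded-memory technique for this kind of 
question is the Misra-Gries frequent-items summary: keep at most k-1 
counters, and when a new item arrives while all counters are taken, 
decrement every counter and drop those that reach zero. It yields only 
candidate heavy hitters with underestimated counts, so an exact top-50 
report would need a second counting pass restricted to the candidates. A 
minimal sketch:

```python
def misra_gries(items, k):
    """Bounded-memory candidate counts: at most k-1 counters are kept.
    Any item occurring more than len(items)/k times is guaranteed to
    survive, but the stored counts are underestimates."""
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement all counters; drop any that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Memory is O(k) regardless of how many distinct URLs the logs contain, 
which is the trade-off being asked about: approximate candidates in 
small space instead of exact counts in a huge dictionary.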

-- 
dave






More information about the Python-list mailing list