space-efficient top-N algorithm
lists at zara.6.isreserved.com
Sun Feb 9 18:56:06 CET 2003
I am processing referrer log files with Python. The log files are in
ASCII text and line-based, similar to a webserver log but produced by
our banner management (web bug, more precisely) software. Each line
contains, among other things, a URL, a timestamp, and an IP address. One
of the desired reports is the list of top 50 recurring URLs within a
certain period (e.g. monthly or weekly).
What I currently do is just read the log files line by line, parse each
line into its elements, and store the URL string as the key of a
dictionary and the number of occurrences as the value. In the end I just
go back through the dictionary and print the keys in descending order of
their values.
However, the number of URLs is large and some of the URLs are long
(>100 characters). My process grows to more than 100MB in size. I
already truncate the URLs to a maximum of 80 characters before entering
them into the dictionary, but it doesn't help much.
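For reference, the counting approach described above can be sketched as follows (parse_line is a hypothetical stand-in for the site-specific parser; here it simply assumes whitespace-separated url, timestamp, and IP fields):

```python
from collections import Counter

TOP_N = 50
MAX_URL_LEN = 80

def parse_line(line):
    # Hypothetical parser: the real log format is site-specific; we
    # assume whitespace-separated fields of URL, timestamp, IP address.
    url, timestamp, ip = line.split()[:3]
    return url, timestamp, ip

def top_urls(lines, n=TOP_N):
    """Count URL occurrences and return the n most frequent as
    (url, count) pairs in descending order of count."""
    counts = Counter()
    for line in lines:
        url, _timestamp, _ip = parse_line(line)
        # truncate long URLs before using them as dictionary keys
        counts[url[:MAX_URL_LEN]] += 1
    return counts.most_common(n)
```

The dictionary itself is what consumes the memory here, since it must hold one key per distinct URL seen in the period.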
Is there a more space-efficient algorithm to produce a list of the top N
recurring items from a large collection? I've thought about sorting the
large log files externally, but I am reluctant to do this because of the
large size of the files and the need to produce reports frequently while
new logs arrive.