Space-efficient top-N algorithm
Damien Morton
morton at dennisinter.com
Mon Feb 10 11:10:20 EST 2003
from array import array
TOPN = 50 # we want the top-50 urls
N = 1<<20 # a large power of two
mask = N-1 # mask corresponds to N
# memory use should be constant 4*N, for N >> TOPN
urlhash = hash
# note: use of hash(url) could be replaced with an md5-derived hash function
# import struct, md5
# urlhash = lambda url: struct.unpack("i12x", md5.new(url).digest())[0]
hashcount = array('l', [0]) * N
for url in open("logfile.txt").xreadlines():
    hashcount[urlhash(url) & mask] += 1
sortedcount = filter(None, hashcount) # copy the hashcount array, omitting 0s
sortedcount.sort()
mincount = sortedcount[-TOPN*10] # keep 10x slack to absorb hash collisions
del sortedcount
# now use mincount to filter out uninteresting urls
urlcount = {}
for url in open("logfile.txt").xreadlines():
    # >= (not >): a top-N url's bucket count can equal mincount exactly
    if hashcount[urlhash(url) & mask] >= mincount:
        urlcount[url] = urlcount.get(url, 0) + 1
sortedurl = [(-n, url) for (url, n) in urlcount.items()]
sortedurl.sort()
sortedurl = [url for (n, url) in sortedurl[:TOPN]] # first TOPN, not the rest
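The same two-pass scheme can be packaged as a self-contained function; this is a rough sketch in current Python, and the function name, bucket count, and slack factor are my own choices, not part of the code above:

```python
from array import array
from collections import defaultdict

def top_n_urls(lines, topn=3, nbuckets=1 << 12, slack=10):
    """Two-pass top-N: pass 1 counts hash buckets in a fixed-size array;
    pass 2 counts exactly only the URLs whose bucket cleared a threshold.
    'lines' must be re-iterable (a list, or a file re-opened per pass)."""
    mask = nbuckets - 1
    buckets = array('l', [0]) * nbuckets

    # Pass 1: bucket counts. Collisions can only inflate a bucket,
    # so no genuinely frequent URL can be filtered out later.
    for url in lines:
        buckets[hash(url) & mask] += 1

    # Threshold: keep up to slack*topn hot buckets to absorb collisions.
    nonzero = sorted(c for c in buckets if c)
    mincount = nonzero[-min(topn * slack, len(nonzero))]

    # Pass 2: exact counts, but only for candidate URLs.
    counts = defaultdict(int)
    for url in lines:
        if buckets[hash(url) & mask] >= mincount:
            counts[url] += 1

    return sorted(counts.items(), key=lambda kv: -kv[1])[:topn]
```

On a real log you would re-open the file between passes, as the original code does; memory stays bounded by the bucket array plus the (much smaller) candidate dictionary.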
David Garamond <lists at zara.6.isreserved.com> wrote in message news:<mailman.1044791875.2092.python-list at python.org>...
> I am processing referrer log files with Python. The log files are in
> ASCII text and line-based, similar to a webserver log but produced by
> our banner management (web bug, more precisely) software. Each line
> contains, among other things, a URL, a timestamp, and an IP address. One
> of the desired reports is the list of top 50 recurring URLs within a
> certain period (e.g. monthly or weekly).
>
> What I currently do is just read the log files line by line, parse each
> line into its elements, and store the URL string as the key of a
> dictionary and the number of occurrences as the value. In the end I would
> just go back to the dictionary and print the keys in descending order of
> the values.
>
> However, the number of URLs is large and some of the URLs are long
> (>100 characters). My process grows to more than 100MB in size. I
> already cut the URLs to a max of 80 characters before entering them into
> the dictionary, but it doesn't help much.
>
> Is there a more space-efficient algorithm to produce a list of top N
> recurring items from a large collection? I've thought about sorting the
> large log files externally, but am reluctant to do this because of the
> large size of the files and the need to frequently produce reports while
> new logs arrive.
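For reference, the dictionary approach the question describes looks roughly like this in current Python (the field layout assumed here is illustrative only); its memory use grows with the number of distinct URLs, which is exactly the problem the two-pass filter above avoids:

```python
from collections import Counter

def top_urls_naive(lines, topn=50):
    """One exact counter per distinct URL -- simple, but memory grows
    with the number of distinct URLs seen in the log."""
    counts = Counter()
    for line in lines:
        url = line.split()[0]  # assumes the URL is the first field
        counts[url] += 1
    return counts.most_common(topn)
```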