Space-efficient top-N algorithm
Damien Morton
morton at dennisinter.com
Mon Feb 10 11:10:20 EST 2003
from array import array
TOPN = 50 # we want the top-50 urls
N = 1<<20 # a large power of two
mask = N-1 # mask corresponds to N
# memory use should be constant 4*N, for N >> TOPN
urlhash = hash
# note: use of hash(url) could be replaced with an md5-derived hash function
# import struct, md5
# urlhash = lambda url: struct.unpack("i12x", md5.new(url).digest())[0]
hashcount = array('l', [0]) * N
for url in open("logfile.txt").xreadlines():
    hashcount[urlhash(url) & mask] += 1
sortedcount = filter(None, hashcount) # copy the hashcount array, omitting 0s
sortedcount.sort()
mincount = sortedcount[-TOPN*10] # keep 10x slack to absorb hash collisions
del sortedcount
# now use mincount to filter out uninteresting urls
urlcount = {}
for url in open("logfile.txt").xreadlines():
    # >= (not >): a top-N url's bucket count can equal mincount exactly
    if hashcount[urlhash(url) & mask] >= mincount:
        urlcount[url] = urlcount.get(url, 0) + 1
sortedurl = [(-n, url) for (url, n) in urlcount.items()]
sortedurl.sort()
sortedurl = [url for (n, url) in sortedurl[:TOPN]] # first TOPN, not the rest
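The same two-pass scheme can be packaged as a self-contained function; this is a rough sketch in current Python, and the function name, bucket count, and slack factor are my own choices, not part of the code above:

```python
from array import array
from collections import defaultdict

def top_n_urls(lines, topn=3, nbuckets=1 << 12, slack=10):
    """Two-pass top-N: pass 1 counts hash buckets in a fixed-size array;
    pass 2 counts exactly only the URLs whose bucket cleared a threshold.
    'lines' must be re-iterable (a list, or a file re-opened per pass)."""
    mask = nbuckets - 1
    buckets = array('l', [0]) * nbuckets

    # Pass 1: bucket counts. Collisions can only inflate a bucket,
    # so no genuinely frequent URL can be filtered out later.
    for url in lines:
        buckets[hash(url) & mask] += 1

    # Threshold: keep up to slack*topn hot buckets to absorb collisions.
    nonzero = sorted(c for c in buckets if c)
    mincount = nonzero[-min(topn * slack, len(nonzero))]

    # Pass 2: exact counts, but only for candidate URLs.
    counts = defaultdict(int)
    for url in lines:
        if buckets[hash(url) & mask] >= mincount:
            counts[url] += 1

    return sorted(counts.items(), key=lambda kv: -kv[1])[:topn]
```

On a real log you would re-open the file between passes, as the original code does; memory stays bounded by the bucket array plus the (much smaller) candidate dictionary.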
David Garamond <lists at zara.6.isreserved.com> wrote in message news:<mailman.1044791875.2092.python-list at python.org>...
> I am processing referrer log files with Python. The log files are in
> ASCII text and line-based, similar to a webserver log but produced by
> our banner management (web bug, more precisely) software. Each line
> contains, among other things, a URL, a timestamp, and an IP address. One
> of the desired reports is the list of top 50 recurring URLs within a
> certain period (e.g. monthly or weekly).
>
> What I currently do is just read the log files line by line, parse each
> line into its elements, and store the URL string as the key of a
> dictionary and the number of occurrences as the value. In the end I would
> just go back to the dictionary and print the keys in descending order of
> the values.
>
> However, the number of URLs is large and some of the URLs are long
> (>100 characters). My process grows to more than 100MB in size. I
> already cut the URLs to a max of 80 characters before entering them into
> the dictionary, but it doesn't help much.
>
> Is there a more space-efficient algorithm to produce a list of top N
> recurring items from a large collection? I've thought about sorting the
> large log files externally, but am reluctant to do this because of the
> large size of the files and the need to frequently produce reports while
> new logs arrive.
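For reference, the dictionary approach the question describes looks roughly like this in current Python (the field layout assumed here is illustrative only); its memory use grows with the number of distinct URLs, which is exactly the problem the two-pass filter above avoids:

```python
from collections import Counter

def top_urls_naive(lines, topn=50):
    """One exact counter per distinct URL -- simple, but memory grows
    with the number of distinct URLs seen in the log."""
    counts = Counter()
    for line in lines:
        url = line.split()[0]  # assumes the URL is the first field
        counts[url] += 1
    return counts.most_common(topn)
```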