space-efficient top-N algorithm

David Garamond lists at zara.6.isreserved.com
Sun Feb 9 20:24:37 EST 2003


Rene Pijlman wrote:
>>However, the number of URLs is large and some of the URLs are long 
>>(>100 characters). My process grows to more than 100MB in size. I 
>>already cut the URLs to a max of 80 characters before entering them into 
>>the dictionary, but it doesn't help much.
> 
> You could consider hashing the URL to a digest, using the md5 or
> sha module for example. But then you would need to make a second
> pass over the log file to translate the top-50 digests to their
> URLs.

Yes, that's a great idea. Thanks! And I should have thought of it 
myself, since I remember reading the Google whitepaper some years ago 
where they used the same technique. (But then, since RAM and disk are so 
cheap nowadays, I seldom use my own memory anymore... :-) )

-- 
dave
