space-efficient top-N algorithm

Carlos Ribeiro cribeiro at mail.inet.com.br
Sun Feb 9 18:51:43 EST 2003


On Sunday 09 February 2003 16:56, David Garamond wrote:
> However, the number of URLs are large and some of the URLs are long
> (>100 characters). My process grows into more than 100MB in size. I
> already cut the URLs to a max of 80 characters before entering them into
>   the dictionary, but it doesn't help much.

--> If you don't need 100% precision for every URL, then you can store only 
hashes. There is a small probability of collisions, which means that 
occurrences of two different URLs may be counted under the same value. But the 
space savings are immense; it takes only 8 bytes to store each hash value, 
so your structure will be much smaller. With a good hash function the 
collision probability will be very low, and the end result will be about the 
same (although you can't guarantee it to be perfect).
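Just to make the idea concrete, here's a minimal sketch in Python; the log 
file name and the position of the URL field in each line are assumptions, so 
adjust them to your format:

import hashlib

counts = {}
with open("access.log") as log:          # assumed file name
    for line in log:
        url = line.split()[6]            # assumed: URL is the 7th field
        # an 8-byte digest prefix stands in for the full URL string
        key = hashlib.md5(url.encode()).digest()[:8]
        counts[key] = counts.get(key, 0) + 1

# top 10 hash values by count (note that the URLs themselves are not
# recoverable from the hashes alone)
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]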

--> If you need more precision, then you can adapt the technique above by 
doing two passes over the log files (a sketch of both passes follows the list):

1) in the first pass, use only the hashes, and return the top N*2 values. 
Think of these values as "candidates";
2) do a second pass, but filter the log entries so that only the 
"candidate" URLs are analyzed. Now you have a much smaller set of records, 
which takes much less memory, and you can keep an exact total for every URL.
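Here's a rough sketch of the two-pass approach, again with the file name, the 
field position and N as placeholder assumptions:

import hashlib

N = 10
FIELD = 6                                # assumed position of the URL field

def url_hash(url):
    return hashlib.md5(url.encode()).digest()[:8]

# Pass 1: count hashes only and keep the top N*2 as "candidates".
counts = {}
with open("access.log") as log:
    for line in log:
        key = url_hash(line.split()[FIELD])
        counts[key] = counts.get(key, 0) + 1
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
candidates = set(k for k, _ in ranked[:N * 2])

# Pass 2: exact counts, but only for lines whose hash is a candidate.
exact = {}
with open("access.log") as log:
    for line in log:
        url = line.split()[FIELD]
        if url_hash(url) in candidates:
            exact[url] = exact.get(url, 0) + 1

top_n = sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:N]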

The advantage of the method, as outlined above, is that it needs a fixed 
number of passes and greatly improves the precision of the results. In fact, 
it guarantees that the count for every URL returned is correct, but there is 
still a very small probability that some URL is missing from your top N list. 
That will only happen if your hash function is seriously broken, and it's 
pretty easy to detect, because the second pass will then match (much) more 
than N*2 distinct URLs (indicating lots of collisions on the first pass).

> Is there a more space-efficient algorithm to produce a list of top N
> recurring items from a large collection? I've thought about sorting the
> large log files externally, but reluctant to do this because of the
> large size of the files and the need to frequently produce reports while
> new logs arrive.

It's interesting that you thought about external sorting; there are many 
techniques used in external sorting programs that you could try. The 
'mergesort' algorithm is designed to work efficiently with external files, 
and it may be possible to borrow some tricks from mergesort's playbook for 
your program (see the sketch below). Please note that in any case you'll need 
more than one pass through the dataset; I can't imagine how you could 
possibly keep fewer records in memory while at the same time doing everything 
in a single pass.
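
For what it's worth, here's one way such a mergesort-style approach could 
look in Python; the chunk size and the URL field are assumptions, and 
heapq.merge does the k-way merge of the sorted runs:

import heapq
import itertools
import tempfile

FIELD = 6           # assumed position of the URL field in each log line
CHUNK = 100000      # URLs held in memory per sorted run

def sorted_runs(path):
    # Write sorted runs of URLs to temporary files, one run per CHUNK lines.
    runs = []
    with open(path) as log:
        while True:
            chunk = [line.split()[FIELD]
                     for line in itertools.islice(log, CHUNK)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(u + "\n" for u in chunk)
            run.seek(0)
            runs.append(run)
    return runs

def top_n(path, n=10):
    # Merge the runs lazily; equal URLs come out adjacent, so counting is
    # just a matter of measuring each group of consecutive duplicates.
    merged = (line.rstrip("\n") for line in heapq.merge(*sorted_runs(path)))
    counts = ((sum(1 for _ in grp), url)
              for url, grp in itertools.groupby(merged))
    return heapq.nlargest(n, counts)

The nice property is that memory use is bounded by the chunk size no matter 
how large the logs get, and newly arrived log files can simply be turned into 
additional runs before the merge.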


Carlos Ribeiro
cribeiro at mail.inet.com.br




