space-efficient top-N algorithm
Carlos Ribeiro
cribeiro at mail.inet.com.br
Sun Feb 9 18:51:43 EST 2003
On Sunday 09 February 2003 16:56, David Garamond wrote:
> However, the number of URLs are large and some of the URLs are long
> (>100 characters). My process grows into more than 100MB in size. I
> already cut the URLs to a max of 80 characters before entering them into
> the dictionary, but it doesn't help much.
--> If you don't need 100% precision for every URL, then you can store only
hashes. There is a small probability of collisions, which means that
occurrences of two different URLs will be mapped to the same value. But the
space savings are immense: it takes only 8 bytes to store each hash
value, so your structure will be much smaller. With a good hash function the
collision probability will be very low, and the end result will be about the
same (although you can't guarantee it to be perfect).
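To make the idea concrete, here is a minimal sketch of hash-only counting
(my own illustration, not code from the original post; it uses blake2b
truncated to 8 bytes as the "good hash", but any strong hash would do):

```python
import hashlib
from collections import Counter

def hash8(url):
    # 8-byte digest: a good hash makes collisions very unlikely,
    # and each key costs 8 bytes instead of an 80-character URL
    return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

def top_hashes(urls, n):
    # Count occurrences keyed by hash only; URLs themselves are never stored
    counts = Counter(hash8(u) for u in urls)
    return counts.most_common(n)  # list of (8-byte hash, count) pairs
```

Note that the result contains hashes, not URLs; recovering the actual URL
strings is what the second pass described below is for.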
--> If you need more precision, then you can adapt the technique above by
doing two passes over the log files:
1) In the first pass, use only the hashes, and return the top N*2 values.
Think of these values as "candidates".
2) Do a second pass, but filter the log entries so that only the
"candidate" URLs are analyzed. Now you have a much smaller set of records,
which takes much less memory, and you can keep exact totals for every URL.
The advantage of this method, as outlined above, is that it needs a fixed
number of passes, and that it greatly improves the precision of the results.
In fact, it guarantees that the count for every URL returned is correct;
there is still a very small probability that some URL is missing from
your top-N list, but that will only happen if your hash function is seriously
broken, and it's pretty easy to detect, because the second pass will yield
(much) more than N*2 distinct URLs (indicating lots of collisions in the
first pass).
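The two passes can be sketched roughly like this (again my own illustration;
`read_log` is a hypothetical callable that returns a fresh iterator of URLs,
so the log can be scanned twice):

```python
import hashlib
from collections import Counter

def hash8(url):
    # Same 8-byte hash as in the first pass
    return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

def top_n_two_pass(read_log, n):
    # Pass 1: count hashes only; keep the top N*2 hashes as "candidates"
    hash_counts = Counter(hash8(u) for u in read_log())
    candidates = {h for h, _ in hash_counts.most_common(n * 2)}
    # Pass 2: count exact totals, but only for candidate URLs,
    # so only a small set of full URL strings is ever held in memory
    exact = Counter(u for u in read_log() if hash8(u) in candidates)
    return exact.most_common(n)
```

A sanity check on the second pass would be to compare `len(exact)` against
N*2: a much larger value signals the collision problem mentioned above.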
> Is there a more space-efficient algorithm to produce a list of top N
> recurring items from a large collection? I've thought about sorting the
> large log files externally, but reluctant to do this because of the
> large size of the files and the need to frequently produce reports while
> new logs arrive.
It's interesting that you thought about external sorting; there are many
techniques used in external sorting programs that you could try. The
'mergesort' algorithm is designed to work efficiently with external
files, and it may be possible to borrow some tricks from mergesort for
your program. Please note that in any case you'll need more than one pass
through the dataset, though; I can't see how you could possibly store fewer
records in memory while at the same time doing everything in a single
pass.
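As one example of borrowing from mergesort: if you externally sort chunks of
the log into sorted runs, the runs can be merged lazily and counted with only
a small heap in memory. A hedged sketch (my illustration, using the standard
library's k-way merge; the "runs" here are assumed to be sorted iterables of
URLs):

```python
import heapq
import itertools

def top_n_from_sorted_runs(runs, n):
    # Lazily merge the sorted runs; equal URLs come out adjacent,
    # so each URL's total can be computed as its group streams past
    merged = heapq.merge(*runs)
    best = []  # min-heap of (count, url), never more than n entries
    for url, group in itertools.groupby(merged):
        count = sum(1 for _ in group)
        heapq.heappush(best, (count, url))
        if len(best) > n:
            heapq.heappop(best)  # drop the current smallest count
    return sorted(best, reverse=True)  # (count, url), largest first
```

This needs memory proportional to N plus one buffer per run, regardless of
how many distinct URLs the logs contain, at the cost of the sorting passes.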
Carlos Ribeiro
cribeiro at mail.inet.com.br