space-efficient top-N algorithm

Mon Feb 10 03:16:56 EST 2003

Carlos Ribeiro wrote:
   ...
> It *is* possible, however, to come up with a clever implementation that
> *statistically* selects the top N URLs, using the two heaps as auxiliar
> structures.

I shudder to think of the task of analyzing what this "statistically"
means here.  The dependency of what gets selected on the "bunching"
characteristics of the incoming sequence of data is frightening: I
wouldn't even know how to start characterizing the quantitative
aspects of the "bunching" and its distorting influence on selection.

It seems to me that, unless one can get professional advice from
some really top-notch statistician, this course of action is
fraught with far too much risk of systematic distortion.

> This algorithm is not deterministic; it cannot guarantee that you get the
> 'top N' elements. It depends on the size of the heaps, and also on the
> distribution of data. But it may be worth trying in some situations,
> specially if a deterministic approach turn out to be not viable.

If deterministic wasn't viable, I would at least try to pick a
non-deterministic one whose behavior I can analyze and explain.
It seems to me that the already-proposed idea of using a digest
to stand for each URL offers a better way out, though.

Alex