space-efficient top-N algorithm

Sun Feb 9 18:27:57 EST 2003

John Machin:
>I don't understand "quite efficient"; please explain. Looks to me like
>either (a) the URL is neither in the dictionary nor in the database or
>(b) the URL is in the dictionary and in the database. 

No no. Only the digest is in the dictionary, and the digest is
much shorter than the URL. The digest->URL mapping is in the
database.

Let me summarize in pseudocode...

Computation: 
    URL -> digest
Stored in memory:
    digest -> (count,urlSaved)
Stored in the database:
    digest -> URL

Action per record:
    digest(URL)[count] += 1
    if not digest(URL)[urlSaved]:
        store digest->URL in database
        digest(URL)[urlSaved] = True

Action per report:
    for all digests order by count where count > floor:
        get URL from database
        print URL, count

>Note that adding that bit adds not only an extra
>pointer to an int object 

Conceptually it is one bit. It can be encoded efficiently to
preserve memory, for example as the sign of the count.

The total amount of memory is approximately:

   number of URLs * 
       (sizeof(digest) + sizeof(count) + sizeof(bit))

... plus some overhead.

>Another point: you haven't mentioned the possibility that multiple
>distinct URLs may hash to the same digest.

Yes, good point, I forgot to mention that. This is the price to
pay for saving memory I guess, unless someone can come up with a
plan that avoids this lossiness and still saves memory.

-- 
René Pijlman

Wat wil jij leren?  http://www.leren.nl