space-efficient top-N algorithm

John Machin sjmachin at lexicon.net
Mon Feb 10 06:06:42 EST 2003


Rene Pijlman <reageer.in at de.nieuwsgroep> wrote in message news:<b7od4v09sprkl327vr9sn5d40tpnnl9euf at 4ax.com>...
> John Machin:
> >I don't understand "quite efficient"; please explain. Looks to me like
> >either (a) the URL is neither in the dictionary nor in the database or
> >(b) the URL is in the dictionary and in the database. 
> 
> No no. Only the digest is in the dictionary, and the digest is
> much shorter than the URL. The digest->URL mapping is in the
> database.

Yes yes yes. Given that you were ignoring collisions, URL == digest
for the purpose of my point, which was that the urlSaved bit is
redundant ... see below.

> Let me summarize in pseudocode...
> 
> Computation: 
>     URL -> digest
> Stored in memory:
>     digest -> (count,urlSaved)
> Stored in the database:
>     digest -> URL
> 
> Action per record:
>     digest(URL)[count] += 1
      # This will blow up the first time, because the key is not in
the
      # dict. You need to test if key in dict, which makes your extra
bit
      # redundant
      # And in any case what is digest(URL) that is subscriptable by
count????
>     if not digest(URL)[urlSaved]:
>         store digest->URL in database
>         digest(URL)[urlSaved] = True

Here's some valid Python:
   # action per record
   digest = some_hash(URL)
   if digest not in count_dict:
      store_in_database(digest, URL)
      count_dict[digest] = 1
   else:
      count_dict[digest] += 1
   # Look, Mum, no urlSaved

> 
> >Another point: you haven't mentioned the possibility that multiple
> >distinct URLs may hash to the same digest.
> 
> Yes, good point, I forgot to mention that. This is the price to
> pay for saving memory I guess, unless someone can come up with a
> plan that avoids this lossiness and still saves memory.

Sorry, I'm a bit old-fashioned, but this notion of jettisoning
accuracy to save memory is a bit of a mind-boggler. Just imagine
telling your CEO: "There is a non-zero chance that what you think is
the top referrer fubar.com really only had a count of 1 but
topsite.com with zillions of entries just happens to hash to the same
160 bit message digest as fubar so you never get to hear about
topsite. We did this just to save buying another 256 Mb memory module
for the computer ..."




More information about the Python-list mailing list