space-efficient top-N algorithm

John Machin sjmachin at lexicon.net
Sun Feb 9 16:27:05 EST 2003


Rene Pijlman <reageer.in at de.nieuwsgroep> wrote in message news:<uuhc4vsqeoff8db7l4fdqfcj5vi9keruht at 4ax.com>...
> 
> Another option would be to store your table in an external
> database. This effectively trades memory for disk space and
> runtime performance. Caching the data for frequently
> occurring URLs in memory is essential, I think.
> 
> Perhaps these ideas can be combined:
> - store digest->count in a dictionary in memory
> - store the digest->URL mapping in an external database
> 
> This can be quite efficient when you add one bit to the data in
> memory, to indicate that the URL for this digest has already
> been written to the database: digest->(count,urlSaved).
> 
> In that way, there are no database queries needed to process the
> log file. The algorithm only inserts a record in the database
> when a new URL is encountered and you only query the database to
> construct the report.
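
Before commenting, let me sketch the proposal as I read it. This is
a minimal sketch only; "counts", "process" and "db_insert" are my
names, and db_insert stands in for whatever external-database
interface is actually used:

    counts = {}   # digest -> (count, urlSaved)

    def process(digest, url):
        if digest in counts:
            count, url_saved = counts[digest]
            counts[digest] = (count + 1, url_saved)
        else:
            # new URL: insert the digest->URL record exactly once
            db_insert(digest, url)
            counts[digest] = (1, 1)   # count is 1, URL already saved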

I don't understand "quite efficient"; please explain. It looks to me
like either (a) the digest is in neither the dictionary nor the
database, or (b) it is in both, because the record is inserted at the
same moment the digest first enters the dictionary. The urlSaved bit
is therefore always 1 once the entry exists, so it seems redundant.
Note also that carrying the bit costs not only an extra pointer to an
int object for the "bit" (fortunately CPython caches small ints like
0 and 1, so no new int objects are created) but also a tuple object
to wrap up the (count, urlSaved) pair; and since tuples are
immutable, a new tuple must be built on every increment.
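
Dropping the bit gives the same behaviour with a plain integer
count, since dictionary membership alone already records whether the
URL has been written out (again a minimal sketch, with db_insert
standing in for the database layer):

    counts = {}   # digest -> count

    def process(digest, url):
        if digest in counts:
            counts[digest] += 1
        else:
            counts[digest] = 1        # first sighting of this digest
            db_insert(digest, url)    # write digest->URL exactly once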

Another point: you haven't mentioned the possibility that multiple
distinct URLs may hash to the same digest.
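
Whether that matters in practice depends on the width of the digest.
Here is my own back-of-the-envelope estimate, assuming a 128-bit
digest such as MD5 and using the standard birthday approximation:

    def collision_probability(n, bits=128):
        # Birthday approximation: P(some two of n distinct URLs
        # share a digest) ~= n*(n-1) / 2**(bits+1), valid while
        # the result is small.
        return float(n) * (n - 1) / 2 ** (bits + 1)

    print(collision_probability(10 ** 9))  # ~1.5e-21 for 1e9 URLs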

One curiosity about the OP's "problem" of a 100MB process size: how
many hours of developer time (for implementing and maintaining the
algorithm) does it take before the cost of, say, a 256MB memory chip
(including installation) is reached?



