space-efficient top-N algorithm

Terry Reedy tjreedy at
Mon Feb 10 16:20:01 CET 2003

> David Garamond <lists at> wrote in message
news:<mailman.1044791875.2092.python-list at>...
> > I am processing referrer log files with Python. The log files are
> > ASCII text and line-based, similar to a webserver log but produced
> > by our banner management (web bug, more precisely) software. Each
> > line contains, among other things, a URL, a timestamp, and an IP
> > address. One of the desired reports is the list of the top 50
> > recurring URLs within a period (e.g. monthly or weekly).

> > What I currently do is just read the log files line by line, parse
> > each into its elements, and store the URL string as the key of a
> > dictionary and the number of occurrences as the value. In the end I
> > would just go back to the dictionary and print the keys in
> > descending order of the values.
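The approach described above -- count into a dict, then sort the whole
thing -- might look like this in outline. The parsing here is a
placeholder (first whitespace-separated field taken as the URL); the
real field layout depends on the banner software's log format:

```python
from collections import defaultdict

def count_urls(lines):
    # Count how many times each URL occurs.  Taking split()[0] as the
    # URL is an assumption -- adjust to the actual log format.
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

def top_n_full_sort(counts, n=50):
    # Sort the entire dictionary by count, descending -- the approach
    # described in the quoted message.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```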

The distribution of counts should tend to be fairly consistent from
period to period: a few URLs with high counts, more with medium
counts, and lots with few counts, down to one.  With a bit of
experience, you should have a pretty good idea of how many counts are
required to be in the top 50.  So I would consider scanning the dict
with iteritems to extract candidates, aiming for, say, the top 100 --
i.e., keep items with a count above a threshold.  If you get too few,
or start to get too many, restart with an adjusted threshold.  Then
sort that reduced candidate list and take the top 50.
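A sketch of that threshold-and-restart idea in modern Python
(dict.items() in place of iteritems; the lo/hi candidate bounds and the
halving/doubling adjustment are arbitrary choices, not from the post):

```python
def top_candidates(counts, threshold, lo=60, hi=200, max_tries=20):
    # One linear scan keeps items at or above the threshold.  If the
    # candidate set comes out too small or too large, rescan with an
    # adjusted threshold; max_tries caps the number of restarts.
    cand = list(counts.items())
    for _ in range(max_tries):
        cand = [(url, n) for url, n in counts.items() if n >= threshold]
        if len(cand) < lo and threshold > 1:
            threshold = max(1, threshold // 2)   # too few: lower the bar
        elif len(cand) > hi:
            threshold *= 2                       # too many: raise the bar
        else:
            break
    return cand

def top_n_via_threshold(counts, n=50, threshold=10):
    # Sort only the reduced candidate list, not the whole dictionary.
    cand = top_candidates(counts, threshold)
    cand.sort(key=lambda kv: kv[1], reverse=True)
    return cand[:n]
```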

Terry J. Reedy
