sorted list vs dictionnary

Thu May 16 08:44:05 EDT 2002

"Alexandre Fayolle" <alf at logilab.fr> wrote in message
news:slrnae76u4.jdt.alf at virgo.logilab.fr...
> Hello everyone,
>
> I have a class that is able to process messages. I want to avoid
> processing the same message twice, so a solution is to maintain a
set
> of processed messages IDs, and to check if for the presence of the
ID
> before processing a document.
>
> I've found two ways (without using anything outside the standard
> library) of doing this:
>
>  * use a dictionnary with the ID as the key and None as the value,
and
>  check for the key with cache.has_key(msg.ID)
>  * use a sorted list and the bisect module to keep it sorted, and
check
>  for the key with
>  i = bisect.bisect_left(l,e)
>  if i<len(l) and l[i] == e:
>      pass
>
> I'd expect the first one to be faster, since the lookup of a key in
a
> dictionnary is O(1), and insertion is a dictionnary is roughly O(1)
too,
> whereas bisect_left is O(log(N)), and inserting in the middle of a
list
> is O(N).
>
> Could anyone confirm the speed issue, and maybe share some insights
about
> the memory efficiency of both solutions, especially when the number
of
> IDs get big (the order of magnitude is 50000)? I'm not too familiar
with the
> implementation of lists and dictionnary in the interpreter.

For the reasons you discerned, a dictionary with a dummy value is
currently the standard way to maintain a set.  Currently, using 1
instead of None as the dummy is faster since it avoids the lookup of
the name 'None' in the module and builtin dicts, but this will change
when 'None' is made a keyword (maybe in 2.3).  I believe internal
memory consumption should be near enough the same to not be a worry.

Terry J. Reedy