PEP274 variation & comparison

Mon May 20 12:46:05 EDT 2002

My basic problem is to quickly populate an existing dictionary from lists
I already have. The two current ways to do this are a for loop or use dict()
to create a new dictionary and then do existing_dict.update(new_dict),
both of these have alot of overhead.

old_dict = {'foo':1, 'bar':1}
l = ['joe', 'jack', 'mary', 'bill', 'geoff', 'dmitri', 'fima', 'boris']
# for loop
for (i) in l:
  old_dict[i] = 1
# update
  old_dict.update(dict([ (i, 1) for i in l ]))
# New! defined below
  old_dict.populate(l)

The update method is more than 50% slower than the for loop, because we
have to create lots of tuple temporaries to pass to dict().

I first thought PEP274 dictionary comprehensions was what I needed, and it
will make the tuple temporaries go away.  There will be a small amount of
overhead for the temporary new dict we pass to update().  [I emailed
barry w/ a shorter commentary a few days ago, but haven't heard his
comments yet]

What I really want is Perl style assignment where I can specify a list
as the key, and another list or a scalar on the right hand side
%myhash = ()
@myhash{@l} = 1 # assigns names as hash keys, all values are 1
@myhash{@l} = (0..(length(@l))) # assigns name keys to 0 .. N values

This breaks the current rules in Python pretty badly.  All of a sudden
lists are allowed as the key of a dict, but they do something completely
different.  And I'd like to be able to use a tuple in the same way as a
list, but tuples are already legal dict keys with very different behavior.
So I decided to create the populate() method which allows you to do stuff like

# setup
old_dict = {'foo':1, 'bar':1}
l = ['joe', 'jack', 'mary', 'bill', 'geoff', 'dmitri', 'fima', 'boris']

# assigns keys of names in list, with value set to true
old_dict.populate(l)

# assigns keys of names in list, with value being their index in the list
old_dict.populate(l, range(len(l)))

# like the true example, but every value is 'HiMom'
old_dict.populate(l, 'HiMom') 

lastly, you might actually want to assign the same list or tuple as the
value for every key, so we add an override flag that forces lists in the
second argument to be treated like 'HiMom'

# every name has the value of the list range(len(l))
old_dict.populate(l, range(len(l)), 1)

After thinking about it for awhile the populate() method isn't as similar
to dict comprehensions as you might think.  dict comprehensions are a quick
way to produce a new list, populate() is a quick way to add to an existing
dict from a list.  The overlap would be in the update() method - a dict
comprehension would be much faster than creating tuples to pass to dict()
which can then be used in an old_dict.update(dict comprehension here)
They would both crush iterating over a list to insert into a dictionary
in speed.

Dict comprehensions would likely be faster for creating new dictionaries,
the populate() method needs at least one real python list (or tuple) to
hold the keys.  dict comprehensions skip that temporary.  If you are talking
about existing lists, then that temporary list has already been paid for.
If you are talking about modifying an existing dict then dict comprehensions
have a cost of one temporary dict to be created for the call to update()
as well as the cost of inserting into a dictionary that will be discarded.
(doesn't hurt too much, the hashes for the keys are calc'd for the temporary
and don't have to be recalc'd when we copy the temp dict to the old_dict)

Below is a context diff against Python 2.2.1 that implements populate()
Below that is a short benchmark script that shows the difference between
the for loop, update, and populate methods.  for takes six times as long as
populate, update takes 1.75 as long as for (all those temporary tuples!)
speed ups for the other ways to call populate are about the same

enjoy,

-jackdied

*** Python-2.2.1/Objects/dictobject.c   Thu Mar 28 15:36:50 2002
--- Python-2.2.1-altered/Objects/dictobject.c   Mon May 20 13:31:49 2002
***************
*** 1508,1513 ****
--- 1508,1603 ----
        return val;
  }

+ static PyObject *
+ dict_populate(register dictobject *mp, PyObject *args)
+ {
+   PyObject *key_list;
+   PyObject *arg_val = NULL, *list_as_scalar = NULL;
+   PyObject *key, *val;
+   int i, imax;
+   long hash;
+
+   if (!PyArg_ParseTuple(args, "O|OO:populate", &key_list, &arg_val, &list_as_scalar))
+     return NULL;
+
+   if (PyList_Check(key_list) || PyTuple_Check(key_list)) /* we would like to use PySequence_Check, but it returns true on strings! */
+     imax = PySequence_Fast_GET_SIZE(key_list);
+   else {
+     PyErr_SetString(PyExc_TypeError,
+                     "populate() requires a list or tuple as its first argument");
+     return NULL;
+   }
+
+   /* resize the mp dict, shamelessly stolen from PyDict_Merge() */
+   if ((mp->ma_fill + imax)*3 >= (mp->ma_mask+1)*2) {
+     if (dictresize(mp, (mp->ma_used + imax)*3/2) != 0)
+       return NULL;
+   }
+
+   /* Three possibilities */
+   /* One, second argument is a tuple or list AND list_as_scalar isn't set and true */
+   if (arg_val != NULL && ((PyList_Check(arg_val) || PyTuple_Check(arg_val)) &&
+                           (list_as_scalar == NULL || !PyObject_IsTrue(list_as_scalar)))) {
+     int j, jmax;
+
+     jmax = PySequence_Fast_GET_SIZE(arg_val);
+
+     for (i = 0, j = 0; i < imax; i++, j++) {
+       /* If we are out of values in arg_val, start padding with Py_None
+          alternatively we could do (the less readable/maintainable)
+          for (;i < jmax && i < imax;)
+            key = key_list[i]
+            val = arg_val[j]
+          for (; i < imax;)
+            key = key_list[i]
+            val = Py_None
+       */
+       if (j >= jmax)
+         val = Py_None;
+       else {
+         val = PySequence_Fast_GET_ITEM(arg_val, j);
+       }
+
+       key = PySequence_Fast_GET_ITEM(key_list, j);
+
+       Py_INCREF(key);
+       Py_INCREF(val);
+       hash = PyObject_Hash(key);
+       if (hash == -1) {
+         PyErr_SetString(PyExc_TypeError, "key is an unhashable"); /* needs better error ? */
+         return NULL;
+       }
+       insertdict(mp, key, hash, val);
+     }
+   }
+   /* Two, second argument is NULL (set all values to true) */
+   /* Three, set all values to the second arg */
+   else {
+     if (arg_val == NULL)
+       val = PyInt_FromLong(1); /* refcount starts at one */
+     else
+       val = arg_val;
+
+     for (i = 0; i < imax; i++) {
+       key = PySequence_Fast_GET_ITEM(key_list, i);
+       Py_INCREF(key);
+       Py_INCREF(val);
+       hash = PyObject_Hash(key);
+       if (hash == -1) {
+         PyErr_SetString(PyExc_TypeError, "key is an unhashable"); /* needs better error ? */
+         return NULL;
+       }
+       insertdict(mp, key, hash, val);
+     }
+
+     if (arg_val == NULL) {
+       Py_DECREF(val); /* reduce refcount on our FromLong */
+     }
+   }
+
+   Py_INCREF(Py_None);
+   return Py_None;
+ }

  static PyObject *
  dict_clear(register dictobject *mp)
***************
*** 1659,1664 ****
--- 1749,1757 ----
  static char setdefault_doc__[] =
  "D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if not D.has_key(k)";

+ static char populate_doc__[] =
+ "D.populate(l[,lv,b]) -> D[l[0..N]] = 1 if no second argument, D[l[0..N]] = lv[0..N] if lv is a list or tuple, D[l[0..N]] = lv otherwise, or
if b is true";
+
  static char popitem__doc__[] =
  "D.popitem() -> (k, v), remove and return some (key, value) pair as a\n\
  2-tuple; but raise KeyError if D is empty";
***************
*** 1697,1702 ****
--- 1790,1797 ----
         get__doc__},
        {"setdefault",  (PyCFunction)dict_setdefault,   METH_VARARGS,
         setdefault_doc__},
+       {"populate",  (PyCFunction)dict_populate,       METH_VARARGS,
+        populate_doc__},
        {"popitem",     (PyCFunction)dict_popitem,      METH_NOARGS,
         popitem__doc__},
        {"keys",        (PyCFunction)dict_keys,         METH_NOARGS,

import time
mp = {}
l = range(100)
cnt = 10000

def benchmark(count, func):
    beg = time.clock()
    for (i) in range(count):
        mp.clear()
        func()
    print str(func), str(time.clock() - beg)
    return

def forloop():
    for (g) in l:
        mp[g] = 1
    return

def pop():
    mp.populate(l)
    return

def update():
    mp.update(dict([ (i, 1) for i in l ]))
    return

benchmark(cnt, forloop)
benchmark(cnt, pop)
benchmark(cnt, update)

"""
My results
<function pop at 0x81961d4>     0.67
<function forloop at 0x819619c> 3.65
<function update at 0x819620c>  7.37
"""

-- 
_______________________________________________
Sign-up for your own FREE Personalized E-mail at Email.com
http://www.email.com/?sr=signup