[Python-Dev] undesireable unpickle behavior, proposed fix

Jesse Noller jnoller at gmail.com
Tue Jan 27 15:23:06 CET 2009


On Tue, Jan 27, 2009 at 4:49 AM, Jake McGuire <jake at youtube.com> wrote:
> Instance attribute names are normally interned - this is done in
> PyObject_SetAttr (among other places).  Unpickling (in pickle and cPickle)
> directly updates __dict__ on the instance object.  This bypasses the
> interning so you end up with many copies of the strings representing your
> attribute names, which wastes a lot of space, both in RAM and in pickles of
> sequences of objects created from pickles.  Note that the native python
> memcached client uses pickle to serialize objects.
>
>>>> import pickle
>>>> class C(object):
> ...   def __init__(self, x):
> ...     self.long_attribute_name = x
> ...
>>>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
>>>> pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL))
> 3658
>>>> len(pickle.dumps([C(None) for i in range(100)],
>>>> pickle.HIGHEST_PROTOCOL))
> 1441
>>>>
>
> Interning the strings on unpickling makes the pickles smaller, and at least
> for cPickle actually makes unpickling sequences of many objects slightly
> faster.  I have included proposed patches to cPickle.c and pickle.py, and
> would appreciate any feedback.
>
> dhcp-172-31-170-32:~ mcguire$ diff -u
> Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c
> --- Downloads/Python-2.4.3/Modules/cPickle.c    2004-07-26
> 22:22:33.000000000 -0700
> +++ cPickle.c   2009-01-26 23:30:31.000000000 -0800
> @@ -4258,6 +4258,8 @@
>        PyObject *state, *inst, *slotstate;
>        PyObject *__setstate__;
>        PyObject *d_key, *d_value;
> +       PyObject *name;
> +       char * key_str;
>        int i;
>        int res = -1;
>
> @@ -4319,8 +4321,24 @@
>
>                i = 0;
>                while (PyDict_Next(state, &i, &d_key, &d_value)) {
> -                       if (PyObject_SetItem(dict, d_key, d_value) < 0)
> -                               goto finally;
> +                       /* normally the keys for instance attributes are
> +                          interned.  we should try to do that here. */
> +                       if (PyString_CheckExact(d_key)) {
> +                               key_str = PyString_AsString(d_key);
> +                               name = PyString_FromString(key_str);
> +                               if (! name)
> +                                       goto finally;
> +
> +                               PyString_InternInPlace(&name);
> +                               if (PyObject_SetItem(dict, name, d_value) <
> 0) {
> +                                       Py_DECREF(name);
> +                                       goto finally;
> +                               }
> +                               Py_DECREF(name);
> +                       } else {
> +                               if (PyObject_SetItem(dict, d_key, d_value) <
> 0)
> +                                       goto finally;
> +                       }
>                }
>                Py_DECREF(dict);
>        }
>
> dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py
> pickle.py
> --- Downloads/Python-2.4.3/Lib/pickle.py        2009-01-27
> 01:41:43.000000000 -0800
> +++ pickle.py   2009-01-27 01:41:31.000000000 -0800
> @@ -1241,7 +1241,15 @@
>             state, slotstate = state
>         if state:
>             try:
> -                inst.__dict__.update(state)
> +                d = inst.__dict__
> +                try:
> +                    for k,v in state.items():
> +                        d[intern(k)] = v
> +                # keys in state don't have to be strings
> +                # don't blow up, but don't go out of our way
> +                except TypeError:
> +                    d.update(state)
> +
>             except RuntimeError:
>                 # XXX In restricted execution, the instance's __dict__
>                 # is not accessible.  Use the old way of unpickling
>

Hi Jake,

You should really post this to bugs.python.org as an enhancement so we
can track the discussion there.

-jesse


More information about the Python-Dev mailing list