[Python-Dev] undesirable unpickle behavior, proposed fix

Jake McGuire jake at youtube.com
Tue Jan 27 10:49:29 CET 2009


Instance attribute names are normally interned; this is done in
PyObject_SetAttr (among other places).  Unpickling (in both pickle and
cPickle) updates __dict__ on the instance object directly.  This
bypasses the interning, so you end up with many copies of the strings
representing your attribute names.  That wastes a lot of space, both
in RAM and in pickles of sequences of objects that were themselves
created from pickles.  Note that the native Python memcached client
uses pickle to serialize objects.

 >>> import pickle
 >>> class C(object):
...   def __init__(self, x):
...     self.long_attribute_name = x
...
 >>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
...   pickle.HIGHEST_PROTOCOL)) for i in range(100)],
...   pickle.HIGHEST_PROTOCOL))
3658
 >>> len(pickle.dumps([C(None) for i in range(100)],
...   pickle.HIGHEST_PROTOCOL))
1441
 >>>
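
The size difference follows from how the pickler deduplicates objects:
its memo is keyed on object identity, so two equal but distinct strings
are each written out in full, while a repeated reference to one string
object is written once and then referred back to.  A minimal sketch of
that effect, assuming Python 3 (the session above is Python 2) and
using made-up variable names:

```python
import pickle

# Build two equal strings at run time so they are distinct objects
# (str.join returns a fresh, non-interned string on each call).
a = "".join(["long_attribute", "_name"])
b = "".join(["long_attribute", "_name"])

# Same object twice: the second occurrence becomes a short memo reference.
shared = pickle.dumps([a, a], pickle.HIGHEST_PROTOCOL)

# Equal but distinct objects: the text is serialized twice.
distinct = pickle.dumps([a, b], pickle.HIGHEST_PROTOCOL)

print(len(shared), len(distinct))
```

The exact byte counts depend on the pickle protocol, but the pickle of
the distinct copies should come out larger than the shared one.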

Interning the strings on unpickling makes the pickles smaller and, at
least for cPickle, actually makes unpickling sequences of many objects
slightly faster.  I have included proposed patches to cPickle.c and
pickle.py, and would appreciate any feedback.

dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c
--- Downloads/Python-2.4.3/Modules/cPickle.c	2004-07-26 22:22:33.000000000 -0700
+++ cPickle.c	2009-01-26 23:30:31.000000000 -0800
@@ -4258,6 +4258,8 @@
  	PyObject *state, *inst, *slotstate;
  	PyObject *__setstate__;
  	PyObject *d_key, *d_value;
+	PyObject *name;
+	char * key_str;
  	int i;
  	int res = -1;

@@ -4319,8 +4321,24 @@

  		i = 0;
  		while (PyDict_Next(state, &i, &d_key, &d_value)) {
-			if (PyObject_SetItem(dict, d_key, d_value) < 0)
-				goto finally;
+			/* normally the keys for instance attributes are
+			   interned.  we should try to do that here. */
+			if (PyString_CheckExact(d_key)) {
+				key_str = PyString_AsString(d_key);
+				name = PyString_FromString(key_str);
+				if (! name)
+					goto finally;
+
+				PyString_InternInPlace(&name);
+				if (PyObject_SetItem(dict, name, d_value) < 0) {
+					Py_DECREF(name);
+					goto finally;
+				}
+				Py_DECREF(name);
+			} else {
+				if (PyObject_SetItem(dict, d_key, d_value) < 0)
+					goto finally;
+			}
  		}
  		Py_DECREF(dict);
  	}

dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py pickle.py
--- Downloads/Python-2.4.3/Lib/pickle.py	2009-01-27 01:41:43.000000000 -0800
+++ pickle.py	2009-01-27 01:41:31.000000000 -0800
@@ -1241,7 +1241,15 @@
              state, slotstate = state
          if state:
              try:
-                inst.__dict__.update(state)
+                d = inst.__dict__
+                try:
+                    for k,v in state.items():
+                        d[intern(k)] = v
+                # keys in state don't have to be strings
+                # don't blow up, but don't go out of our way
+                except TypeError:
+                    d.update(state)
+
              except RuntimeError:
                  # XXX In restricted execution, the instance's __dict__
                  # is not accessible.  Use the old way of unpickling
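
The same idea can be tried on a single class without patching the
standard library.  A rough sketch, assuming Python 3 (where the builtin
intern moved to sys.intern); the helper name intern_keys is made up for
illustration and is not part of pickle:

```python
import sys


def intern_keys(state):
    """Return a copy of `state` whose plain-str keys are interned.

    Hypothetical helper. Keys that are not exactly `str` are passed
    through unchanged, since sys.intern only accepts plain strings.
    """
    result = {}
    for key, value in state.items():
        if type(key) is str:
            key = sys.intern(key)
        result[key] = value
    return result


class C(object):
    def __init__(self, x):
        self.long_attribute_name = x

    def __setstate__(self, state):
        # Called by pickle on load; store attributes under interned names.
        self.__dict__.update(intern_keys(state))
```

Defining __setstate__ this way applies the interning per class, which
is narrower than the patches above but needs no change to pickle itself.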


