undesireable unpickle behavior, proposed fix
import pickle class C(object): ... def __init__(self, x): ... self.long_attribute_name = x ... len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
len(pickle.dumps([C(None) for i in range(100)],
Instance attribute names are normally interned - this is done in PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle) directly updates __dict__ on the instance object. This bypasses the interning so you end up with many copies of the strings representing your attribute names, which wastes a lot of space, both in RAM and in pickles of sequences of objects created from pickles. Note that the native python memcached client uses pickle to serialize objects. pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 3658 pickle.HIGHEST_PROTOCOL)) 1441
Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback. dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Modules/ cPickle.c cPickle.c --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26 22:22:33.000000000 -0700 +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800 @@ -4258,6 +4258,8 @@ PyObject *state, *inst, *slotstate; PyObject *__setstate__; PyObject *d_key, *d_value; + PyObject *name; + char * key_str; int i; int res = -1; @@ -4319,8 +4321,24 @@ i = 0; while (PyDict_Next(state, &i, &d_key, &d_value)) { - if (PyObject_SetItem(dict, d_key, d_value) < 0) - goto finally; + /* normally the keys for instance attributes are + interned. we should try to do that here. */ + if (PyString_CheckExact(d_key)) { + key_str = PyString_AsString(d_key); + name = PyString_FromString(key_str); + if (! name) + goto finally; + + PyString_InternInPlace(&name); + if (PyObject_SetItem(dict, name, d_value) < 0) { + Py_DECREF(name); + goto finally; + } + Py_DECREF(name); + } else { + if (PyObject_SetItem(dict, d_key, d_value) < 0) + goto finally; + } } Py_DECREF(dict); } dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/ pickle.py pickle.py --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27 01:41:43.000000000 -0800 +++ pickle.py 2009-01-27 01:41:31.000000000 -0800 @@ -1241,7 +1241,15 @@ state, slotstate = state if state: try: - inst.__dict__.update(state) + d = inst.__dict__ + try: + for k,v in state.items(): + d[intern(k)] = v + # keys in state don't have to be strings + # don't blow up, but don't go out of our way + except TypeError: + d.update(state) + except RuntimeError: # XXX In restricted execution, the instance's __dict__ # is not accessible. Use the old way of unpickling
On Tue, Jan 27, 2009 at 4:49 AM, Jake McGuire <jake@youtube.com> wrote:
Instance attribute names are normally interned - this is done in PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle) directly updates __dict__ on the instance object. This bypasses the interning so you end up with many copies of the strings representing your attribute names, which wastes a lot of space, both in RAM and in pickles of sequences of objects created from pickles. Note that the native python memcached client uses pickle to serialize objects.
import pickle class C(object): ... def __init__(self, x): ... self.long_attribute_name = x ... len(pickle.dumps([pickle.loads(pickle.dumps(C(None), pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 3658 len(pickle.dumps([C(None) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 1441
Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback.
dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26 22:22:33.000000000 -0700 +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800 @@ -4258,6 +4258,8 @@ PyObject *state, *inst, *slotstate; PyObject *__setstate__; PyObject *d_key, *d_value; + PyObject *name; + char * key_str; int i; int res = -1;
@@ -4319,8 +4321,24 @@
i = 0; while (PyDict_Next(state, &i, &d_key, &d_value)) { - if (PyObject_SetItem(dict, d_key, d_value) < 0) - goto finally; + /* normally the keys for instance attributes are + interned. we should try to do that here. */ + if (PyString_CheckExact(d_key)) { + key_str = PyString_AsString(d_key); + name = PyString_FromString(key_str); + if (! name) + goto finally; + + PyString_InternInPlace(&name); + if (PyObject_SetItem(dict, name, d_value) < 0) { + Py_DECREF(name); + goto finally; + } + Py_DECREF(name); + } else { + if (PyObject_SetItem(dict, d_key, d_value) < 0) + goto finally; + } } Py_DECREF(dict); }
dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py pickle.py --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27 01:41:43.000000000 -0800 +++ pickle.py 2009-01-27 01:41:31.000000000 -0800 @@ -1241,7 +1241,15 @@ state, slotstate = state if state: try: - inst.__dict__.update(state) + d = inst.__dict__ + try: + for k,v in state.items(): + d[intern(k)] = v + # keys in state don't have to be strings + # don't blow up, but don't go out of our way + except TypeError: + d.update(state) + except RuntimeError: # XXX In restricted execution, the instance's __dict__ # is not accessible. Use the old way of unpickling
Hi Jake, You should really post this to bugs.python.org as an enhancement so we can track the discussion there. -jesse
On Tue, Jan 27, 2009 at 6:23 AM, Jesse Noller <jnoller@gmail.com> wrote:
On Tue, Jan 27, 2009 at 4:49 AM, Jake McGuire <jake@youtube.com> wrote:
Instance attribute names are normally interned - this is done in PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle) directly updates __dict__ on the instance object. This bypasses the interning so you end up with many copies of the strings representing your attribute names, which wastes a lot of space, both in RAM and in pickles of sequences of objects created from pickles. Note that the native python memcached client uses pickle to serialize objects.
import pickle class C(object): ... def __init__(self, x): ... self.long_attribute_name = x ... len(pickle.dumps([pickle.loads(pickle.dumps(C(None), pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 3658 len(pickle.dumps([C(None) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 1441
Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback.
dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26 22:22:33.000000000 -0700 +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800 @@ -4258,6 +4258,8 @@ PyObject *state, *inst, *slotstate; PyObject *__setstate__; PyObject *d_key, *d_value; + PyObject *name; + char * key_str; int i; int res = -1;
@@ -4319,8 +4321,24 @@
i = 0; while (PyDict_Next(state, &i, &d_key, &d_value)) { - if (PyObject_SetItem(dict, d_key, d_value) < 0) - goto finally; + /* normally the keys for instance attributes are + interned. we should try to do that here. */ + if (PyString_CheckExact(d_key)) { + key_str = PyString_AsString(d_key); + name = PyString_FromString(key_str); + if (! name) + goto finally; + + PyString_InternInPlace(&name); + if (PyObject_SetItem(dict, name, d_value) < 0) { + Py_DECREF(name); + goto finally; + } + Py_DECREF(name); + } else { + if (PyObject_SetItem(dict, d_key, d_value) < 0) + goto finally; + } } Py_DECREF(dict); }
dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py pickle.py --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27 01:41:43.000000000 -0800 +++ pickle.py 2009-01-27 01:41:31.000000000 -0800 @@ -1241,7 +1241,15 @@ state, slotstate = state if state: try: - inst.__dict__.update(state) + d = inst.__dict__ + try: + for k,v in state.items(): + d[intern(k)] = v + # keys in state don't have to be strings + # don't blow up, but don't go out of our way + except TypeError: + d.update(state) + except RuntimeError: # XXX In restricted execution, the instance's __dict__ # is not accessible. Use the old way of unpickling
Hi Jake,
You should really post this to bugs.python.org as an enhancement so we can track the discussion there.
-jesse
Seconded, with eagerness -- interning attribute names when unpickling makes a lot of sense! -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback.
Please submit patches always to the bug tracker. On the proposed change: While it is fairly unintrusive, I would like to propose a different approach - pickle interned strings special. The marshal module already uses this approach, and it should extend to pickle (although it would probably require a new protocol). On pickling, inspect each string and check whether it is interned. If so, emit a different code, and record it into the object id dictionary. On a second occurrence of the string, only pickle a backward reference. (Alternatively, check whether pickling the same string a second time would be more compact). On unpickling, support the new code to intern the result strings; subsequent references to it will go to the standard backreferencing algorithm. Regards, Martin
On Tue, Jan 27, 2009 at 10:43 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback.
Please submit patches always to the bug tracker.
On the proposed change: While it is fairly unintrusive, I would like to propose a different approach - pickle interned strings special. The marshal module already uses this approach, and it should extend to pickle (although it would probably require a new protocol).
On pickling, inspect each string and check whether it is interned. If so, emit a different code, and record it into the object id dictionary. On a second occurrence of the string, only pickle a backward reference. (Alternatively, check whether pickling the same string a second time would be more compact).
On unpickling, support the new code to intern the result strings; subsequent references to it will go to the standard backreferencing algorithm.
Hm. This would change the pickling format though. Wouldn't just interning (short) strings on unpickling be simpler? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Hm. This would change the pickling format though. Wouldn't just interning (short) strings on unpickling be simpler?
Sure - that's what Jake had proposed. However, it is always difficult to select which strings to intern - his heuristics (IIUC) is to intern all strings that appear as dictionary keys. Whether this is good enough, I don't know. In particular, it might intern very large strings that aren't identifiers at all. Regards, Martin
On Tue, Jan 27, 2009 at 11:40 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Hm. This would change the pickling format though. Wouldn't just interning (short) strings on unpickling be simpler?
Sure - that's what Jake had proposed. However, it is always difficult to select which strings to intern - his heuristics (IIUC) is to intern all strings that appear as dictionary keys. Whether this is good enough, I don't know. In particular, it might intern very large strings that aren't identifiers at all.
Just set a size limit, e.g. 30 or 100. It's just a heuristic. I believe somewhere in Python itself I intern string literals if they are reasonably short and fit the pattern of an identifier; I'd worry that the pattern matching would slow down unpickling more than the expected benefit though, so perhaps just a size test would be better. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Just set a size limit, e.g. 30 or 100. It's just a heuristic. I believe somewhere in Python itself I intern string literals if they are reasonably short and fit the pattern of an identifier; I'd worry that the pattern matching would slow down unpickling more than the expected benefit though, so perhaps just a size test would be better.
Ok. So, Jake, it's back to my original request - please submit this to the tracker (preferably along with test cases). Regards, Martin
On Jan 27, 2009, at 11:40 AM, Martin v. Löwis wrote:
Hm. This would change the pickling format though. Wouldn't just interning (short) strings on unpickling be simpler?
Sure - that's what Jake had proposed. However, it is always difficult to select which strings to intern - his heuristics (IIUC) is to intern all strings that appear as dictionary keys. Whether this is good enough, I don't know. In particular, it might intern very large strings that aren't identifiers at all.
I may have misunderstood how unpickling works, but I believe that my path only interns strings that are keys in a dictionary used to populate an instance. This is very similar to how instance creation and modification works in Python now. The only difference is if you set an attribute via "inst.__dict__['attribute_name'] = value" then 'attribute_name' will not be automatically interned, but if you pickle the instance, 'attribute_name' will be interned on unpickling. There may be cases where users specifically go through __dict__ to avoid interning attribute names, but I would be surprised to hear about it and very interested in talking to the person who did that. Creating a new pickle protocol to handle this case seems excessive... -jake
On Jan 27, 2009, at 12:39 PM, Martin v. Löwis wrote:
I may have misunderstood how unpickling works
Perhaps I have misunderstood your patch. Posting it to Rietveld might also be useful.
It is not immediately clear to me how Rietveld works. But I have created an issue on tracker: http://bugs.python.org/issue5084 Another vaguely related change would be to store string and unicode objects in the pickler memo keyed as themselves rather than their object ids. Depending on the data set, you can have many copies of the same string, e.g. "application/octet-stream". This may marginally increase memory usage during pickling, depending on the data being pickled and the way in which the code was written. I'm happy to write this up if people are interested... -jake
On Tue, Jan 27, 2009 at 5:16 PM, Jake McGuire <jake@youtube.com> wrote:
Another vaguely related change would be to store string and unicode objects in the pickler memo keyed as themselves rather than their object ids.
That wouldn't be difficult to do--i.e., simply add a type check in Pickler.memoize and another in Pickler.save(). But I am not sure if that would be a good idea, since you would end up hashing every string pickled. And, that would probably be expensive if you are pickling for long strings. -- Alexandre
participants (5)
-
"Martin v. Löwis"
-
Alexandre Vassalotti
-
Guido van Rossum
-
Jake McGuire
-
Jesse Noller