Pickler/Unpickler API clarification
I'm working on some performance patches for cPickle, and one of the bigger wins so far has been replacing the Pickler's memo dict with a custom hashtable (and hence removing memo's getters and setters). In looking over this, Jeffrey Yasskin commented that this would break anyone who was accessing the memo attribute.

I've found a few examples of code using the memo attribute ([1], [2], [3]), and there are probably more out there, but the memo attribute doesn't look like part of the API to me. It's only documented in http://docs.python.org/library/pickle.html as "you used to need this before Python 2.3, but don't anymore". However: I don't believe you should ever need this attribute.

The usages of memo I've seen break down into two camps: clearing the memo, and wanting to explicitly populate the memo with predefined values. Clearing the memo is recommended as part of reusing Pickler objects, but I can't fathom when you would want to reuse a Pickler *without* clearing the memo. Reusing the Pickler without clearing the memo will produce pickles that are, as best I can see, invalid -- at least, pickletools.dis() rejects this, which is the closest thing we have to a validator. Explicitly setting memo values has the same problem: an easy, very brittle way to produce invalid data.

So my questions are these:

1) Should Pickler/Unpickler objects automatically clear their memos when dumping/loading?
2) Is memo an intentionally exposed, supported part of the Pickler/Unpickler API, despite the lack of documentation and tests?

Thanks,
Collin

[1] - http://google.com/codesearch/p?hl=en#Qx8E-7HUBTk/trunk/google/appengine/api/memcache/__init__.py&q=lang:py%20%5C.memo
[2] - http://google.com/codesearch/p?hl=en#M-DDI-lCOgE/lib/python2.4/site-packages/cvs2svn_lib/primed_pickle.py&q=lang:py%20%5C.memo
[3] - http://google.com/codesearch/p?hl=en#l_w_cA4dKMY/AtlasAnalysis/2.0.3-LST-1/PhysicsAnalysis/PyAnalysis/PyAnalysisUtils/python/root_pickle.py&q=lang:py%20pick.*%5C.memo%5Cb
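The invalidity Collin describes is easy to reproduce. The sketch below uses Python 3's pickle module for illustration (cPickle behaves analogously): dumping the same object twice through one Pickler makes the second pickle reference a memo entry that the second stream never defines, so pickletools.dis() rejects it on its own.

```python
import io
import pickle
import pickletools

buf = io.BytesIO()
p = pickle.Pickler(buf)
shared = ["payload"]

p.dump(shared)            # first pickle: stores `shared` in the memo
first_end = buf.tell()
p.dump(shared)            # second pickle: emits a GET into the stale memo

second = buf.getvalue()[first_end:]
try:
    # The second pickle, taken alone, references a memo entry it never
    # stored; pickletools refuses it. (out=... just hides the listing.)
    pickletools.dis(second, out=io.StringIO())
except ValueError as exc:
    print("pickletools rejected it:", exc)
```

Note that `pickle.loads(second)` fails the same way; the pair of pickles is only loadable through a single Unpickler that retains its memo across both load() calls.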
On Thu, Mar 5, 2009 at 12:07 PM, Collin Winter <collinw@gmail.com> wrote:
I'm working on some performance patches for cPickle, and one of the bigger wins so far has been replacing the Pickler's memo dict with a custom hashtable (and hence removing memo's getters and setters). In looking over this, Jeffrey Yasskin commented that this would break anyone who was accessing the memo attribute.
I've found a few examples of code using the memo attribute ([1], [2], [3]), and there are probably more out there, but the memo attribute doesn't look like part of the API to me. It's only documented in http://docs.python.org/library/pickle.html as "you used to need this before Python 2.3, but don't anymore". However: I don't believe you should ever need this attribute.
The usages of memo I've seen break down into two camps: clearing the memo, and wanting to explicitly populate the memo with predefined values. Clearing the memo is recommended as part of reusing Pickler objects, but I can't fathom when you would want to reuse a Pickler *without* clearing the memo. Reusing the Pickler without clearing the memo will produce pickles that are, as best I can see, invalid -- at least, pickletools.dis() rejects this, which is the closest thing we have to a validator.
I can explain this, as I invented this behavior. The use case was to have a long-lived session between a client and a server which were communicating repeatedly using pickles. The idea was that values that had been transferred once wouldn't have to be sent across the wire again -- they could just be referenced. This was a bad idea (*), and I'd be happy to ban it -- but we'd probably have to bump the pickle protocol version in order to maintain backwards compatibility.
Explicitly setting memo values has the same problem: an easy, very brittle way to produce invalid data.
Agreed, this is just preposterous. It was never part of the plan.
So my questions are these: 1) Should Pickler/Unpickler objects automatically clear their memos when dumping/loading?
Alas, there could be backwards compatibility issues. Bumping the protocol could mitigate this.
2) Is memo an intentionally exposed, supported part of the Pickler/Unpickler API, despite the lack of documentation and tests?
The exposition is unintentional but for historic reasons we can't just remove it. :-(
__________
(*) http://code.google.com/p/googleappengine/issues/detail?id=417

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
This was a bad idea (*), and I'd be happy to ban it -- but we'd probably have to bump the pickle protocol version in order to maintain backwards compatibility.
If you're talking about multiple calls to dump() on the same pickler, it might be a bad idea for a network connection, but I don't see anything wrong with using it on a file, and I find it useful to do so sometimes. Banning it would be excessive, IMO.
The exposition is unintentional but for historic reasons we can't just remove it. :-(
A compromise might be to provide a memo attribute that returns a wrapper around the underlying cache -- maybe with only a clear() method if that's all you want to support. -- Greg
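Greg's compromise might look like the following minimal sketch. The `MemoView` name and the clear()-only surface are assumptions for illustration, not an existing API; the example drives it with CPython's pure-Python pickler (`pickle._Pickler`), whose memo is a plain dict.

```python
import io
import pickle

class MemoView:
    """Restricted wrapper around a pickler's memo: exposes clear() only."""

    def __init__(self, memo):
        self._memo = memo

    def clear(self):
        self._memo.clear()

# Usage sketch: hand callers the view instead of the raw memo dict.
p = pickle._Pickler(io.BytesIO())   # pure-Python pickler; memo is a dict
view = MemoView(p.memo)
p.dump([1, 2, 3])
view.clear()                        # supported
# view[5] = ...                     # no item access: internals stay hidden
```

The point of the wrapper is that the underlying storage (a dict today, a custom hashtable after Collin's patch) stops being part of the observable API.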
On Thu, Mar 5, 2009 at 1:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Guido van Rossum wrote:
This was a bad idea (*), and I'd be happy to ban it -- but we'd probably have to bump the pickle protocol version in order to maintain backwards compatibility.
If you're talking about multiple calls to dump() on the same pickler, it might be a bad idea for a network connection, but I don't see anything wrong with using it on a file, and I find it useful to do so sometimes. Banning it would be excessive, IMO.
I don't think I was thinking of that when I first designed pickle but the use case makes some sense. I still wish we could ban it or somehow make it *not* the default behavior; the bug in the App Engine bug I referenced before was introduced by an experienced developer who wasn't aware of this behavior and was simply trying to avoid unnecessarily creating a new pickler for each call.
The exposition is unintentional but for historic reasons we can't just remove it. :-(
A compromise might be to provide a memo attribute that returns a wrapper around the underlying cache -- maybe with only a clear() method if that's all you want to support.
Then it'd be better to have a method clear_memo() on pickle objects. Perhaps we should do the opposite, and have a separate API for reuse *without* clearing the memo? <pickler>.dump_reusing_memo(<value>) and <unpickler>.load_reusing_memo().

--Guido van Rossum (home page: http://www.python.org/~guido/)
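A clear_memo() of this sort did in fact land on the C pickler, and Python 3's pickle.Pickler still exposes it. A sketch of reusing one Pickler while keeping every record independently loadable (the offset bookkeeping is just for illustration):

```python
import io
import pickle

buf = io.BytesIO()
p = pickle.Pickler(buf)

offsets = [0]
for record in (["alpha"], ["beta"], ["gamma"]):
    p.dump(record)
    p.clear_memo()        # each record becomes a self-contained pickle
    offsets.append(buf.tell())

data = buf.getvalue()
# Because the memo was cleared between dumps, any record can be loaded
# on its own, in any order.
middle = pickle.loads(data[offsets[1]:offsets[2]])
print(middle)             # ['beta']
```

Without the clear_memo() call, only sequential load() calls through one Unpickler would work, which is exactly the trap the App Engine bug fell into.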
Collin Winter wrote:
Reusing the Pickler without clearing the memo will produce pickles that are, as best I can see, invalid
I'm not sure what you mean by "reusing the pickler" here, and how it can produce an invalid pickle. I think what the docs mean by it is continuing to pickle objects to the same file, but in a logically separate block that doesn't share any references with the previous one, e.g.:

    pickle obj1
    pickle obj2
    ---clear memo---
    pickle obj3

The whole thing is still a valid pickle containing 3 objects, whether the memo is cleared at any point or not, and can be unpickled using 3 corresponding unpickle calls to a single unpickler.
1) Should Pickler/Unpickler objects automatically clear their memos when dumping/loading?
If you mean should every call to Pickler.dump() or Unpickler.load() clear the memo first, definitely *NOT*. It's explicitly part of the specification that you can make multiple calls to dump() to build up a single pickle that shares state, as long as you unpickle it using a corresponding number of load() calls.
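The specified behavior Greg describes can be shown concretely: with paired dump()/load() calls on a single Pickler and Unpickler, shared state survives across the calls, including object identity.

```python
import io
import pickle

shared = {"config": 1}
buf = io.BytesIO()
p = pickle.Pickler(buf)
p.dump(shared)
p.dump(["record", shared])   # second dump refers to `shared` via the memo

buf.seek(0)
u = pickle.Unpickler(buf)
first = u.load()
second = u.load()
assert second[1] is first    # one shared object comes back, not a copy
```

Auto-clearing the memo on every dump() would silently break this identity-sharing contract, which is why it can't simply be made the default.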
2) Is memo an intentionally exposed, supported part of the Pickler/Unpickler API, despite the lack of documentation and tests?
I think the 2.4 and later docs make it clear that it's no longer considered part of the public API, if it ever was. If seeding the memo is considered a legitimate need, an API could be provided for doing that. -- Greg
Collin Winter wrote:
[...] I've found a few examples of code using the memo attribute ([1], [2], [3]) [...]
As author of [2] (current version here [4]) I can tell you my reason. cvs2svn has to store a vast number of small objects in a database, then read them in random order. I spent a lot of time optimizing this part of the code because it is crucial for the overall performance when converting large CVS repositories. The objects are not all of the same class and sometimes contain other objects, so it is convenient to use pickling instead of, say, marshaling.

It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.

So I "prime" the picklers/unpicklers by pickling then unpickling a "primer" that contains the classes that I know will appear, and storing the resulting memos once in the database. Then for each record I create a new pickler/unpickler and initialize its memo to the "primer"'s memo before using it to read the actual object. This removes a lot of redundancy across database records.

I only prime my picklers/unpicklers with the classes. But note that the same technique could be used for any repeated subcomponents. This would have the added advantage that all of the unpickled instances would share copies of the objects that appear in the primer, which could be a semantic advantage and a significant savings in RAM in addition to the space and processing time advantages described above. It might even be a useful feature to the "shelve" module.
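Michael's priming technique can be sketched roughly as below, using CPython's pure-Python pickler (`pickle._Pickler`), whose memo is a plain dict. A long shared string stands in for the primer's classes (memoization works by object identity either way, and a string keeps the sketch self-contained, since pickling classes requires them to be importable); `dump_record`/`load_record` are illustrative names, not cvs2svn's actual API.

```python
import io
import pickle

# Stand-in for the classes/subobjects that appear in every record.
SHARED = "some.very.long.shared.token." * 40

# Step 1: pickle the primer once; keep both its bytes and the memo.
primer_buf = io.BytesIO()
primer_pickler = pickle._Pickler(primer_buf, protocol=2)
primer_pickler.dump([SHARED])
PRIMER_BYTES = primer_buf.getvalue()
PRIMER_MEMO = primer_pickler.memo.copy()

def dump_record(obj):
    """Pickle one record with a fresh pickler whose memo is pre-populated."""
    buf = io.BytesIO()
    p = pickle._Pickler(buf, protocol=2)
    p.memo = PRIMER_MEMO.copy()     # shared objects are already memoized
    p.dump(obj)
    return buf.getvalue()

def load_record(record_bytes):
    """Prime a fresh unpickler by loading the primer, then read the record."""
    u = pickle._Unpickler(io.BytesIO(PRIMER_BYTES + record_bytes))
    u.load()                        # populates the unpickler's memo
    return u.load()

rec = dump_record(("record-17", SHARED))
# The record stores only a memo reference to SHARED, not its 1120 bytes.
print(len(PRIMER_BYTES), len(rec))
assert load_record(rec) == ("record-17", SHARED)
```

The fragility Collin points out is visible here: the record bytes are meaningless without the primer, and the whole scheme depends on the memo being a readable, writable dict.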
So my questions are these: 1) Should Pickler/Unpickler objects automatically clear their memos when dumping/loading? 2) Is memo an intentionally exposed, supported part of the Pickler/Unpickler API, despite the lack of documentation and tests?
For my application, either of the following alternatives would also suffice:

- The ability to pickle picklers and unpicklers themselves (including their memos). This is, of course, awkward because they are hard-wired to files.
- Picklers and unpicklers could have get_memo() and set_memo() methods that return an opaque (but pickleable) memo object.

In other words, I don't need to muck around inside the memo object; I just need to be able to save and restore it. Please note that the memo for a pickler is not equal to the memo of the corresponding unpickler.

A similar effect could *almost* be obtained without accessing the memos by saving the pickled primer itself in the database. The unpickler would be primed by using it to load the primer before loading the record of interest. But AFAIK there is no way to prime new picklers, because there is no guarantee that pickling the same primer twice will result in the same id->object mapping in the pickler's memo.

Michael
[2] - http://google.com/codesearch/p?hl=en#M-DDI-lCOgE/lib/python2.4/site-packages/cvs2svn_lib/primed_pickle.py&q=lang:py%20%5C.memo
[4] - http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/cvs2svn_lib/serializer...
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.
If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084 Regards Antoine.
Antoine Pitrou wrote:
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.
If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084
If I understand correctly, this would not help:

- On writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.
- On reading, the strings are only used to look up the class. Therefore they are garbage collected almost immediately.

This is a different situation than that of attribute names, which are stored persistently as the keys in the instance's __dict__.

Michael
Le vendredi 06 mars 2009 à 13:44 +0100, Michael Haggerty a écrit :
Antoine Pitrou wrote:
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.
If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084
If I understand correctly, this would not help:
- on writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.
Then why did you say that "the space for these strings constituted a large fraction of the database size", if they are already shared? Are your objects so tiny that even the space taken by the pointer to the type name grows the size of the database significantly?
Antoine Pitrou wrote:
Le vendredi 06 mars 2009 à 13:44 +0100, Michael Haggerty a écrit :
Antoine Pitrou wrote:
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.

If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084

If I understand correctly, this would not help:
- on writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.
Then why did you say that "the space for these strings constituted a large fraction of the database size", if they are already shared? Are your objects so tiny that even the space taken by the pointer to the type name grows the size of the database significantly?
Sorry for the confusion. I thought you were suggesting the change to help the more typical use case, when a single Pickler is used for a lot of data. That use case will not be helped by interning the class __name__ and __module__ strings, for the reasons given in my previous email.

In my case, the strings are shared via the Pickler memoizing mechanism because I pre-populate the memo (using the API that the OP proposes to remove), so your suggestion won't help my current code, either. It was before I implemented the pre-populated memoizer that "the space for these strings constituted a large fraction of the database size". But your suggestion wouldn't help that case, either.

Here are the main use cases:

1. Saving and loading one large record. A class's __name__ string is the same string object every time it is retrieved, so it only needs to be stored once and the Pickler memo mechanism works. Similarly for the class's __module__ string.

2. Saving and loading lots of records sequentially. Provided a single Pickler is used for all records and its memo is never cleared, this works just as well as case 1.

3. Saving and loading lots of records in random order, as for example in the shelve module. It is not possible to reuse a Pickler with retained memo, because the Unpickler might not encounter objects in the right order. There are two subcases:

   a. Use a clean Pickler/Unpickler object for each record. In this case the __name__ and __module__ of a class will appear once in each record in which the class appears. (This is the case regardless of whether they are interned.) On reading, the __name__ and __module__ are only used to look up the class, so interning them won't help. It is thus impossible to avoid wasting a lot of space in the database.

   b. Use a Pickler/Unpickler with a preset memo for each record (my unorthodox technique). In this case the class __name__ and __module__ will be memoized in the shared memo, so in other records only their ID needs to be stored (in fact, only the ID of the class object itself). This allows the database to be smaller, but does not have any effect on the RAM usage of the loaded objects.

If the OP's proposal is accepted, 3b will become impossible. The technique seems not to be well known, so maybe it doesn't need to be supported. It would mean some extra work for me on the cvs2svn project though :-(

Michael
On Fri, Mar 6, 2009 at 10:01 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
Antoine Pitrou wrote:
Le vendredi 06 mars 2009 à 13:44 +0100, Michael Haggerty a écrit :
Antoine Pitrou wrote:
Michael Haggerty <mhagger <at> alum.mit.edu> writes:
It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.

If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084

If I understand correctly, this would not help:
- on writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.
Then why did you say that "the space for these strings constituted a large fraction of the database size", if they are already shared? Are your objects so tiny that even the space taken by the pointer to the type name grows the size of the database significantly?
Sorry for the confusion. I thought you were suggesting the change to help the more typical use case, when a single Pickler is used for a lot of data. That use case will not be helped by interning the class __name__ and __module__ strings, for the reasons given in my previous email.
In my case, the strings are shared via the Pickler memoizing mechanism because I pre-populate the memo (using the API that the OP proposes to remove), so your suggestion won't help my current code, either. It was before I implemented the pre-populated memoizer that "the space for these strings constituted a large fraction of the database size". But your suggestion wouldn't help that case, either.
Here are the main use cases:
1. Saving and loading one large record. A class's __name__ string is the same string object every time it is retrieved, so it only needs to be stored once and the Pickler memo mechanism works. Similarly for the class's __module__ string.
2. Saving and loading lots of records sequentially. Provided a single Pickler is used for all records and its memo is never cleared, this works just as well as case 1.
3. Saving and loading lots of records in random order, as for example in the shelve module. It is not possible to reuse a Pickler with retained memo, because the Unpickler might not encounter objects in the right order. There are two subcases:
a. Use a clean Pickler/Unpickler object for each record. In this case the __name__ and __module__ of a class will appear once in each record in which the class appears. (This is the case regardless of whether they are interned.) On reading, the __name__ and __module__ are only used to look up the class, so interning them won't help. It is thus impossible to avoid wasting a lot of space in the database.
b. Use a Pickler/Unpickler with a preset memo for each record (my unorthodox technique). In this case the class __name__ and __module__ will be memoized in the shared memo, so in other records only their ID needs to be stored (in fact, only the ID of the class object itself). This allows the database to be smaller, but does not have any effect on the RAM usage of the loaded objects.
If the OP's proposal is accepted, 3b will become impossible. The technique seems not to be well known, so maybe it doesn't need to be supported. It would mean some extra work for me on the cvs2svn project though :-(
Talking it over with Guido, support for the memo attribute will have to stay. I shall add it back to my patches. Collin
On Fri, Mar 6, 2009 at 5:45 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
If these strings are not interned, then perhaps they should be. There is a similar optimization proposal (w/ patch) for attribute names: http://bugs.python.org/issue5084
If I understand correctly, that would help with unpickling, but wouldn't solve Michael's problem as, without memo, each pickle would still need to store a copy. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
Antoine Pitrou wrote:
If these strings are not interned, then perhaps they should be.
I think this is a different problem. Even if the strings are interned, if you start with a fresh pickler each time, you get a copy of the strings in each pickle. What he wants is to share strings between different pickles. -- Greg
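Greg's distinction is easy to see with today's pickle module. The sketch below uses collections.OrderedDict as a stand-in for any application class: a fresh pickler per record repeats the class-name strings in every pickle, while one long-lived pickler stores them once and refers back via the memo.

```python
import collections
import io
import pickle

# Fresh pickler per record: every pickle carries the module and class
# name strings, interned in memory or not.
records = [pickle.dumps(collections.OrderedDict(n=i)) for i in range(3)]
assert all(b"OrderedDict" in r for r in records)

# One long-lived pickler: the class name is written once, then referenced
# through the shared memo.
buf = io.BytesIO()
p = pickle.Pickler(buf)
for i in range(3):
    p.dump(collections.OrderedDict(n=i))
assert buf.getvalue().count(b"OrderedDict") == 1
```

This is the on-disk redundancy Michael's memo-priming eliminates while still using a fresh pickler per record.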
Michael Haggerty wrote:
A similar effect could *almost* be obtained without accessing the memos by saving the pickled primer itself in the database. The unpickler would be primed by using it to load the primer before loading the record of interest. But AFAIK there is no way to prime new picklers, because there is no guarantee that pickling the same primer twice will result in the same id->object mapping in the pickler's memo.
Would it help if, when creating a pickler or unpickler, you could specify another pickler or unpickler whose memo is used to initialise the memo of the new one? Then you could keep the pickler that you used to pickle the primer and "fork" new picklers off it, and similarly with the unpicklers. -- Greg
Greg Ewing wrote:
Michael Haggerty wrote:
A similar effect could *almost* be obtained without accessing the memos by saving the pickled primer itself in the database. The unpickler would be primed by using it to load the primer before loading the record of interest. But AFAIK there is no way to prime new picklers, because there is no guarantee that pickling the same primer twice will result in the same id->object mapping in the pickler's memo.
Would it help if, when creating a pickler or unpickler, you could specify another pickler or unpickler whose memo is used to initialise the memo of the new one?
Then you could keep the pickler that you used to pickle the primer and "fork" new picklers off it, and similarly with the unpicklers.
Typically, the purpose of a database is to persist data across program runs. So typically, your suggestion would only help if there were a way to persist the primed Pickler across runs. (The primed Unpickler is not quite so important because it can be primed by reading a pickle of the primer, which in turn can be stored somewhere in the DB.)

In the particular case of cvs2svn, each of our databases is in fact written in a single pass, and then in later passes only read, not written. So I suppose we could do entirely without pickleable Picklers, if they were copyable within a single program run. But that constraint would make the feature even less general.

Michael
On Sat, Mar 7, 2009 at 8:04 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
Typically, the purpose of a database is to persist data across program runs. So typically, your suggestion would only help if there were a way to persist the primed Pickler across runs.
I haven't followed all this, but isn't it at least possible to conceive of the primed pickler as being recreated from scratch from constant data each run?
(The primed Unpickler is not quite so important because it can be primed by reading a pickle of the primer, which in turn can be stored somewhere in the DB.)
In the particular case of cvs2svn, each of our databases is in fact written in a single pass, and then in later passes only read, not written. So I suppose we could do entirely without pickleable Picklers, if they were copyable within a single program run. But that constraint would make the feature even less general.
Being copyable is mostly equivalent to being picklable, but it's probably somewhat weaker because it's easier to define it as a pointer copy for some types that aren't easily picklable.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
On Sat, Mar 7, 2009 at 8:04 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
Typically, the purpose of a database is to persist data across program runs. So typically, your suggestion would only help if there were a way to persist the primed Pickler across runs.
I haven't followed all this, but isn't it at least possible to conceive of the primed pickler as being recreated from scratch from constant data each run?
If there were a guarantee that pickling the same data would result in the same memo ID -> object mapping, that would also work. But that doesn't seem to be a realistic guarantee to make. AFAIK the memo IDs are integers chosen consecutively in the order that objects are pickled, which doesn't seem so bad. But compound objects are a problem. For example, when pickling a map, the map entries would have to be pickled in an order that remains consistent across runs (and even across Python versions). Even worse, all user-written __getstate__() methods would have to return exactly the same result, even across program runs.
(The primed Unpickler is not quite so important because it can be primed by reading a pickle of the primer, which in turn can be stored somewhere in the DB.)
In the particular case of cvs2svn, each of our databases is in fact written in a single pass, and then in later passes only read, not written. So I suppose we could do entirely without pickleable Picklers, if they were copyable within a single program run. But that constraint would make the feature even less general.
Being copyable is mostly equivalent to being picklable, but it's probably somewhat weaker because it's easier to define it as a pointer copy for some types that aren't easily picklable.
Indeed. And pickling the memo should not present any fundamental problems, since by construction it can only contain pickleable objects. Michael
Michael Haggerty wrote:
Typically, the purpose of a database is to persist data across program runs. So typically, your suggestion would only help if there were a way to persist the primed Pickler across runs.
I don't think you need to be able to pickle picklers. In the case in question, the master pickler would be primed by pickling all the shared classes, and the resulting pickle would be stored in the database. When unpickling, the master unpickler would be primed by unpickling the shared pickle. -- Greg
participants (6)

- Antoine Pitrou
- Collin Winter
- Daniel Stutzbach
- Greg Ewing
- Guido van Rossum
- Michael Haggerty