[Python-Dev] Pickler/Unpickler API clarification

Sat Mar 7 00:45:09 CET 2009

On Fri, Mar 6, 2009 at 10:01 AM, Michael Haggerty <mhagger at alum.mit.edu> wrote:
> Antoine Pitrou wrote:
>> On Friday, 6 March 2009 at 13:44 +0100, Michael Haggerty wrote:
>>> Antoine Pitrou wrote:
>>>> Michael Haggerty <mhagger <at> alum.mit.edu> writes:
>>>>> It is easy to optimize the pickling of instances by giving them
>>>>> __getstate__() and __setstate__() methods.  But the pickler still
>>>>> records the type of each object (essentially, the name of its class) in
>>>>> each record.  The space for these strings constituted a large fraction
>>>>> of the database size.
>>>> If these strings are not interned, then perhaps they should be.
>>>> There is a similar optimization proposal (w/ patch) for attribute names:
>>>> http://bugs.python.org/issue5084
>>> If I understand correctly, this would not help:
>>>
>>> - on writing, the strings are identical anyway, because they are read
>>> out of the class's __name__ and __module__ fields.  Therefore the
>>> Pickler's usual memoizing behavior will prevent the strings from being
>>> written more than once.
>>
>> Then why did you say that "the space for these strings constituted a
>> large fraction of the database size", if they are already shared? Are
>> your objects so tiny that even the space taken by the pointer to the
>> type name grows the size of the database significantly?
>
> Sorry for the confusion.  I thought you were suggesting the change to
> help the more typical use case, when a single Pickler is used for a lot
> of data.  That use case will not be helped by interning the class
> __name__ and __module__ strings, for the reasons given in my previous email.
>
> In my case, the strings are shared via the Pickler memoizing mechanism
> because I pre-populate the memo (using the API that the OP proposes to
> remove), so your suggestion won't help my current code, either.  It was
> before I implemented the pre-populated memoizer that "the space for
> these strings constituted a large fraction of the database size".  But
> your suggestion wouldn't help that case, either.
>
> Here are the main use cases:
>
> 1. Saving and loading one large record.  A class's __name__ string is
> the same string object every time it is retrieved, so it only needs to
> be stored once and the Pickler memo mechanism works.  Similarly for the
> class's __module__ string.
>
> 2. Saving and loading lots of records sequentially.  Provided a single
> Pickler is used for all records and its memo is never cleared, this
> works just as well as case 1.
>
> 3. Saving and loading lots of records in random order, as for example in
> the shelve module.  It is not possible to reuse a Pickler with retained
> memo, because the Unpickler might not encounter objects in the right
> order.  There are two subcases:
>
>   a. Use a clean Pickler/Unpickler object for each record.  In this
> case the __name__ and __module__ of a class will appear once in each
> record in which the class appears.  (This is the case regardless of
> whether they are interned.)  On reading, the __name__ and __module__ are
> only used to look up the class, so interning them won't help.  It is
> thus impossible to avoid wasting a lot of space in the database.
>
>   b. Use a Pickler/Unpickler with a preset memo for each record (my
> unorthodox technique).  In this case the class __name__ and __module__
> will be memoized in the shared memo, so in other records only their ID
> needs to be stored (in fact, only the ID of the class object itself).
> This allows the database to be smaller, but does not have any effect on
> the RAM usage of the loaded objects.
>
> If the OP's proposal is accepted, 3b will become impossible.  The
> technique seems not to be well known, so maybe it doesn't need to be
> supported.  It would mean some extra work for me on the cvs2svn project
> though :-(

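For concreteness, here is a rough sketch of the difference between
cases 2 and 3a above. It is not code from cvs2svn. The use of
types.SimpleNamespace as a stand-in record type, the attribute name
"rev", and the choice of protocol 2 are all assumptions made purely
for illustration.

```python
import io
import pickle
from types import SimpleNamespace

PROTOCOL = 2  # assumption: a protocol that already existed in 2009

# Hypothetical records; SimpleNamespace stands in for a real record class.
records = [SimpleNamespace(rev=n) for n in range(3)]

# Case 2: one Pickler whose memo is never cleared.  The class reference
# is written in full only in the first record; later records refer to it
# by memo index.
stream = io.BytesIO()
shared = pickle.Pickler(stream, PROTOCOL)
for rec in records:
    shared.dump(rec)
sequential_size = len(stream.getvalue())

# Case 3a: a clean Pickler per record.  Every record carries the module
# and class name again.
independent = [pickle.dumps(rec, PROTOCOL) for rec in records]
independent_size = sum(len(blob) for blob in independent)

# The case 2 stream can only be read back by a single Unpickler that
# consumes the records in the order they were written.
reader = pickle.Unpickler(io.BytesIO(stream.getvalue()))
restored = [reader.load() for _ in records]
```

With the shared Pickler the class name appears once in the whole
stream, whereas each independently pickled record repeats it, which is
the overhead described in 3a.
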
I talked it over with Guido: support for the memo attribute will have
to stay. I shall add it back to my patches.

Collin
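
A minimal sketch of the preset-memo technique from case 3b, for
readers who have not seen it. This is an illustration under stated
assumptions, not the cvs2svn implementation: the helper names
dump_record and load_record, the SHARED list, and the SimpleNamespace
record type are hypothetical. It relies on the memo attribute of
pickle.Pickler and pickle.Unpickler being assignable from another
instance's memo, and it uses protocol 2 because that protocol writes
explicit memo indexes into each record.

```python
import io
import pickle
from types import SimpleNamespace

PROTOCOL = 2  # writes explicit memo indexes (BINPUT/BINGET opcodes)

# Objects that most records refer to.  This list must stay alive and
# unchanged for as long as the primed memos are in use.
SHARED = [SimpleNamespace]

# Prime one Pickler and one Unpickler with the same data so that both
# memos agree on which index refers to which shared object.
_primer = io.BytesIO()
primed_pickler = pickle.Pickler(_primer, PROTOCOL)
primed_pickler.dump(SHARED)
primed_unpickler = pickle.Unpickler(io.BytesIO(_primer.getvalue()))
primed_unpickler.load()


def dump_record(obj):
    """Pickle one record with a fresh Pickler that starts from the primed memo."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, PROTOCOL)
    # In CPython's C implementation this assignment copies the memo,
    # so the primed Pickler itself is left untouched.
    pickler.memo = primed_pickler.memo
    pickler.dump(obj)
    return buf.getvalue()


def load_record(blob):
    """Unpickle one record with a fresh Unpickler that starts from the primed memo."""
    unpickler = pickle.Unpickler(io.BytesIO(blob))
    unpickler.memo = primed_unpickler.memo
    return unpickler.load()


# Records can be written once and read back in any order.
records = [SimpleNamespace(rev=n, path="trunk") for n in range(3)]
blobs = [dump_record(rec) for rec in records]
restored = [load_record(blob) for blob in reversed(blobs)]
```

Each blob refers to the class by memo index instead of spelling out
its module and name, so it is shorter than the output of a plain
pickle.dumps() call with the same protocol. The cost is that a blob is
unreadable without an Unpickler primed from exactly the same data.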