[Python-Dev] Pickler/Unpickler API clarification

Michael Haggerty mhagger at alum.mit.edu
Fri Mar 6 10:57:00 CET 2009


Collin Winter wrote:
> [...] I've found a few examples of code using the memo attribute ([1], [2],
> [3]) [...]

As author of [2] (current version here [4]) I can tell you my reason.
cvs2svn has to store a vast number of small objects in a database, then
read them in random order.  I spent a lot of time optimizing this part
of the code because it is crucial for the overall performance when
converting large CVS repositories.  The objects are not all of the same
class and sometimes contain other objects, so it is convenient to use
pickling instead of, say, marshaling.

It is easy to optimize the pickling of instances by giving them
__getstate__() and __setstate__() methods.  But the pickler still
records the type of each object (essentially, the name of its class) in
each record.  The space for these strings constituted a large fraction
of the database size.

So I "prime" the picklers/unpicklers by pickling then unpickling a
"primer" that contains the classes that I know will appear, and storing
the resulting memos once in the database.  Then for each record I create
a new pickler/unpickler and initialize its memo to the "primer"'s memo
before using it to write or read the actual object.  This removes a lot of
redundancy across database records.

I only prime my picklers/unpicklers with the classes.  But note that the
same technique could be used for any repeated subcomponents.  This would
have the added advantage that all of the unpickled instances would share
a single copy of each object that appears in the primer, which could be a
semantic advantage and a significant savings in RAM in addition to the
space and processing time advantages described above.  It might even be
a useful feature to the "shelve" module.

> So my questions are these:
> 1) Should Pickler/Unpickler objects automatically clear their memos
> when dumping/loading?
> 2) Is memo an intentionally exposed, supported part of the
> Pickler/Unpickler API, despite the lack of documentation and tests?

For my application, either of the following alternatives would also suffice:

- The ability to pickle picklers and unpicklers themselves (including
their memos).  This is, of course, awkward because they are hard-wired
to files.

- Picklers and unpicklers could have get_memo() and set_memo() methods
that return an opaque (but pickleable) memo object.  In other words, I
don't need to muck around inside the memo object; I just need to be able
to save and restore it.

Please note that the memo for a pickler is not equal to the memo of the
corresponding unpickler.

A similar effect could *almost* be obtained without accessing the memos
by saving the pickled primer itself in the database.  The unpickler
would be primed by using it to load the primer before loading the record
of interest.  But AFAIK there is no way to prime new picklers, because
there is no guarantee that pickling the same primer twice will result in
the same id->object mapping in the pickler's memo.
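
The unpickler half of that alternative might look like the sketch below, which never touches `Unpickler.memo`. It does not solve the pickler side: the record here is written by the same pickler that wrote the primer, which is only possible because that pickler is still alive. `load_with_primer` and the `Fraction` stand-in class are illustrative assumptions.

```python
import io
import pickle
from fractions import Fraction

PROTOCOL = 2  # assumption for this sketch
PRIMER = (Fraction,)  # stand-in for the classes known to recur

# Pickle the primer once; these bytes are what would be stored in the database.
buf = io.BytesIO()
pickler = pickle.Pickler(buf, PROTOCOL)
pickler.dump(PRIMER)
primer_bytes = buf.getvalue()

# Reuse the same pickler, whose memo already holds the primer, for a record.
buf.seek(0)
buf.truncate()
pickler.dump(Fraction(3, 4))
record_bytes = buf.getvalue()


def load_with_primer(primer_bytes, record_bytes):
    """Prime a fresh unpickler by loading the primer, then load the record."""
    unpickler = pickle.Unpickler(io.BytesIO(primer_bytes + record_bytes))
    unpickler.load()  # the result is discarded; only the memo side effect matters
    return unpickler.load()


restored = load_with_primer(primer_bytes, record_bytes)
```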

Michael

> [2] - http://google.com/codesearch/p?hl=en#M-DDI-lCOgE/lib/python2.4/site-packages/cvs2svn_lib/primed_pickle.py&q=lang:py%20%5C.memo
[4] http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/cvs2svn_lib/serializer.py?view=markup

