[Python-Dev] pickling of large arrays
Guido van Rossum
guido@python.org
Thu, 20 Feb 2003 09:37:35 -0500
> This question is related to PEP 307, "Extensions to the pickle protocol",
> http://www.python.org/peps/pep-0307.html .
>
> Apparently the new Pickle "protocol 2" provides a mechanism for
> avoiding large temporaries, but only for lists and dicts (section
> "Pickling of large lists and dicts" near the end). I am wondering if
> the new protocol could also help us to eliminate large temporaries when
> pickling Boost.Python extension classes.
>
> We wrote an open source C++ array library with Boost.Python bindings.
> For pickling we use the __getstate__, __setstate__ protocol. As it
> stands pickling involves converting the arrays to Python strings,
> similar to what is done in Numpy. There are two mechanisms:
>
> 1. "single buffered":
>
> For numeric types (int, long, double, etc.) a Python string is
> allocated based on an upper estimate for the required size
> (PyString_FromStringAndSize). The entire numeric array is converted
> directly to that string. Finally the Python string is resized
> (_PyString_Resize).
> With this mechanism there are 2 copies of the array in memory:
> - the original array and
> - the Python string.
>
> 2. "double buffered":
>
> For some user-defined element types it is very difficult to estimate
> an upper limit for the size of the string representation. Therefore
> the array is first converted to a dynamically growing C++
> std::string, which is then copied to a Python string.
> With this mechanism there are 3 copies of the array in memory:
> - the original array,
> - the std::string, and
> - the Python string.
>
> For very large arrays the memory overhead can be a limiting factor.
> Could the new protocol 2 help us in some way?
Probably, if you can switch from __getstate__ to __reduce__.
__reduce__ can return a tuple of up to 5 items now; the last two are
iterators (one for list-ish types, one for dict-ish types). If you
return an iterator that iterates over all the pieces of your array,
the array will be reconstituted at the other end using repeated calls
to obj.extend() or obj.append(). There's no need to derive from list
for this to work; all you need are extend() and append() methods.
--Guido van Rossum (home page: http://www.python.org/~guido/)
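
The __reduce__ approach described in the reply can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
Boost.Python implementation: BigArray, CHUNK and _pieces are hypothetical
names, a plain array.array stands in for the C++ array, and the code is
written for current Python 3 rather than the 2.3 release discussed in the
thread. The fourth item of the tuple returned by __reduce__ is an iterator
over fixed-size pieces, so only one small piece at a time has to exist as a
Python bytes object on the pickling side.

```python
import pickle
from array import array


class BigArray:
    """Hypothetical stand-in for an extension array class."""

    CHUNK = 4  # elements per pickled piece; tiny here for demonstration

    def __init__(self, values=()):
        self.data = array("d", values)

    def append(self, piece):
        # Called by the unpickler when it restores a single piece.
        self.data.frombytes(piece)

    def extend(self, pieces):
        # Called by the unpickler when it restores a batch of pieces.
        for piece in pieces:
            self.append(piece)

    def _pieces(self):
        # Yield the array contents one small bytes object at a time,
        # instead of building one string for the whole array.
        for start in range(0, len(self.data), self.CHUNK):
            yield self.data[start:start + self.CHUNK].tobytes()

    def __reduce__(self):
        # (callable, args, state, list-item iterator): the object is
        # rebuilt as BigArray() and then refilled through the
        # append()/extend() methods; it does not derive from list.
        return (BigArray, (), None, self._pieces())


original = BigArray(range(10))
restored = pickle.loads(pickle.dumps(original, protocol=2))
```

Note that pickle.dumps() still assembles the complete pickle in memory; to
keep the peak footprint low, one would presumably write to a file object
with pickle.dump() instead. The raw bytes produced by tobytes() are also
platform-dependent, which a real implementation would need to address.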