[Python-Dev] pickling of large arrays

Ralf W. Grosse-Kunstleve rwgk@yahoo.com
Thu, 20 Feb 2003 04:38:58 -0800 (PST)


This question is related to PEP 307, "Extensions to the pickle protocol",
http://www.python.org/peps/pep-0307.html .

Apparently the new pickle protocol 2 provides a mechanism for
avoiding large temporaries, but only for lists and dicts (see the
section "Pickling of large lists and dicts" near the end of the PEP).
I am wondering whether the new protocol could also help us eliminate
large temporaries when pickling Boost.Python extension classes.

We wrote an open source C++ array library with Boost.Python bindings.
For pickling we use the __getstate__/__setstate__ protocol. As it
stands, pickling involves converting the arrays to Python strings,
similar to what is done in Numpy. There are two mechanisms (see the
Python sketch after this list):

1. "single buffered":

   For numeric types (int, long, double, etc.) a Python string is
   allocated based on an upper estimate for the required size
   (PyString_FromStringAndSize). The entire numeric array is converted
   directly to that string. Finally the Python string is resized
   (_PyString_Resize).
   With this mechanism there are 2 copies of the array in memory:
     - the original array and
     - the Python string.

2. "double buffered":

   For some user-defined element types it is very difficult to estimate
   an upper limit for the size of the string representation. Therefore
   the array is first converted to a dynamically growing C++
   std::string, which is then copied to a Python string.
   With this mechanism there are 3 copies of the array in memory:
     - the original array,
     - the std::string, and
     - the Python string.
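
To make the memory accounting concrete, here is a rough Python-level
analogue of the two mechanisms. (The real code is C++ inside the
Boost.Python bindings; DoubleArray and VarSizeArray are hypothetical
stand-ins, and only the number of in-memory copies is meant to match.)

    import pickle
    import struct

    class DoubleArray(object):
        # "single buffered": the packed size is known up front, so
        # __getstate__ produces the flat string directly
        # -> original array + Python string in memory.
        def __init__(self, values=()):
            self.values = list(values)

        def __getstate__(self):
            return struct.pack("%dd" % len(self.values), *self.values)

        def __setstate__(self, state):
            n = len(state) // struct.calcsize("d")
            self.values = list(struct.unpack("%dd" % n, state))

    class VarSizeArray(object):
        # "double buffered": the per-element size is hard to bound, so the
        # elements go into a growing intermediate buffer first
        # -> original array + intermediate buffer + Python string in memory.
        def __init__(self, values=()):
            self.values = list(values)

        def __getstate__(self):
            chunks = []                     # plays the role of the std::string
            for v in self.values:
                chunks.append(repr(v))
            return " ".join(chunks)         # the final Python string

        def __setstate__(self, state):
            self.values = [float(c) for c in state.split()]

    a = DoubleArray([0.1 * i for i in range(5)])
    b = VarSizeArray([0.1 * i for i in range(5)])
    assert pickle.loads(pickle.dumps(a, 2)).values == a.values
    assert pickle.loads(pickle.dumps(b, 2)).values == b.values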

For very large arrays the memory overhead can be a limiting factor.
Could the new protocol 2 help us in some way?

Thank you in advance,
        Ralf

