[Python-Dev] pickling of large arrays

Guido van Rossum guido@python.org
Mon, 29 Jul 2002 11:09:22 -0400


> We are using Boost.Python to expose reference-counted C++ container
> types (similar to std::vector<>) to Python. E.g.:
> 
> from arraytbx import shared
> d = shared.double(1000000) # double array with a million elements
> c = shared.complex_double(100) # std::complex<double> array
> # and many more types, incl. several custom C++ types
> 
> We need a way to pickle these arrays. Since they can easily be
> converted to tuples we could just define functions like:
> 
>   def __getstate__(self):
>     return tuple(self)
> 
> However, since the arrays are potentially huge, this could incur
> a large overhead (e.g. a tuple of a million Python floats).
> Next idea:
> 
>   def __getstate__(self):
>     return iter(self)
> 
> Unfortunately (but not unexpectedly) pickle is telling me:
> 'can't pickle iterator objects'
> 
> Attached is a short Python script (tested with 2.2.1) with a prototype
> implementation of a pickle helper ("piece_meal") for large arrays.

That's a neat trick; unfortunately, it only helps when the pickle is
being written directly to disk.  When it is returned as a string, you
still get the entire array in memory.
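
To make that distinction concrete, here is a minimal sketch using the
standard pickle module; the plain list just stands in for one of the
large arrays:

  import pickle

  data = list(range(1000000))   # stands in for a large shared array

  # dump() writes the pickle to the file as it is produced, so the
  # output never has to exist as one giant string in memory.
  f = open('data.pkl', 'wb')
  pickle.dump(data, f)
  f.close()

  # dumps() must assemble the complete pickle as a single string first,
  # so chunking the state does not reduce peak memory in this case.
  s = pickle.dumps(data)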

> piece_meal's __getstate__ converts a block of a given size to a Python
> list and returns a tuple with that list and a new piece_meal instance
> which knows how to generate the next chunk. I.e. piece_meal instances
> are created recursively until the input sequence is exhausted. The
> corresponding __setstate__ puts the pieces back together again
> (uncomment the print statement to see the pieces).
> 
> I am wondering if a similar mechanism could be used to enable pickling
> of iterators, or maybe special "pickle_iterators", which would
> immediately enable pickling of our large arrays or any other object
> that can be iterated over (e.g. Numpy arrays which are currently
> pickled as potentially huge strings). Has this been discussed already?
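
For reference, a minimal sketch of the piece_meal mechanism described in
the quoted message; the class name is taken from the message, but the
block size, attribute names, and list-based reassembly are illustrative
assumptions rather than the attached script:

  class piece_meal:
      """Pickle helper: pickles a large sequence one chunk at a time."""

      def __init__(self, seq, start=0, block_size=10000):
          self.seq = seq                  # the large input sequence
          self.start = start              # offset of this instance's chunk
          self.block_size = block_size
          self.pieces = None              # filled in by __setstate__

      def __getstate__(self):
          end = self.start + self.block_size
          chunk = list(self.seq[self.start:end])
          if end < len(self.seq):
              # Hand the remainder to a new piece_meal; pickling that
              # instance recurses until the input sequence is exhausted.
              rest = piece_meal(self.seq, end, self.block_size)
          else:
              rest = None
          return (chunk, rest)

      def __setstate__(self, state):
          # Put the pieces back together again.  The nested instance has
          # already been unpickled, so its pieces are complete.
          chunk, rest = state
          self.pieces = chunk
          if rest is not None:
              self.pieces.extend(rest.pieces)

  # Round trip:
  import pickle
  data = list(range(100000))
  restored = pickle.loads(pickle.dumps(piece_meal(data))).pieces
  assert restored == data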

I think pickling iterators is the wrong idea.  An iterator doesn't
represent data; it represents a single pass over data.  Iterators may
represent infinite series.

--Guido van Rossum (home page: http://www.python.org/~guido/)