why is bytearray treated so inefficiently by pickle?

Robert Kern robert.kern at gmail.com
Tue Dec 6 15:25:02 EST 2011


On 12/6/11 7:27 PM, John Ladasky wrote:
> On a related note, pickling arrays of float64 objects, as generated
> by the numpy package for example, is wildly inefficient with memory.
> A half-million float64's require about 4 megabytes, but the pickle
> file I generated from a numpy.ndarray of this size was 42 megabytes.
>
> I know that numpy has its own pickle protocol, and that it's supposed
> to help with this problem.  Still, if this is a general problem with
> Python and pickling numbers, it might be worth solving it in the
> language itself.

It is. Use protocol=HIGHEST_PROTOCOL when dumping the array to a pickle.

[~]
|1> big = np.linspace(0.0, 1.0, 500000)

[~]
|2> import cPickle

[~]
|3> len(cPickle.dumps(big))
11102362

[~]
|4> len(cPickle.dumps(big, protocol=cPickle.HIGHEST_PROTOCOL))
4000135


The original conception for pickle was that it would have an ASCII 
representation for optimal cross-platform compatibility. These were the days 
when people still used FTP regularly, and you could easily (and silently!) screw 
up binary data if you sent it in ASCII mode by accident. This necessarily 
creates large files for numpy arrays. Further iterations on the pickling 
protocol let numpy use raw binary data in the pickle. However, for backwards 
compatibility, the default protocol is the one Python started out with. If you 
explicitly use the most recent protocol, then you will get the efficiency benefits.
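The same effect is easy to reproduce in modern Python 3, where the pickle module has absorbed cPickle. A minimal sketch (using a plain list of floats rather than a numpy array, so nothing beyond the standard library is assumed):

```python
import pickle

# Half a million floats, analogous to the linspace array above.
data = [i / 500000 for i in range(500000)]

# Protocol 0 writes each float as ASCII text (its decimal repr),
# while protocol 2 and later store raw 8-byte IEEE-754 doubles.
ascii_size = len(pickle.dumps(data, protocol=0))
binary_size = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

print(ascii_size, binary_size)

# The binary pickle costs about 9 bytes per float (a one-byte opcode
# plus 8 bytes of data); the ASCII pickle is several times larger.
assert binary_size < ascii_size
```

Note that in Python 3 the default protocol is already a binary one, so the gap is only this dramatic if you explicitly request protocol 0.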

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

More information about the Python-list mailing list