[Numpy-discussion] cPickle/unPickle across archs

Robert Kern robert.kern at gmail.com
Thu Jan 7 17:30:24 EST 2010

On Thu, Jan 7, 2010 at 15:54, James Mazer <james.mazer at yale.edu> wrote:
> Hi,
> I've got some Numeric arrays that were created without
> an explicit byte size in the initial declaration and pickled.
> Something like this:
>   >>> cPickle.dump(array(ones((3,3,)), 'f'), open('foo.pic', 'w'))
> as opposed to:
>   >>> cPickle.dump(array(ones((3,3,)), Float32), open('foo.pic', 'w'))
> This works as long as the word size doesn't change between the
> reading and writing machines.
> The data were generated under a 32bit linux kernel and now I'm trying
> to read them under a 64bit kernel, so the word size has changed and
> Numeric assumes that the 'f' type is the NATIVE float

Please note that 'f' is always a 32-bit float on any machine. Only
integers may change size.
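
As a rough illustration (not Numeric itself), the standard-library struct
module reports the native size of the same C types. This sketch assumes
that Numeric's 'f', 'i' and 'l' typecodes correspond to C float, int and
long, which is also what struct's native codes of the same names mean:

```python
import struct

# "@" selects native sizes, i.e. whatever the C compiler uses on this machine.
for code in ("f", "d", "i", "l"):
    print(code, struct.calcsize("@" + code))

float_size = struct.calcsize("@f")  # C float
long_size = struct.calcsize("@l")   # C long: depends on the platform
```

On a 32-bit Linux build the 'l' line should report 4 bytes and on a 64-bit
Linux build 8 bytes, while 'f' stays at 4 on both.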

> and 'l' type is
> the NATIVE long) and dies miserably when the native types don't match
> the actual types (which defeats the whole point of pickling, to some
> extent -- I thought that cPickle.dump/load were "ensured" to be
> invertible...)

I don't think cPickle ensures much at all. It's actually rather
fragile for persisting data over long times and between different
environments. It works better as a wire format for communication
between similar codebases when thoroughly tested on both ends. Using a
standard scientific file format for storing your important data has
always been de rigueur.

That said, it is a deficiency in Numeric that it records the native
typecode instead of a platform-neutral, explicitly sized typecode.
Unfortunately, Numeric has been deprecated for many years now, and is
not maintained. Numeric's replacement, numpy, does not have this problem.
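
To make "platform-neutral, explicitly sized" concrete, here is a minimal
sketch using the standard-library struct module rather than numpy or
Numeric. With an explicit byte-order prefix such as "<", struct uses fixed
standard sizes instead of the machine's native ones, so the same format
string describes the same bytes everywhere:

```python
import struct

values = [1, 2, 3]

# "<" means little-endian with standard sizes: 'i' and 'l' are always
# 4 bytes and 'q' is always 8 bytes, regardless of the machine's word size.
encoded = struct.pack("<3i", *values)
decoded = list(struct.unpack("<3i", encoded))

standard_sizes = {code: struct.calcsize("<" + code) for code in "ilq"}
```

Data written this way can be read back on a 32-bit or a 64-bit machine
without guessing what size the writer's integers had.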

> I've got terabytes of data that need to be read by both 32bit and
> 64bit machines (and it's not really feasible to scan all the files
> into new structures with explicit types on a 32bit machine). Anybody
> have hints for addressing this problem?  I found similar questions,
> but no answers, so I'm not completely alone with this problem.

What you can do is monkeypatch the function
Numeric.array_constructor() to do "the right thing" for your case when
it sees a platform-specific integer typecode. Something like the
following (untested; you may need to generalize it to handle the
unsigned integer typecodes, too, if you have that kind of data):

import Numeric

# Size in bytes of a C int on this machine (typically 4).
i_size = Numeric.empty(0, 'i').itemsize()

def patched_array_constructor(shape, typecode, thestr,
                              Endian=Numeric.LittleEndian):
    if typecode == "l":
        # Ensure that the length of the data matches our expectations:
        # if each item is int-sized rather than long-sized, the pickle
        # came from a machine where 'l' was as wide as 'i' is here.
        size = Numeric.product(shape)
        if size:
            itemsize = len(thestr) // size
            if itemsize == i_size:
                typecode = 'i'
    if typecode == "O":
        x = Numeric.array(thestr, "O")
    else:
        x = Numeric.fromstring(thestr, typecode)
    x.shape = shape
    if Numeric.LittleEndian != Endian:
        return x.byteswapped()
    else:
        return x

Numeric.array_constructor = patched_array_constructor

After you have done that, cPickle.load() will use that patched
function to reconstruct the arrays and make sure that the appropriate
typecode is used to interpret the data.
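
The reason the patch takes effect is that a pickle stores the constructor
by module and name and looks it up again at load time. The following
sketch shows that mechanism with the modern pickle module; it does not use
Numeric at all. The module "fake_numeric", the class FakeArray and both
constructor functions are hypothetical stand-ins invented for the
illustration, and struct stands in for Numeric.fromstring():

```python
import pickle
import struct
import sys
import types
from math import prod

# Hypothetical stand-in for the Numeric module.
fake = types.ModuleType("fake_numeric")
sys.modules["fake_numeric"] = fake

def array_constructor(shape, typecode, raw):
    # Original behaviour: trust the recorded typecode and decode natively.
    return list(struct.unpack("@%d%s" % (prod(shape), typecode), raw))

# Make the function look like a top-level member of the stand-in module,
# which is what pickle needs in order to store it by name.
array_constructor.__module__ = "fake_numeric"
array_constructor.__qualname__ = "array_constructor"
fake.array_constructor = array_constructor

class FakeArray:
    def __init__(self, shape, typecode, raw):
        self.shape, self.typecode, self.raw = shape, typecode, raw

    def __reduce__(self):
        # Only the constructor's name and its arguments go into the pickle.
        return (fake.array_constructor, (self.shape, self.typecode, self.raw))

# Simulate a 32-bit writer: typecode 'l', but each item is int-sized.
raw = struct.pack("@3i", 7, 8, 9)
payload = pickle.dumps(FakeArray((3,), "l", raw))

def patched_array_constructor(shape, typecode, raw):
    count = prod(shape)
    # Same check as in the patch above: infer the item size from the data.
    if typecode == "l" and count and len(raw) // count == struct.calcsize("@i"):
        typecode = "i"
    return list(struct.unpack("@%d%s" % (count, typecode), raw))

# Monkeypatch after the pickle was written; loading picks up the new function.
fake.array_constructor = patched_array_constructor
restored = pickle.loads(payload)
```

Here restored comes back as the three original integers, because the
lookup of "fake_numeric.array_constructor" at load time finds the patched
function rather than the one that was in place when the data was dumped.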

Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
