[Numpy-discussion] cPickle/unPickle across archs

Thu Jan 7 17:30:24 EST 2010

On Thu, Jan 7, 2010 at 15:54, James Mazer <james.mazer at yale.edu> wrote:
> Hi,
>
> I've got some Numeric arrays that were created without
> an explicit byte size in the initial declaration and pickled.
> Something like this:
>
>   >>> cPickle.dump(array(ones((3,3,)), 'f'), open('foo.pic', 'w'))
>
> as opposed to:
>
>   >>> cPickle.dump(array(ones((3,3,)), Float32), open('foo.pic', 'w'))
>
> This works as long as the word size doesn't change between the
> reading and writing machines.
>
> The data were generated under a 32bit linux kernel and now I'm trying
> to read them under a 64bit kernel, so the word size has changed and
> Numeric assumes that the 'f' type is the NATIVE float

Please note that 'f' is always a 32-bit float on any machine. Only
integers may change size.
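
A quick way to see this on a given machine is to ask for the native C
type sizes. This is only a sketch: it uses the standard library's struct
module rather than Numeric, on the assumption that the 'f', 'i' and 'l'
codes map to the same C types (float, int, long) in both.

```python
import struct

# Native sizes, in bytes, of the C types behind each typecode on this machine.
float_size = struct.calcsize("f")  # C float
int_size = struct.calcsize("i")    # C int
long_size = struct.calcsize("l")   # C long: differs between platforms

print("f:", float_size, "i:", int_size, "l:", long_size)
```

On a 32-bit Linux machine and a 64-bit one, only the 'l' figure should
differ.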

> and 'l' type is
> the NATIVE long, and dies miserably when the native types don't match
> the actual types (which defeats the whole point of pickling, to some
> extent -- I thought that cPickle.dump/load were "ensured" to be
> invertible...)

I don't think cPickle ensures much at all. It's actually rather
fragile for persisting data over long times and between different
environments. It works better as a wire format for communication
between similar codebases when thoroughly tested on both ends. Using a
standard scientific file format for storing your important data has
always been de rigueur.

That said, it is a deficiency in Numeric that it records the native
typecode instead of a platform-neutral, explicitly sized typecode.
Unfortunately, Numeric has been deprecated for many years now, and is
not maintained. Numeric's replacement, numpy, does not have this
problem.
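
To illustrate the difference between a native typecode and a
platform-neutral, explicitly sized one, here is a small sketch using the
standard library's struct module. It is an analogy only, not the
mechanism that Numeric or numpy use internally.

```python
import struct

values = (1, 2, 3)

# Native mode: "l" is the platform's C long, so the byte count varies.
native = struct.pack("3l", *values)

# Standard mode: the "<" prefix fixes "l" at 4 bytes, little-endian, everywhere.
neutral = struct.pack("<3l", *values)

print(len(native), len(neutral))
print(struct.unpack("<3l", neutral))
```

Data written the second way can be read back identically on any
machine, because the format string itself carries the size and byte
order.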

> I've got terabytes of data that need to be read by both 32bit and
> 64bit machines (and it's not really feasible to scan all the files
> into new structures with explicit types on a 32bit machine). Anybody
> have hints for addressing this problem?  I found similar questions,
> but no answers, so I'm not completely alone with this problem.

What you can do is monkeypatch the function
Numeric.array_constructor() to do "the right thing" for your case when
it sees a platform-specific integer typecode. Something like the
following (untested; you may need to generalize it to handle the
unsigned integer typecodes, too, if you have that kind of data):

import Numeric

# Size in bytes of a native C int on the machine doing the reading.
i_size = Numeric.empty(0, 'i').itemsize()

def patched_array_constructor(shape, typecode, thestr,
                              Endian=Numeric.LittleEndian):
    if typecode == "l":
        # The pickle recorded a native long. If the byte count only fits
        # int-sized items, the data came from a machine with a smaller
        # long, so read it as 'i' instead.
        size = Numeric.product(shape)
        if len(thestr) == size * i_size:
            typecode = 'i'
    if typecode == "O":
        x = Numeric.array(thestr, "O")
    else:
        x = Numeric.fromstring(thestr, typecode)
    x.shape = shape
    # Byteswap if the data was written with a different byte order.
    if Numeric.LittleEndian != Endian:
        return x.byteswapped()
    else:
        return x

Numeric.array_constructor = patched_array_constructor

After you have done that, cPickle.load() will use that patched
function to reconstruct the arrays and make sure that the appropriate
typecode is used to interpret the data.
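
Since Numeric may be hard to install these days, the size check at the
heart of the patch can be tried out on its own. The following is a
sketch using only the standard library's array module; the helper name
resolve_typecode is made up for illustration and is not part of Numeric.

```python
import array
from math import prod

# Size in bytes of a native C int on this machine.
I_SIZE = array.array("i").itemsize

def resolve_typecode(shape, typecode, data):
    """Pick the typecode to decode `data` with (hypothetical helper)."""
    if typecode == "l":
        count = prod(shape)
        # If the bytes only fit int-sized items, treat the data as 'i'.
        if len(data) == count * I_SIZE:
            return "i"
    return typecode

# Simulate a 3x3 array whose items are int-sized but were labelled 'l'.
payload = array.array("i", range(9)).tobytes()
code = resolve_typecode((3, 3), "l", payload)
values = array.array(code, payload).tolist()
print(code, values)
```

Typecodes other than 'l' pass through unchanged, which mirrors how the
patch leaves floats and object arrays alone.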

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco