[Numpy-discussion] cPickle/unPickle across archs
Robert Kern
robert.kern at gmail.com
Thu Jan 7 17:30:24 EST 2010
On Thu, Jan 7, 2010 at 15:54, James Mazer <james.mazer at yale.edu> wrote:
> Hi,
>
> I've got some Numeric arrays that were created without
> an explicit byte size in the initial declaration and pickled.
> Something like this:
>
> >>> cPickle.dump(array(ones((3,3,)), 'f'), open('foo.pic', 'w'))
>
> as opposed to:
>
> >>> cPickle.dump(array(ones((3,3,)), Float32), open('foo.pic', 'w'))
>
> This works as long as the word size doesn't change between the
> reading and writing machines.
>
> The data were generated under a 32bit linux kernel and now I'm trying
> to read them under a 64bit kernel, so the word size has changed and
> Numeric assumes that the 'f' type is the NATIVE float
Please note that 'f' is always a 32-bit float on any machine. Only
integers may change size.
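If you want to check this on a particular machine, here is a minimal sketch. It uses the standard-library struct module in place of Numeric (my substitution, on the assumption that struct's native 'f', 'i', and 'l' codes map to the same C float, int, and long that the Numeric typecodes do):

```python
import struct

# Native C type sizes in bytes, as reported by the struct module.
# 'f' = C float, 'i' = C int, 'l' = C long.
sizes = {code: struct.calcsize(code) for code in "fil"}

for code, nbytes in sizes.items():
    print("%s: %d bytes" % (code, nbytes))
```

On a typical 32-bit Linux build 'l' comes out as 4 bytes and on 64-bit Linux as 8, while 'f' and 'i' stay at 4.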
> (and 'l' type is
> the NATIVE long) and dies miserably when the native types don't match
> the actual types (which defeats the whole point of pickling, to some
> extent -- I thought that cPickle.dump/load were "ensured" to be
> invertible...)
I don't think cPickle ensures much at all. It's actually rather
fragile for persisting data over long times and between different
environments. It works better as a wire format for communication
between similar codebases when thoroughly tested on both ends. Using a
standard scientific file format for storing your important data has
always been de rigueur.
That said, it is a deficiency in Numeric that it records the native
typecode instead of a platform-neutral, explicitly sized typecode.
Unfortunately, Numeric has been deprecated for many years now, and is
not maintained. Numeric's replacement, numpy, does not have this
problem.
> I've got terabytes of data that need to be read by both 32bit and
> 64bit machines (and it's not really feasible to scan all the files
> into new structures with explicit types on a 32bit machine). Anybody
> have hints for addressing this problem? I found similar questions,
> but no answers, so I'm not completely alone with this problem.
What you can do is monkeypatch the function
Numeric.array_constructor() to do "the right thing" for your case when
it sees a platform-specific integer typecode. Something like the
following (untested; you may need to generalize it to handle the
unsigned integer typecodes, too, if you have that kind of data):

import Numeric

# Size in bytes of a C int on this machine.
i_size = Numeric.empty(0, 'i').itemsize()

def patched_array_constructor(shape, typecode, thestr,
                              Endian=Numeric.LittleEndian):
    if typecode == "l":
        # Ensure that the length of the data matches our expectations.
        size = Numeric.product(shape)
        # Skip the check for empty arrays to avoid dividing by zero.
        if size > 0:
            itemsize = len(thestr) // size
            if itemsize == i_size:
                typecode = 'i'
    if typecode == "O":
        x = Numeric.array(thestr, "O")
    else:
        x = Numeric.fromstring(thestr, typecode)
    x.shape = shape
    if Numeric.LittleEndian != Endian:
        return x.byteswapped()
    else:
        return x

Numeric.array_constructor = patched_array_constructor

After you have done that, cPickle.load() will use that patched
function to reconstruct the arrays and make sure that the appropriate
typecode is used to interpret the data.
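
The size-inference idea in that patch can be tried out without Numeric installed. The following is a rough sketch in modern Python using only the standard library. The name resolve_typecode is a hypothetical helper, not part of Numeric, and the sample bytes are made up to mimic six integers written as 4-byte longs on a 32-bit little-endian machine:

```python
import array
import struct
import sys
from math import prod

# Size in bytes of a C int on this machine.
I_SIZE = array.array("i").itemsize

def resolve_typecode(shape, typecode, raw):
    """Pick the typecode to decode `raw` with (hypothetical helper).

    If the data claims to be 'l' but each element is only as wide as a
    C int, fall back to 'i'. Everything else is returned unchanged.
    """
    if typecode == "l":
        count = prod(shape)
        # Guard against empty arrays before dividing.
        if count and len(raw) // count == I_SIZE:
            return "i"
    return typecode

# Simulated payload: six little-endian 4-byte integers.
raw = struct.pack("<6i", 1, 2, 3, 4, 5, 6)

code = resolve_typecode((2, 3), "l", raw)
values = array.array(code)
values.frombytes(raw)
if sys.byteorder != "little":
    # The payload is little-endian; swap on big-endian hosts.
    values.byteswap()

print(code, values.tolist())
```

The check only looks at the ratio of byte length to element count, so it does not depend on whether the reading machine's own long is 4 or 8 bytes.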
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco