[Numpy-discussion] Efficient reading of binary data

Nicolas Bigaouette nbigaouette at gmail.com
Thu Apr 3 19:53:02 EDT 2008


Thanx for the fast response Robert ;)

I changed my code to use the slice:
E = data[6::9]
It is indeed faster and uses less memory. Great.
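
A quick check of why, as a minimal sketch on a made-up array (the tiny Stot and the name "mask" are only for illustration): a basic slice gives a view onto data, while indexing with an integer array copies the selected elements.

```python
import numpy as np

Stot = 4                                  # hypothetical small cell count
data = np.arange(9 * Stot, dtype=np.float64)

E_view = data[6::9]                       # basic slice: strided view, no copy
mask = np.arange(6, 9 * Stot, 9)
E_copy = data[mask]                       # integer-array indexing: new array

print(np.shares_memory(data, E_view))     # True
print(np.shares_memory(data, E_copy))     # False
print(np.array_equal(E_view, E_copy))     # True
print(E_view.strides)                     # (72,): 9 doubles of 8 bytes each
```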

Thanx for the endianness! I knew there was something like this ;) I suspect
that, in '>f8', "f" means float and "8" means 8 bytes?
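
For what it is worth, the dtype object itself can be asked. A small sketch (the sample values are invented, and the byteorder comment assumes a little-endian machine):

```python
import numpy as np

dt = np.dtype('>f8')
print(dt.byteorder)   # '>' on a little-endian machine (big-endian data)
print(dt.kind)        # 'f' (floating point)
print(dt.itemsize)    # 8   (bytes per element)

# Round trip: bytes written big-endian are read back as the same values.
raw = np.array([1.0, 2.5, -3.0], dtype='>f8').tobytes()
values = np.frombuffer(raw, dtype=dt)
native = values.astype(np.float64)     # copy converted to native byte order
print(native.tolist())                 # [1.0, 2.5, -3.0]
```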

From some benchmarks, I see that the slowest thing is disk access. It can
slow the displaying of data from around 1 sec (when the data is in the OS
cache or buffer) to 8 sec.

So the next step would be to only read the needed data from the binary
file... Is it possible to read from a file with a slice? So instead of:
data = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot)
E = data[6::9]
maybe something like:
E = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot, slice=6::9)
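
As far as I know, numpy.fromfile has no such "slice" argument. One possible sketch uses numpy.memmap, which maps the file and lets the slice pick out elements lazily. The file name and sizes below are made up for the example. Note that with the wanted values only 72 bytes apart, the OS will probably still read nearly every page of the file, so this likely saves memory rather than disk time.

```python
import os
import tempfile
import numpy as np

Stot = 1000                                   # hypothetical number of cells
float_dtype = np.dtype('>f8')                 # big-endian doubles, as in the thread

# Write a stand-in for the C program's dump: Stot records of 9 doubles.
path = os.path.join(tempfile.mkdtemp(), 'dump.bin')
np.arange(9 * Stot, dtype=np.float64).astype(float_dtype).tofile(path)

# Map the file instead of loading it; nothing is read until it is accessed.
mm = np.memmap(path, dtype=float_dtype, mode='r', shape=(9 * Stot,))
E = np.array(mm[6::9], dtype=np.float64)      # copy only every 9th value into RAM

print(E.shape)    # (1000,)
print(E[:3])      # values 6, 15, 24
del mm            # drop the mapping
```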

Thank you!


2008/4/3, Robert Kern <robert.kern at gmail.com>:
>
> On Thu, Apr 3, 2008 at 3:30 PM, Nicolas Bigaouette
> <nbigaouette at gmail.com> wrote:
> > Hi,
> >
> > I have a C program which outputs large (~GB) files. It is a simple binary
> > dump of an array of structures containing 9 doubles. You can see this as a
> > double 1D array of size 9*Stot (Stot being the allocated size of the array
> > of structures). The 1D array represents a 3D array (Sx * Sy * Sz = Stot)
> > containing 9 values per cell.
> >
> > I want to read these files in the most efficient way possible, and I would
> > like to have your insight on this.
> >
> > Right now, the fastest way I found was:
> > imzeros = zeros((Sy,Sz),dtype=float64,order='C')
> >  imex = imshow(imzeros)
> > f = open(filename, 'rb')
> > data = numpy.fromfile(file=f, dtype=numpy.float64, count=9*Stot)
> > mask = numpy.arange(6,9*Stot,9)
>
>
> This is something you can do much, much more efficiently by using a
> slice instead of indexing with an integer array.
>
>
> > Ex = data[mask].reshape((Sz,Sy,Sx), order='C').transpose()
> > imex.set_array(squeeze(Ex[:,:,z]))
> >
> > The arrays will be big, so everything should be well optimized. I have
> > multiple questions:
> >
> > 1) Should I change this:
> > Ex = data[mask].reshape((Sz,Sy,Sx), order='C').transpose()
> > imex.set_array(squeeze(Ex[:,:,z]))
> > to:
> > imex.set_array(squeeze(data[mask].reshape((Sz,Sy,Sx),
> > order='C').transpose()[:,:,z]))
> > I mean, if I don't use a temporary variable, will it be faster or less
> > memory hungry?
>
>
> No. The temporary exists whether you give it a name or not. If you use
> data[6::9] instead of data[mask], you won't be using any extra memory
> at all. The arrays will just be views into the original array.
>
>
> > 2) If not, does the operation "Ex = " update the variable data or create
> > another one?
>
>
> It just reassigns the name "Ex" to a different object specified on the
> right-hand side of the assignment. The relevant question is whether the
> expression on the right-hand side takes up more memory.
>
>
> > Ideally I would like to only update it. Maybe this would be
> > better:
> >
> > Ex[:,:,:] = data[mask].reshape((Sz,Sy,Sx), order='C').transpose()
> >
> > Would it?
>
>
> If you use data[6::9] instead of data[mask], you should just use "Ex =
> " since no new memory will be used on the RHS.
>
>
> > 3) The machine where the code will be run might be big-endian. Is there a
> > way for Python to read the big-endian file and "translate" it automatically
> > to little-endian? Something like "numpy.fromfile(file=f,
> > dtype=numpy.float64, count=9*Stot, endianness='big')"?
>
>
> dtype=numpy.dtype('>f8')
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>   -- Umberto Eco