
Hi, I have a C program which outputs large (~GB) files. They are simple binary dumps of an array of structs, each containing 9 doubles. You can see this as a 1D array of doubles of size 9*Stot (Stot being the allocated size of the array of structs). The 1D array represents a 3D array (Sx * Sy * Sz = Stot) with 9 values per cell. I want to read these files in the most efficient way possible, and I would like your insight on this. Right now, the fastest way I have found is:

import numpy
from numpy import zeros, float64, squeeze
from pylab import imshow

imzeros = zeros((Sy, Sz), dtype=float64, order='C')
imex = imshow(imzeros)
f = open(filename, 'rb')
data = numpy.fromfile(file=f, dtype=numpy.float64, count=9*Stot)
mask_Ex = numpy.arange(6, 9*Stot, 9)
Ex = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()
imex.set_array(squeeze(Ex[:, :, z]))

The arrays will be big, so everything should be well optimized. I have several questions:

1) Should I change this:

Ex = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()
imex.set_array(squeeze(Ex[:, :, z]))

to this:

imex.set_array(squeeze(data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()[:, :, z]))

In other words, if I don't use a temporary variable, will it be faster or less memory hungry?

2) If not, does the assignment "Ex = " update the existing variable or create a new one? Ideally I would like to only update it. Maybe this would be better:

Ex[:, :, :] = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()

Would it?

3) The machine where the code will run might be big-endian. Is there a way for Python to read the big-endian file and "translate" it automatically to little-endian? Something like numpy.fromfile(file=f, dtype=numpy.float64, count=9*Stot, endianness='big')?

Thanks a lot! ;)

Nicolas

On Thu, Apr 3, 2008 at 3:30 PM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
This is something you can do much, much more efficiently by using a slice instead of indexing with an integer array.
No. The temporary exists whether you give it a name or not. If you use data[6::9] instead of data[mask], you won't be using any extra memory at all. The arrays will just be views into the original array.
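To see the difference concretely, here is a toy-sized sketch (the real file data is replaced by a small arange; only the view-versus-copy behaviour is the point):

import numpy

data = numpy.arange(9 * 4, dtype=numpy.float64)   # stand-in for the values read from the file
mask = numpy.arange(6, data.size, 9)

view = data[6::9]        # slice: a view into data, no extra memory allocated
copy = data[mask]        # integer-array indexing: allocates a new array

data[6] = -1.0
print(view[0], copy[0])  # -1.0 6.0 -> the view sees the change, the copy does not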
2) If not, is the operation "Ex = " update the variable data or create another one?
It just reassigns the name "Ex" to a different object specified on the right-hand side of the assignment. The relevant question is whether the expression on the right-hand side takes up more memory.
If you use data[6::9] instead of data[mask], you should just use "Ex = " since no new memory will be used on the RHS.
For the endianness question, use:

dtype=numpy.dtype('>f8')

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
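Put into a complete call, that might look like the following sketch (the function name and the astype() conversion back to native byte order are illustrative additions, not code from the thread):

import numpy

def read_big_endian_doubles(filename, count):
    """Read `count` big-endian float64 values and return them in native byte order."""
    with open(filename, 'rb') as f:
        data = numpy.fromfile(f, dtype=numpy.dtype('>f8'), count=count)
    # astype() copies into the host's native byte order, so later math
    # does not pay a byte-swap on every access.
    return data.astype(numpy.float64)

# e.g. data = read_big_endian_doubles('fields.bin', 9 * Stot)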

Thanks for the fast response Robert ;) I changed my code to use the slice:

E = data[6::9]

It is indeed faster and eats less memory. Great. Thanks for the endianness tip! I knew there was something like this ;) I take it that, in '>f8', "f" means float and "8" means 8 bytes?
So the next step would be to only read the needed data from the binary file... Is it possible to read from a file with a slice? So instead of:

data = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot)
E = data[6::9]

maybe something like:

E = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot, slice=6::9)

Thank you!

On Thu, Apr 3, 2008 at 6:53 PM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
Yes, and the '>' means big-endian. '<' is little-endian, and '=' is native-endian.
Instead of reading using fromfile(), you can try memory-mapping the array:

from numpy import memmap
E = memmap(f, dtype=float_dtype, mode='r')[6::9]

That may or may not help. At least, it should decrease the latency before you start pulling out frames.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
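Combined with the reshape from Nicolas's script, that might look roughly like this sketch (load_ex_frame and its arguments are placeholders for the names in the original script; the big-endian dtype is assumed from earlier in the thread):

import numpy

def load_ex_frame(filename, Sx, Sy, Sz, z):
    """Return the 2D Ex slab at index z without reading the whole file."""
    # Map the file read-only; nothing is pulled from disk until it is indexed.
    raw = numpy.memmap(filename, dtype=numpy.dtype('>f8'), mode='r')
    # Every 9th value starting at component 6 (Ex); the reshape and
    # transpose stay lazy views into the mapped file.
    Ex = raw[6::9].reshape((Sz, Sy, Sx), order='C').transpose()
    # Copying this 2D slab is what actually pulls data off the disk.
    return numpy.array(Ex[:, :, z])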

On Fri, Apr 4, 2008 at 2:14 AM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
Hi, Coincidentally, I'm trying to do exactly the same thing right now... What is the best way of memmapping into a file that is already open? I have to read some text (header info) off the beginning of the file before I know where the data actually starts. I could of course get the position at that point (f.tell()), close the file, and reopen it using memmap. However, this doesn't sound optimal to me... Any hints? Could numpy's memmap be changed to also accept file objects, or is there a "rule" that memmap always has to have access to the entire file?

Thanks,
Sebastian Haase

On Fri, Apr 4, 2008 at 1:50 AM, Sebastian Haase <haase@msg.ucsf.edu> wrote:
I am getting a little tired, so this may be incorrect. But I believe Stefan modified memmaps to allow them to be created from file-like objects: http://projects.scipy.org/scipy/numpy/changeset/4856 Are you running a released version of NumPy or the trunk? If you aren't using the trunk, could you give it a try? It would be good to have it tested before the 1.0.5 release.

Cheers,

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
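Assuming the file-object support works together with numpy.memmap's offset argument (a byte count from the start of the file), the header case might look roughly like this sketch; map_after_header and the readline() header are illustrative, not code from the changeset:

import numpy

def map_after_header(f, dtype):
    """Memory-map the data section of an already-open binary file.

    Assumes the header has just been parsed, so f.tell() points at the
    first data byte.
    """
    offset = f.tell()  # where the binary payload starts
    return numpy.memmap(f, dtype=dtype, mode='r', offset=offset)

# e.g.
# with open('fields.bin', 'rb') as f:
#     header = f.readline()        # consume the (hypothetical) text header
#     E = map_after_header(f, numpy.dtype('>f8'))[6::9]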

On Fri, Apr 4, 2008 at 11:33 AM, Jarrod Millman <millman@berkeley.edu> wrote:
Hi Jarrod, Thanks for the reply. Indeed, I'm only running N.__version__ '1.0.4.dev4312'. I hope I find time to try the new feature. To clarify: if the file is already open and the current position (f.tell()) is somewhere in the middle, would the memmap "see" the file from there? Could a "normal" file access and a concurrent memmap into that same file step on each other's feet?

Thanks,
Sebastian Haase

Nicolas Bigaouette wrote:
So the next step would be to only read the needed data from the binary file...
You've gotten some suggestions, but another option is to use file.seek() to get to where your data is, and numpy.fromfile() from there.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
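That approach might look like the following sketch (the single-line text header and the helper name are assumptions for illustration; fromfile() reads from the file object's current position):

import numpy

def read_fields_after_header(filename, count):
    """Skip a one-line text header, then read `count` big-endian doubles."""
    with open(filename, 'rb') as f:
        f.readline()  # read past the (hypothetical) text header
        # The file position now sits at the start of the binary data,
        # so fromfile() picks up exactly where the header ended.
        return numpy.fromfile(f, dtype=numpy.dtype('>f8'), count=count)

# e.g. Ex = read_fields_after_header('fields.bin', 9 * Stot)[6::9]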