
Hi, I have a C program which outputs large (~GB) files. They are simple binary dumps of an array of structs, each containing 9 doubles. You can see this as a 1D array of doubles of size 9*Stot (Stot being the allocated size of the array of structs). The 1D array represents a 3D array (Sx * Sy * Sz = Stot) with 9 values per cell. I want to read these files in the most efficient way possible, and I would like your insight on this. Right now, the fastest way I have found is:

import numpy
from numpy import zeros, float64, squeeze
from pylab import imshow

imzeros = zeros((Sy, Sz), dtype=float64, order='C')
imex = imshow(imzeros)
f = open(filename, 'rb')
data = numpy.fromfile(file=f, dtype=numpy.float64, count=9*Stot)
mask_Ex = numpy.arange(6, 9*Stot, 9)
Ex = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()
imex.set_array(squeeze(Ex[:, :, z]))

The arrays will be big, so everything should be well optimized. I have several questions:

1) Should I change this:

Ex = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()
imex.set_array(squeeze(Ex[:, :, z]))

to this:

imex.set_array(squeeze(data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()[:, :, z]))

In other words, if I don't use a temporary variable, will it be faster or less memory hungry?

2) If not, does the assignment "Ex = " update the existing variable or create a new one? Ideally I would like to only update it. Maybe this would be better:

Ex[:, :, :] = data[mask_Ex].reshape((Sz, Sy, Sx), order='C').transpose()

Would it?

3) The machine where the code will run might be big-endian. Is there a way for Python to read the big-endian file and "translate" it automatically to little-endian? Something like numpy.fromfile(file=f, dtype=numpy.float64, count=9*Stot, endianness='big')?

Thanks a lot! ;)

Nicolas

On Thu, Apr 3, 2008 at 3:30 PM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
This is something you can do much, much more efficiently by using a slice instead of indexing with an integer array.
No. The temporary exists whether you give it a name or not. If you use data[6::9] instead of data[mask], you won't be using any extra memory at all. The arrays will just be views into the original array.
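To see the difference concretely, here is a toy-sized sketch (the real file data is replaced by a small arange; only the view-versus-copy behaviour is the point):

import numpy

data = numpy.arange(9 * 4, dtype=numpy.float64)   # stand-in for the values read from the file
mask = numpy.arange(6, data.size, 9)

view = data[6::9]        # slice: a view into data, no extra memory allocated
copy = data[mask]        # integer-array indexing: allocates a new array

data[6] = -1.0
print(view[0], copy[0])  # -1.0 6.0 -> the view sees the change, the copy does not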
2) If not, is the operation "Ex = " update the variable data or create another one?
It just reassigns the name "Ex" to a different object specified on the right-hand side of the assignment. The relevant question is whether the expression on the right-hand side takes up more memory.
If you use data[6::9] instead of data[mask], you should just use "Ex = " since no new memory will be used on the RHS.
For the endianness question, use:

dtype=numpy.dtype('>f8')

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
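Put into a complete call, that might look like the following sketch (the function name and the astype() conversion back to native byte order are illustrative additions, not code from the thread):

import numpy

def read_big_endian_doubles(filename, count):
    """Read `count` big-endian float64 values and return them in native byte order."""
    with open(filename, 'rb') as f:
        data = numpy.fromfile(f, dtype=numpy.dtype('>f8'), count=count)
    # astype() copies into the host's native byte order, so later math
    # does not pay a byte-swap on every access.
    return data.astype(numpy.float64)

# e.g. data = read_big_endian_doubles('fields.bin', 9 * Stot)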

Thanks for the fast response Robert ;) I changed my code to use the slice:

E = data[6::9]

It is indeed faster and eats less memory. Great. Thanks for the endianness tip! I knew there was something like this ;) I take it that, in '>f8', "f" means float and "8" means 8 bytes?
So the next step would be to only read the needed data from the binary file... Is it possible to read from a file with a slice? So instead of:

data = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot)
E = data[6::9]

maybe something like:

E = numpy.fromfile(file=f, dtype=float_dtype, count=9*Stot, slice=6::9)

Thank you!

On Thu, Apr 3, 2008 at 6:53 PM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
Yes, and the '>' means big-endian. '<' is little-endian, and '=' is native-endian.
Instead of reading using fromfile(), you can try memory-mapping the array:

from numpy import memmap
E = memmap(f, dtype=float_dtype, mode='r')[6::9]

That may or may not help. At least, it should decrease the latency before you start pulling out frames.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
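Combined with the reshape from Nicolas's script, that might look roughly like this sketch (load_ex_frame and its arguments are placeholders for the names in the original script; the big-endian dtype is assumed from earlier in the thread):

import numpy

def load_ex_frame(filename, Sx, Sy, Sz, z):
    """Return the 2D Ex slab at index z without reading the whole file."""
    # Map the file read-only; nothing is pulled from disk until it is indexed.
    raw = numpy.memmap(filename, dtype=numpy.dtype('>f8'), mode='r')
    # Every 9th value starting at component 6 (Ex); the reshape and
    # transpose stay lazy views into the mapped file.
    Ex = raw[6::9].reshape((Sz, Sy, Sx), order='C').transpose()
    # Copying this 2D slab is what actually pulls data off the disk.
    return numpy.array(Ex[:, :, z])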

On Fri, Apr 4, 2008 at 2:14 AM, Nicolas Bigaouette <nbigaouette@gmail.com> wrote:
Hi, Coincidentally, I'm trying to do exactly the same thing right now... What is the best way of memmapping into a file that is already open? I have to read some text (header info) off the beginning of the file before I know where the data actually starts. I could of course get the position at that point (f.tell()), close the file, and reopen it using memmap. However, this doesn't sound optimal to me... Any hints? Could numpy's memmap be changed to also accept file objects, or is there a "rule" that memmap always has to have access to the entire file?

Thanks,
Sebastian Haase

On Fri, Apr 4, 2008 at 1:50 AM, Sebastian Haase <haase@msg.ucsf.edu> wrote:
I am getting a little tired, so this may be incorrect. But I believe Stefan modified memmaps to allow them to be created from file-like objects: http://projects.scipy.org/scipy/numpy/changeset/4856 Are you running a released version of NumPy or the trunk? If you aren't using the trunk, could you give it a try? It would be good to have it tested before the 1.0.5 release.

Cheers,

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
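Assuming the file-object support works together with numpy.memmap's offset argument (a byte count from the start of the file), the header case might look roughly like this sketch; map_after_header and the readline() header are illustrative, not code from the changeset:

import numpy

def map_after_header(f, dtype):
    """Memory-map the data section of an already-open binary file.

    Assumes the header has just been parsed, so f.tell() points at the
    first data byte.
    """
    offset = f.tell()  # where the binary payload starts
    return numpy.memmap(f, dtype=dtype, mode='r', offset=offset)

# e.g.
# with open('fields.bin', 'rb') as f:
#     header = f.readline()        # consume the (hypothetical) text header
#     E = map_after_header(f, numpy.dtype('>f8'))[6::9]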

On Fri, Apr 4, 2008 at 11:33 AM, Jarrod Millman <millman@berkeley.edu> wrote:
Hi Jarrod, Thanks for the reply. Indeed, I'm only running N.__version__ '1.0.4.dev4312'. I hope I find time to try the new feature. To clarify: if the file is already open and the current position (f.tell()) is somewhere in the middle, would the memmap "see" the file from there? Could a "normal" file access and a concurrent memmap into that same file step on each other's feet?

Thanks,
Sebastian Haase

Nicolas Bigaouette wrote:
So the next step would be to only read the needed data from the binary file...
You've gotten some suggestions, but another option is to use file.seek() to get to where your data is, and numpy.fromfile() from there.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
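That approach might look like the following sketch (the single-line text header and the helper name are assumptions for illustration; fromfile() reads from the file object's current position):

import numpy

def read_fields_after_header(filename, count):
    """Skip a one-line text header, then read `count` big-endian doubles."""
    with open(filename, 'rb') as f:
        f.readline()  # read past the (hypothetical) text header
        # The file position now sits at the start of the binary data,
        # so fromfile() picks up exactly where the header ended.
        return numpy.fromfile(f, dtype=numpy.dtype('>f8'), count=count)

# e.g. Ex = read_fields_after_header('fields.bin', 9 * Stot)[6::9]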