partially reading a file...
Hi, Let's say I want to read a (binary) file which contains an nx*ny*nz array. Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file? TIA. Cheers, -- Fred
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array). Then, selecting out a block is as simple as slicing. -Travis
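For instance, a minimal sketch of the idea (the file name, dtype, and dimensions here are assumptions; Travis spells out essentially this further down the thread):

    import numpy

    nx, ny, nz = 400, 400, 400   # assumed dimensions of the array in the file
    # Map the file read-only; the OS pages data in only when it is accessed.
    a = numpy.memmap('data.dat', mode='r', dtype=numpy.float64, shape=(nx, ny, nz))
    sub = a[:nx // 4, :ny // 4, :nz // 4]   # a view; no data is read until it is used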
Travis E. Oliphant wrote:
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array).
Then, selecting out a block is as simple as slicing.
Maybe I should have mentioned this: the aim is to cut a "large" data file, _bigger_ than the total available memory, into several files.
Does memmap still apply? Cheers, -- Fred
fred wrote:
Travis E. Oliphant wrote:
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array).
Then, selecting out a block is as simple as slicing.
Maybe I should have mentioned this: the aim is to cut a "large" data file, _bigger_ than the total available memory, into several files.
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system. Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file. -Travis
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
Cheers, -- Fred
fred wrote:
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):

    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)

Should work... -Travis
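Extending that (untested) sketch to the stated goal of cutting the whole array into one file per block might look like the following; the input/output file names, the 4-byte float dtype, and C memory order are assumptions, and the example dimensions are the ones fred gives later in the thread:

    import numpy

    nx, ny, nz = 1200, 1600, 720        # example dimensions (from later in the thread)
    # Map the file read-only; nothing is read until pages are actually touched.
    a = numpy.memmap('input.dat', mode='r', dtype=numpy.float32,
                     shape=(nx, ny, nz))

    bx, by, bz = nx // 4, ny // 4, nz // 4   # block dimensions
    for i in range(4):
        for j in range(4):
            for k in range(4):
                # Slicing the memmap faults in only the pages this block covers.
                block = a[i*bx:(i+1)*bx, j*by:(j+1)*by, k*bz:(k+1)*bz]
                block.tofile('block_%d_%d_%d.dat' % (i, j, k))  # contiguous copy on disk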
Travis E. Oliphant wrote:
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):
    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)
Should work...
Travis: tons of thanks! :-))
Cheers, -- Fred
On Wed, Aug 6, 2008 at 8:14 PM, Travis E. Oliphant <oliphant@enthought.com> wrote:
fred wrote:
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):
    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)
Hi, We should already get used to using "//" instead of "/" if we want the result to be an integer. So that is:
    b = a[:nx//4,:ny//4,:nz//4]
... I'm just trying to advertise the 'future' (Python 3.0, or today: 'python -Qnew') so-called "true division" feature... Cheers, Sebastian Haase
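To illustrate the difference in a quick interpreter session (Python 2 shown):

    >>> 7 / 4                            # classic division of two ints
    1
    >>> 7 // 4                           # floor division: an int in Python 2 and 3 alike
    1
    >>> from __future__ import division
    >>> 7 / 4                            # "true division", the Python 3 / -Qnew behaviour
    1.75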
Hi, I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc.) that represent a 3D time series of data. Right now I am cycling through all the files, reading the entire dataset into memory and extracting the subset I need. This works but is extremely memory hungry and slow, and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in, but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this with multiple files, i.e. use memmap to generate an array[x,y,z,t] that I can then slice to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
thanks, - dharhas
On Thu, Aug 7, 2008 at 3:19 PM, Dharhas Pothina <Dharhas.Pothina@twdb.state.tx.us> wrote:
Hi,
I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc.) that represent a 3D time series of data. Right now I am cycling through all the files, reading the entire dataset into memory and extracting the subset I need. This works but is extremely memory hungry and slow, and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in, but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this with multiple files, i.e. use memmap to generate an array[x,y,z,t] that I can then slice to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
Hi dharhas, yes, you can do all these things; I'm doing this for 3d and 4d image files. What file format are you interested in? I use MRC files... Cheers, Sebastian Haase
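A rough sketch of the multi-file case (untested; the header size, dtype, per-file shape, and file-name pattern are all guesses about the SELFE output, not facts from the thread):

    import glob
    import numpy

    header_bytes = 1024                  # assumed: computed by parsing the first file's header
    nt, nz, ny, nx = 24, 10, 200, 300    # assumed timesteps per file and grid shape

    # Sort numerically: a plain lexical sort puts '10_data.dat' before '2_data.dat'.
    files = sorted(glob.glob('*_data.dat'),
                   key=lambda f: int(f.split('_')[0]))

    # offset= skips the header section; each file is mapped, not read.
    maps = [numpy.memmap(f, dtype='float32', mode='r',
                         offset=header_bytes, shape=(nt, nz, ny, nx))
            for f in files]

    # e.g. a 1-d time series at a single grid point over the whole period;
    # only a few pages per file are actually read from disk.
    series = numpy.concatenate([m[:, 0, 50, 100] for m in maps])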
It isn't a standardized format. It is the output of a Fortran hydrodynamic circulation model called SELFE. The output files are Fortran binaries. I could probably cycle through the files and convert them to netcdf one by one with a Python script, but it would be quicker and more space efficient if I could directly use the original outputs. thanks, - dharhas
"Sebastian Haase" <haase@msg.ucsf.edu> 8/7/2008 11:00 AM >>> On Thu, Aug 7, 2008 at 3:19 PM, Dharhas Pothina <Dharhas.Pothina@twdb.state.tx.us> wrote: Hi,
I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc) that represent a 3D time series of data. Right now I am cycling through all the files reading the entire dataset to memory and extracting the subset I need. This works but is extremely memory hungry and slow and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this to with multiple files. ie use memmap to generate an array[x,y,z,t] that I can then use slicing to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
Hi dharhas yes, you can do all these things, I'm doing this for 3d and 4d images files. What file format are you interested in ? I use MRC files ... Cheers, Sebastian Haase _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
![](https://secure.gravatar.com/avatar/5c60cb604ce49389e7626e9eb0876189.jpg?s=120&d=mm&r=g)
Dharhas, If your files can be converted to netcdf (or grib), then we have a tool to do exactly what you want. Basically you'd run:

    cdscan -x full.xml *.nc

and it would generate an xml file that simulates being a full file. Then, using our cdms2 read module, you would do:

    f = cdms2.open('full.xml')
    data = f("var", time=('2008-1','2008-7'))

It would figure out for you which files to open. You could even be more restrictive by selecting a sub-region (latitude=(-20,20)), etc. For more info: http://cdat.sf.net C.
There are some issues with converting to netcdf, mainly the fact that there is no standard for unstructured grids in netcdf. Most of the tools work for structured grids. There have been a couple of attempts to come up with an unstructured-grid netcdf standard, but from what I can tell they petered out in 2006.
We are struggling with this right now, since we have a couple of different hydro models and are trying to define a common format so we can develop our analysis and vis tools. My present idea is to write a module that abstracts the details of each model format and allows me to load the data into Python. Will your module work with unstructured grids? - dharhas
Now, let's say I have scatter data in a big binary file (stored in the form (xi, yi, zi, vi)), like on the snapshot, showing a "small" scatter. How can I cut the scatter efficiently into several files, as in the previous mail? I can use memmap to "read" the whole file, but what then? It's more of an algorithmic issue from my point of view. TIA. Cheers, -- Fred
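One possible chunked pass over a memmap of the scatter (a sketch only; the 4x4x4 grid of blocks, float64 records, known bounding box, and file names are all assumptions):

    import numpy

    pts = numpy.memmap('scatter.dat', dtype=numpy.float64, mode='r')
    pts = pts.reshape(-1, 4)                 # one (x, y, z, v) record per row

    nb = 4                                   # blocks per axis (assumed)
    xmin, ymin, zmin = 0.0, 0.0, 0.0         # assumed bounding box of the scatter
    xmax, ymax, zmax = 1.0, 1.0, 1.0

    outs = {}                                # block index -> open output file
    chunk = 1000000                          # records per pass; keeps memory bounded
    for start in range(0, pts.shape[0], chunk):
        c = pts[start:start + chunk]         # only this chunk's pages are faulted in
        # integer block coordinates of every point in the chunk
        ix = numpy.clip(((c[:, 0] - xmin) / (xmax - xmin) * nb).astype(int), 0, nb - 1)
        iy = numpy.clip(((c[:, 1] - ymin) / (ymax - ymin) * nb).astype(int), 0, nb - 1)
        iz = numpy.clip(((c[:, 2] - zmin) / (zmax - zmin) * nb).astype(int), 0, nb - 1)
        key = (ix * nb + iy) * nb + iz       # flat block index per point
        for b in numpy.unique(key):
            if b not in outs:
                outs[b] = open('scatter_block_%03d.dat' % b, 'wb')
            c[key == b].tofile(outs[b])      # append this chunk's points for block b

    for f in outs.values():
        f.close()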
Travis E. Oliphant wrote:
Should work...
I have tested the trick on a file which has 2.7*10**9 nodes, i.e. > 2**31, and I get the following message:

    File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line 193, in __new__
        mm = mmap.mmap(fid.fileno(), bytes, access=acc)
    ValueError: mmap length is greater than file size

Is there a workaround to use long integers (if that is the issue)? TIA. Cheers, -- Fred
On Thu, Aug 7, 2008 at 5:36 PM, fred <fredmfp@gmail.com> wrote:
Travis E. Oliphant wrote:
Should work...
I have tested the trick on a file which has 2.7*10**9 nodes, i.e. > 2**31, and I get the following message:

    File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line 193, in __new__
        mm = mmap.mmap(fid.fileno(), bytes, access=acc)
    ValueError: mmap length is greater than file size

Is there a workaround to use long integers (if that is the issue)?
TIA.
Are you "really" on a 64-bit system ? Is this Linux ? Is your Python the original from the distro - or did you build it yourself ? Do a: >>> import sys;print sys.maxint Did you build numpy yourself or did you download a binary ? HTH, Sebastian
Sebastian Haase wrote:
Are you "really" on a 64-bit system ? Yes.
Is this Linux ? Yes.
Is your Python the original from the distro - or did you build it yourself ? Built myself.
Do a: >>> import sys;print sys.maxint I get the expected answer: 2**63-1.
Did you build numpy yourself or did you download a binary ? Built myself.
What's going on ? Cheers, -- Fred
On Fri, Aug 8, 2008 at 10:36 AM, fred <fredmfp@gmail.com> wrote:
Sebastian Haase wrote:
Are you "really" on a 64-bit system? Yes.
Is this Linux? Yes.
Is your Python the original from the distro - or did you build it yourself? I built it myself.
Do a: >>> import sys; print sys.maxint
I get the expected answer: 2**63-1.
Did you build numpy yourself or did you download a binary? I built it myself.
What's going on?
Don't know... what is the size of the file you are trying to open again - in bytes? What file system are you using (don't know if this is of any interest...)? -S.
Sebastian Haase wrote:
Don't know... what is the size of the file you are trying to open again - in bytes?

    -rw-r--r-- 1 fred users 5529600000 2008-08-07 16:31 input.sep

What file system are you using (don't know if this is of any interest...)? ext3
Cheers, -- Fred
On Fri, Aug 8, 2008 at 12:17 PM, fred <fredmfp@gmail.com> wrote:
Sebastian Haase wrote:
What file system are you using (don't know if this is of any interest...)? Hmmm, forget this thread.
A keyboard-to-chair interface problem.
Sorry.
That's O.K.
    5529600000 / 1024 / 1024 / 1024 = 5.14984130859

So you are saying you are mem-mapping a 5.2 GB file without problem!? That's pretty neat ;-) - Sebastian
Sebastian Haase wrote:

    5529600000 / 1024 / 1024 / 1024 = 5.14984130859

So you are saying you are mem-mapping a 5.2 GB file without problem!?
That's pretty neat ;-)
Dimensions were wrong in my code, yes. For 1200x1600x720, it works fine ;-)
Cheers, -- Fred
participants (5)
- Charles Doutriaux
- Dharhas Pothina
- fred
- Sebastian Haase
- Travis E. Oliphant