partially reading a file...
Hi, Let's say I want to read a (binary) file which contains an nx*ny*nz array. Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file? TIA. Cheers, -- Fred
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array). Then, selecting out a block is as simple as slicing. -Travis
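For instance, a minimal sketch of the idea (the file name, dtype, and dimensions here are assumptions; Travis spells out essentially this further down the thread):

    import numpy

    nx, ny, nz = 400, 400, 400   # assumed dimensions of the array in the file
    # Map the file read-only; the OS pages data in only when it is accessed.
    a = numpy.memmap('data.dat', mode='r', dtype=numpy.float64, shape=(nx, ny, nz))
    sub = a[:nx // 4, :ny // 4, :nz // 4]   # a view; no data is read until it is used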
Travis E. Oliphant wrote:
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array).
Then, selecting out a block is as simple as slicing.
Maybe I should have mentioned this: the aim is to cut a "large" data file, _bigger_ than the total available memory, into several files.
Does memmap still apply? Cheers, -- Fred
fred wrote:
Travis E. Oliphant wrote:
fred wrote:
Hi,
Let's say I want to read a (binary) file which contains an nx*ny*nz array.
Is it possible to read a "sub-array" from this file, i.e. each block of (nx/4, ny/4, nz/4) for instance, without loading the whole file?
An easy way to do this which forces the operating system to do the work of partial loading is to use a memory mapped file as the source of the array (i.e. a memmap array).
Then, selecting out a block is as simple as slicing.
Maybe I should have mentioned this: the aim is to cut a "large" data file, _bigger_ than the total available memory, into several files.
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system. Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file. -Travis
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
Cheers, -- Fred
fred wrote:
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):

    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)

Should work... -Travis
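Extending that (untested) sketch to the stated goal of cutting the whole array into one file per block might look like the following; the input/output file names, the 4-byte float dtype, and C memory order are assumptions, and the example dimensions are the ones fred gives later in the thread:

    import numpy

    nx, ny, nz = 1200, 1600, 720        # example dimensions (from later in the thread)
    # Map the file read-only; nothing is read until pages are actually touched.
    a = numpy.memmap('input.dat', mode='r', dtype=numpy.float32,
                     shape=(nx, ny, nz))

    bx, by, bz = nx // 4, ny // 4, nz // 4   # block dimensions
    for i in range(4):
        for j in range(4):
            for k in range(4):
                # Slicing the memmap faults in only the pages this block covers.
                block = a[i*bx:(i+1)*bx, j*by:(j+1)*by, k*bz:(k+1)*bz]
                block.tofile('block_%d_%d_%d.dat' % (i, j, k))  # contiguous copy on disk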
Travis E. Oliphant wrote:
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):
    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)
Should work...
Travis: tons of thanks! :-))
Cheers, -- Fred
On Wed, Aug 6, 2008 at 8:14 PM, Travis E. Oliphant <oliphant@enthought.com> wrote:
fred wrote:
Travis E. Oliphant wrote:
Absolutely memory mapping still applies --- it's a perfect application for it. But, you will probably need a 64-bit system.
No problem.
Memory mapping is how the OS handles "virtual memory" which uses disk space to increase main memory. You are just using that idea directly with a memory mapped file.
Ok. Thanks for the hint.
More directly:
Use numpy.memmap --- look at the docstring for example use and help on all the arguments available. But, something like this (untested):
    a = numpy.memmap(<filename>, mode='r', dtype=float, shape=(nx,ny,nz))
    b = a[:nx/4,:ny/4,:nz/4]
    b.tofile(<somefilename>)
Hi, We should already get used to using "//" instead of "/" if we want the result to be an integer. So that is:
    b = a[:nx//4,:ny//4,:nz//4]
... I'm just trying to advertise the 'future' (Python 3.0, or today: 'python -Qnew') so-called "true division" feature... Cheers, Sebastian Haase
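To illustrate the difference in a quick interpreter session (Python 2 shown):

    >>> 7 / 4                            # classic division of two ints
    1
    >>> 7 // 4                           # floor division: an int in Python 2 and 3 alike
    1
    >>> from __future__ import division
    >>> 7 / 4                            # "true division", the Python 3 / -Qnew behaviour
    1.75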
Hi, I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc.) that represent a 3D time series of data. Right now I am cycling through all the files, reading the entire dataset into memory and extracting the subset I need. This works but is extremely memory hungry and slow, and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in, but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this with multiple files, i.e. use memmap to generate an array[x,y,z,t] that I can then slice to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
thanks, - dharhas
On Thu, Aug 7, 2008 at 3:19 PM, Dharhas Pothina <Dharhas.Pothina@twdb.state.tx.us> wrote:
Hi,
I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc.) that represent a 3D time series of data. Right now I am cycling through all the files, reading the entire dataset into memory and extracting the subset I need. This works but is extremely memory hungry and slow, and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in, but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this with multiple files, i.e. use memmap to generate an array[x,y,z,t] that I can then slice to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
Hi dharhas, yes, you can do all these things; I'm doing this for 3d and 4d image files. What file format are you interested in? I use MRC files... Cheers, Sebastian Haase
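A rough sketch of the multi-file case (untested; the header size, dtype, per-file shape, and file-name pattern are all guesses about the SELFE output, not facts from the thread):

    import glob
    import numpy

    header_bytes = 1024                  # assumed: computed by parsing the first file's header
    nt, nz, ny, nx = 24, 10, 200, 300    # assumed timesteps per file and grid shape

    # Sort numerically: a plain lexical sort puts '10_data.dat' before '2_data.dat'.
    files = sorted(glob.glob('*_data.dat'),
                   key=lambda f: int(f.split('_')[0]))

    # offset= skips the header section; each file is mapped, not read.
    maps = [numpy.memmap(f, dtype='float32', mode='r',
                         offset=header_bytes, shape=(nt, nz, ny, nx))
            for f in files]

    # e.g. a 1-d time series at a single grid point over the whole period;
    # only a few pages per file are actually read from disk.
    series = numpy.concatenate([m[:, 0, 50, 100] for m in maps])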
It isn't a standardized format. It is the output of a Fortran hydrodynamic circulation model called SELFE. The output files are Fortran binaries. I could probably cycle through the files and convert them to netcdf one by one with a Python script, but it would be quicker and more space efficient if I could directly use the original outputs. thanks, - dharhas
"Sebastian Haase" <haase@msg.ucsf.edu> 8/7/2008 11:00 AM >>> On Thu, Aug 7, 2008 at 3:19 PM, Dharhas Pothina <Dharhas.Pothina@twdb.state.tx.us> wrote: Hi,
I've been following the thread on 'partially reading a file' with some interest and have a related question.
So I have a series of large binary data files (1_data.dat, 2_data.dat, etc) that represent a 3D time series of data. Right now I am cycling through all the files reading the entire dataset to memory and extracting the subset I need. This works but is extremely memory hungry and slow and I'm running out of memory for datasets more than a year long. I could calculate which few files contain the data I need and only read those in but that is a bit cumbersome and also doesn't help if I need a 1d or 2d slice of the whole time period.
In the other thread Travis gave an example of using memmap to map a file to memory. Can I do this to with multiple files. ie use memmap to generate an array[x,y,z,t] that I can then use slicing to actually read what I need? Another complication is that each binary file has a header section and then a data section. By reading the first file I can calculate the offset for the data part of the file.
Hi dharhas yes, you can do all these things, I'm doing this for 3d and 4d images files. What file format are you interested in ? I use MRC files ... Cheers, Sebastian Haase _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
![](https://secure.gravatar.com/avatar/5c60cb604ce49389e7626e9eb0876189.jpg?s=120&d=mm&r=g)
Dharhas, If your files can be converted to netcdf (or grib), then we have a tool to do exactly what you want. Basically you'd run:

    cdscan -x full.xml *.nc

and it would generate an xml file that simulates being a full file. Then, using our cdms2 read module, you would do:

    f = cdms2.open('full.xml')
    data = f("var", time=('2008-1','2008-7'))

It would figure out for you which files to open. You could even be more restrictive by selecting a sub-region (latitude=(-20,20)), etc. For more info: http://cdat.sf.net C.
There are some issues with converting to netcdf, mainly the fact that there is no standard for unstructured grids in netcdf. Most of the tools work for structured grids. There have been a couple of attempts to come up with an unstructured-grid netcdf standard, but from what I can tell they petered out in 2006.
We are struggling with this right now, since we have a couple of different hydro models and are trying to define a common format so we can develop our analysis and vis tools. My present idea is to write a module that abstracts the details of each model format and allows me to load the data into Python. Will your module work with unstructured grids? - dharhas
Now, let's say I have scatter data in a big binary file (stored in the form (xi, yi, zi, vi)), like on the snapshot, showing a "small" scatter. How can I cut the scatter efficiently into several files, as in the previous mail? I can use memmap to "read" the whole file, but what then? It's more of an algorithmic issue from my point of view. TIA. Cheers, -- Fred
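One possible chunked pass over a memmap of the scatter (a sketch only; the 4x4x4 grid of blocks, float64 records, known bounding box, and file names are all assumptions):

    import numpy

    pts = numpy.memmap('scatter.dat', dtype=numpy.float64, mode='r')
    pts = pts.reshape(-1, 4)                 # one (x, y, z, v) record per row

    nb = 4                                   # blocks per axis (assumed)
    xmin, ymin, zmin = 0.0, 0.0, 0.0         # assumed bounding box of the scatter
    xmax, ymax, zmax = 1.0, 1.0, 1.0

    outs = {}                                # block index -> open output file
    chunk = 1000000                          # records per pass; keeps memory bounded
    for start in range(0, pts.shape[0], chunk):
        c = pts[start:start + chunk]         # only this chunk's pages are faulted in
        # integer block coordinates of every point in the chunk
        ix = numpy.clip(((c[:, 0] - xmin) / (xmax - xmin) * nb).astype(int), 0, nb - 1)
        iy = numpy.clip(((c[:, 1] - ymin) / (ymax - ymin) * nb).astype(int), 0, nb - 1)
        iz = numpy.clip(((c[:, 2] - zmin) / (zmax - zmin) * nb).astype(int), 0, nb - 1)
        key = (ix * nb + iy) * nb + iz       # flat block index per point
        for b in numpy.unique(key):
            if b not in outs:
                outs[b] = open('scatter_block_%03d.dat' % b, 'wb')
            c[key == b].tofile(outs[b])      # append this chunk's points for block b

    for f in outs.values():
        f.close()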
Travis E. Oliphant wrote:
Should work...
I have tested the trick on a file which has 2.7*10**9 nodes, i.e. > 2**31, and I get the following message:

    File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line 193, in __new__
        mm = mmap.mmap(fid.fileno(), bytes, access=acc)
    ValueError: mmap length is greater than file size

Is there a workaround to use long integers (if that is the issue)? TIA. Cheers, -- Fred
On Thu, Aug 7, 2008 at 5:36 PM, fred <fredmfp@gmail.com> wrote:
Travis E. Oliphant wrote:
Should work...
I have tested the trick on a file which has 2.7*10**9 nodes, i.e. > 2**31, and I get the following message:

    File "/usr/local/lib/python2.5/site-packages/numpy/core/memmap.py", line 193, in __new__
        mm = mmap.mmap(fid.fileno(), bytes, access=acc)
    ValueError: mmap length is greater than file size

Is there a workaround to use long integers (if that is the issue)?
TIA.
Are you "really" on a 64-bit system ? Is this Linux ? Is your Python the original from the distro - or did you build it yourself ? Do a: >>> import sys;print sys.maxint Did you build numpy yourself or did you download a binary ? HTH, Sebastian
Sebastian Haase wrote:
Are you "really" on a 64-bit system ? Yes.
Is this Linux ? Yes.
Is your Python the original from the distro - or did you build it yourself ? Built myself.
Do a: >>> import sys;print sys.maxint I get the expected answer: 2**63-1.
Did you build numpy yourself or did you download a binary ? Built myself.
What's going on ? Cheers, -- Fred
On Fri, Aug 8, 2008 at 10:36 AM, fred <fredmfp@gmail.com> wrote:
Sebastian Haase wrote:
Are you "really" on a 64-bit system? Yes.
Is this Linux? Yes.
Is your Python the original from the distro - or did you build it yourself? I built it myself.
Do a: >>> import sys; print sys.maxint
I get the expected answer: 2**63-1.
Did you build numpy yourself or did you download a binary? I built it myself.
What's going on?
Don't know... what is the size of the file you are trying to open again - in bytes? What file system are you using (don't know if this is of any interest...)? -S.
Sebastian Haase wrote:
Don't know... what is the size of the file you are trying to open again - in bytes?

    -rw-r--r-- 1 fred users 5529600000 2008-08-07 16:31 input.sep

What file system are you using (don't know if this is of any interest...)? ext3
Cheers, -- Fred
On Fri, Aug 8, 2008 at 12:17 PM, fred <fredmfp@gmail.com> wrote:
Sebastian Haase wrote:
What file system are you using (don't know if this is of any interest...)? Hmmm, forget this thread.
A keyboard-to-chair interface problem.
Sorry.
That's O.K.
    5529600000 / 1024 / 1024 / 1024 = 5.14984130859

So you are saying you are mem-mapping a 5.2 GB file without problem!? That's pretty neat ;-) - Sebastian
Sebastian Haase wrote:

    5529600000 / 1024 / 1024 / 1024 = 5.14984130859

So you are saying you are mem-mapping a 5.2 GB file without problem!?
That's pretty neat ;-)
Dimensions were wrong in my code, yes. For 1200x1600x720, it works fine ;-)
Cheers, -- Fred
participants (5)
- Charles Doutriaux
- Dharhas Pothina
- fred
- Sebastian Haase
- Travis E. Oliphant