[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Mon Feb 28 18:35:22 EST 2011

Ralf Gommers <ralf.gommers <at> googlemail.com> writes:

> My first question after looking at this is why we would want three
> very similar ways to load memory-mapped arrays (np.memmap, np.load,
> np.lib.format.open_memmap)? They already exist but your changes make
> those three even more similar.

If I understand correctly, np.memmap requires you to specify the dtype. It
cannot by itself memory-map a file that has been saved with e.g. np.save. A
file with (structured, in my case) dtype information is much more
self-documenting than a plain binary file with no dtype. Functions for further
processing of the data then need only read the file to know how to interpret it.
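
To illustrate the difference, here is a minimal sketch. The record layout and
file names are made up for the example. A file written with np.save carries
its dtype and shape in the header, so np.load with mmap_mode can map it
without being told either; a raw binary file mapped with np.memmap needs the
dtype supplied by the caller.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout, chosen only for illustration.
rec_dtype = np.dtype([("id", "<i4"), ("value", "<f8")])
data = np.zeros(10, dtype=rec_dtype)
data["id"] = np.arange(10)

tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "records.npy")
raw_path = os.path.join(tmpdir, "records.bin")

# Self-describing file: dtype and shape are stored in the .npy header.
np.save(npy_path, data)
mapped = np.load(npy_path, mmap_mode="r")

# Plain binary file: the reader has to know the dtype in advance.
data.tofile(raw_path)
raw = np.memmap(raw_path, dtype=rec_dtype, mode="r")
```

In both cases the result is an np.memmap, but only the .npy file tells the
reader how to interpret its contents.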

> I'd think we want one simple version (load) and a full-featured one.
> So imho changing open_memmap but leaving np.load as-is would be the
> way to go.

np.load calls open_memmap if mmap_mode is specified. The only change I made to
np.load was to add offset and shape parameters that are passed through to
open_memmap. (A shape argument is required for offset to be really useful, at
least for my use case of multiple processes memory-mapping their own portion of
a file.)

> > Note that the offset is in terms of array elements, not bytes (which is what
> > np.memmap uses), because that was what I had use for.
> 
> This should be kept in bytes like in np.memmap I think, it's just
> confusing if those two functions differ like that.

I agree that there is room for confusion, but it is quite inconvenient having 
to access the file once for the dtype, then compute the offset based on the 
item size of the dtype, then access the file again for the real memory-mapping.

The single most common scenario for me is "I am process n out of N, and I will
memory-map my fair share of this file". Given n and N, I can compute the offset
and shape in terms of array elements, but converting it to bytes means a couple
of extra lines of code every time I do it.

(If anything, I'd prefer the offset argument for np.memmap to be in elements
too, in accordance with how indexing and striding work with Numpy arrays.)
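
To make the two-pass approach concrete, here is a rough sketch of what it
looks like without an element-based offset. The helper name memmap_segment,
the record layout and the worker count are all hypothetical; the sketch
assumes a one-dimensional array stored in format version 1.0. It reads the
header once for the dtype, converts the element offset to bytes, and then
calls np.memmap for the real mapping. The usage part preallocates the file
with open_memmap and lets each "process" (a plain loop here) fill its share.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def memmap_segment(path, start, count, mode="r+"):
    """Map `count` elements starting at element `start` of a 1-D .npy file.

    Hypothetical helper, not part of NumPy.
    """
    # First access: parse the header for the dtype and the start of the data.
    with open(path, "rb") as fp:
        version = npy_format.read_magic(fp)
        if version != (1, 0):
            raise ValueError("this sketch only handles format version 1.0")
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        data_start = fp.tell()
    if len(shape) != 1:
        raise ValueError("this sketch only handles 1-D arrays")
    if start < 0 or count < 0 or start + count > shape[0]:
        raise ValueError("segment lies outside the array")
    # Convert the element offset into the byte offset np.memmap expects.
    byte_offset = data_start + start * dtype.itemsize
    # Second access: the actual memory mapping.
    return np.memmap(path, dtype=dtype, mode=mode,
                     offset=byte_offset, shape=(count,))


# Preallocate the whole array on disk without building it in memory.
rec_dtype = np.dtype([("id", "<i4"), ("value", "<f8")])  # made-up layout
path = os.path.join(tempfile.mkdtemp(), "big.npy")
total, n_workers = 12, 4
whole = npy_format.open_memmap(path, mode="w+", dtype=rec_dtype,
                               shape=(total,))
del whole  # release the mapping so the header and zeros are on disk

# Each worker n maps and fills only its own non-overlapping share.
per_worker = total // n_workers
for n in range(n_workers):
    seg = memmap_segment(path, n * per_worker, per_worker)
    seg["id"] = n
    seg.flush()
    del seg

result = np.load(path)
```

With offset and shape given in elements, everything in memmap_segment up to
the final call would collapse into a single call per process.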

> Another thing to change: you should not use an assert statement, use
> "if ....: raise ...." instead.

Will do if this gets support. Thanks for the feedback 8-)

> > My use case was to preallocate a big record array on disk, then start many
> > processes writing to their separate, memory-mapped segments of the file. The
> > end result was one big array on disk, with the correct shape and data type
> > information. Using a record array makes the data structure more
> > self-documenting. Using open_memmap with mode="w+" is the fastest way I've
> > found to preallocate an array on disk; it does not create the huge array in
> > memory. Letting multiple processes memory-map and read/write to
> > non-overlapping portions without interfering with each other allows for
> > fast, simple parallel I/O.
> >
> > I've used this extensively on Numpy 1.4.0, but based my Git checkout on the
> > current Numpy trunk. There have been some rearrangements in np.load since
> > then (it used to be in np.lib.io and is now in np.lib.npyio), but as far as
> > I can see, my modifications carry over fine. I haven't had a chance to test
> > with Numpy trunk, though. (What is the best way to set up a test version
> > without affecting my working 1.4.0 setup?)
> 
> You can use an in-place build
> (http://projects.scipy.org/numpy/wiki/DevelopmentTips) and add that
> dir to your PYTHONPATH.

Very helpful. Thanks again!

Regards,
Jon Olav