[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Ralf Gommers ralf.gommers at googlemail.com
Mon Feb 28 02:15:48 EST 2011


Hi Jon,

Thanks for the patch, and sorry for the slow reply.

On Thu, Feb 24, 2011 at 11:49 PM, Jon Olav Vik <jonovik at gmail.com> wrote:
> https://github.com/jonovik/numpy/compare/master...offset_memmap
>
> The `offset` argument to np.memmap enables memory-mapping a portion of a file
> on disk as a NumPy array. Memory-mapping can also be done with np.load, but it
> uses np.lib.format.open_memmap, which has no offset argument.
>
> I have added an offset argument to np.lib.format.open_memmap and np.load as
> detailed in the link above, and humbly submit the changes for review. This is
> my first time using git, so apologies for any mistakes.

My first question after looking at this is why we would want three
very similar ways to load memory-mapped arrays (np.memmap, np.load,
np.lib.format.open_memmap). All three already exist, but your changes
make them even more similar.

I'd think we want one simple version (np.load) and one full-featured
one. So imho changing open_memmap but leaving np.load as-is would be
the way to go.
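
For reference, a minimal sketch of the three existing routes to a
memory-mapped array backed by a .npy file. The file name and temporary
directory are only illustrative, and reading the data start from the
`.offset` attribute of the returned memmap is an assumption of this
sketch, not something taken from the patch.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Illustrative file; the name and location are assumptions for this sketch.
path = os.path.join(tempfile.mkdtemp(), "example.npy")
np.save(path, np.arange(10, dtype=np.int64))

# 1. The simple route: np.load with mmap_mode.
a = np.load(path, mmap_mode="r")

# 2. The .npy-aware route, which reads dtype and shape from the header.
b = open_memmap(path, mode="r")

# 3. The raw route: np.memmap knows nothing about the .npy header, so the
#    dtype, shape and byte offset of the data must be supplied by hand.
c = np.memmap(path, dtype=b.dtype, mode="r", shape=b.shape, offset=b.offset)
```

All three objects view the same ten values on disk; only the third
requires the caller to know the layout of the file.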

> Note that the offset is in terms of array elements, not bytes (which is what
> np.memmap uses), because that was what I had use for.

I think this should be kept in bytes, as in np.memmap; it's just
confusing if those two functions differ like that.
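
To illustrate the byte convention, here is a small sketch of how an
element index translates into the byte offset that np.memmap expects.
The raw file, its name and the chosen indices are made up for the
example.

```python
import os
import tempfile

import numpy as np

# Raw binary file of ten float64 values (no .npy header); path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "raw.bin")
np.arange(10, dtype=np.float64).tofile(path)

dtype = np.dtype(np.float64)
first_element = 4  # element index where the mapped window starts

# np.memmap takes its offset in bytes, so an element index is scaled by itemsize.
byte_offset = first_element * dtype.itemsize
window = np.memmap(path, dtype=dtype, mode="r", offset=byte_offset, shape=(3,))
```

Callers who think in elements can do this multiplication themselves,
which keeps the two functions consistent.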

Another thing to change: you should not use an assert statement; use
"if ...: raise ..." instead, since asserts are skipped when Python runs
with -O.
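
A sketch of the pattern I mean. The helper name and the condition it
checks are hypothetical and not taken from the patch; the point is only
that an explicit raise still runs when assert statements are stripped.

```python
def check_offset(offset):
    """Hypothetical validation helper; not part of NumPy or the patch."""
    # An explicit raise is executed even under "python -O", unlike an assert.
    if offset < 0:
        raise ValueError(f"offset must be non-negative, got {offset!r}")
    return offset
```

Raising ValueError also gives the caller a specific exception type to
catch, rather than a bare AssertionError.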

> My use case was to preallocate a big record array on disk, then start many
> processes writing to their separate, memory-mapped segments of the file. The
> end result was one big array on disk, with the correct shape and data type
> information. Using a record array makes the data structure more
> self-documenting. Using open_memmap with mode="w+" is the fastest way I've
> found to preallocate an array on disk; it does not create the huge array in
> memory. Letting multiple processes memory-map and read/write to
> non-overlapping portions without interfering with each other allows for fast,
> simple parallel I/O.
>
> I've used this extensively on Numpy 1.4.0, but based my Git checkout on the
> current Numpy trunk. There have been some rearrangements in np.load since then
> (it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can
> see, my modifications carry over fine. I haven't had a chance to test with
> Numpy trunk, though. (What is the best way to set up a test version without
> affecting my working 1.4.0 setup?)

You can use an in-place build
(http://projects.scipy.org/numpy/wiki/DevelopmentTips) and add that
dir to your PYTHONPATH.
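
As for the use case quoted above, a rough sketch of how it might look
with the existing functions and byte offsets: preallocate with
open_memmap, then map each segment with np.memmap. The record dtype,
sizes, path and the write_segment helper are all made up for
illustration, and the segments are written one after another here
rather than from separate processes.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Illustrative record dtype, sizes and path; none of these come from the patch.
rec = np.dtype([("id", np.int64), ("value", np.float64)])
n_total, n_chunk = 12, 4
path = os.path.join(tempfile.mkdtemp(), "big.npy")

# Preallocate the full array on disk without building it in memory.
full = open_memmap(path, mode="w+", dtype=rec, shape=(n_total,))
header_bytes = full.offset  # where the array data starts, after the .npy header
del full  # drop the mapping so pending changes are flushed


def write_segment(start):
    """Fill elements [start, start + n_chunk) through its own memory map."""
    seg = np.memmap(path, dtype=rec, mode="r+", shape=(n_chunk,),
                    offset=header_bytes + start * rec.itemsize)
    seg["id"] = np.arange(start, start + n_chunk)
    seg["value"] = 0.5 * seg["id"]
    seg.flush()


# Each call could run in a separate process; here they run one after another.
for start in range(0, n_total, n_chunk):
    write_segment(start)

result = np.load(path)
```

The end result is a single .npy file with the correct shape and dtype in
its header, filled from non-overlapping segments.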

Cheers,
Ralf


