[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Thu Feb 24 10:49:06 EST 2011

https://github.com/jonovik/numpy/compare/master...offset_memmap

The `offset` argument to np.memmap makes it possible to memory-map only a 
portion of a file on disk as a Numpy array. Memory-mapping can also be done with 
np.load, but it uses np.lib.format.open_memmap, which has no offset argument.
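
To illustrate the difference, here is a small sketch using only the existing 
API. The file names and sizes are made up for the example. np.memmap can map a 
sub-range of a raw file via a byte offset, whereas np.load with mmap_mode 
always maps the whole array stored in the .npy file.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()

# Raw binary file (no header): np.memmap can map just a part of it.
raw_path = os.path.join(tmpdir, "raw.dat")  # illustrative file name
np.arange(10, dtype=np.int64).tofile(raw_path)

itemsize = np.dtype(np.int64).itemsize
# np.memmap's offset is given in bytes, so skip 3 elements = 3 * itemsize bytes.
part = np.memmap(raw_path, dtype=np.int64, mode="r",
                 offset=3 * itemsize, shape=(4,))
part_values = part.tolist()
del part  # release the mapping

# .npy file: np.load(mmap_mode=...) maps the entire array.
npy_path = os.path.join(tmpdir, "whole.npy")  # illustrative file name
np.save(npy_path, np.arange(10, dtype=np.int64))
whole = np.load(npy_path, mmap_mode="r")
whole_shape = whole.shape
del whole
```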

I have added an offset argument to np.lib.format.open_memmap and np.load as 
detailed in the link above, and humbly submit the changes for review. This is 
my first time using git, so apologies for any mistakes.

Note that the offset is in terms of array elements, not bytes (which is what 
np.memmap uses), because that was what my use case needed. Also, I added a 
`shape` argument to np.load, so that only a portion of a file is memory-mapped.
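
For readers who want a similar effect without the patch, the element-to-byte 
conversion can be sketched with the existing API. This is not the code in the 
branch above. The helper name `memmap_npy_segment` is hypothetical, it handles 
only 1-D arrays, and it assumes that the memmap returned by open_memmap exposes 
the header length through its `offset` attribute.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def memmap_npy_segment(filename, start, count, mode="r+"):
    """Hypothetical helper: map `count` elements of a 1-D .npy file,
    beginning at element index `start` (an element offset, not bytes)."""
    whole = open_memmap(filename, mode="r")
    # Assumption: .offset of this memmap is the size of the .npy header.
    dtype, header_bytes, shape = whole.dtype, whole.offset, whole.shape
    del whole
    if len(shape) != 1:
        raise ValueError("this sketch only handles 1-D arrays")
    if start < 0 or count < 0 or start + count > shape[0]:
        raise ValueError("segment lies outside the array")
    # Convert the element offset to the byte offset that np.memmap expects.
    byte_offset = header_bytes + start * dtype.itemsize
    return np.memmap(filename, dtype=dtype, mode=mode,
                     offset=byte_offset, shape=(count,))


# Usage: overwrite elements 3..6 of a small array stored on disk.
path = os.path.join(tempfile.mkdtemp(), "data.npy")  # illustrative file name
np.save(path, np.arange(10, dtype=np.float64))

segment = memmap_npy_segment(path, start=3, count=4)
segment[:] = -1.0
segment.flush()
del segment

result = np.load(path)
```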

My use case was to preallocate a big record array on disk, then start many 
processes writing to their separate, memory-mapped segments of the file. The 
end result was one big array on disk, with the correct shape and data type 
information. Using a record array makes the data structure more 
self-documenting. Using open_memmap with mode="w+" is the fastest way I've 
found to preallocate an array on disk; it does not create the huge array in 
memory. Letting multiple processes memory-map and read/write to 
non-overlapping portions without interfering with each other allows for fast, 
simple parallel I/O.
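
A minimal sketch of that workflow, using only the existing API, might look like 
the following. The record fields, sizes and the `fill_segment` function are 
invented for the example. For simplicity the workers run one after another 
here, whereas in the real use case each call would run in its own process. 
Each worker in this sketch also maps the whole file and writes only to its own 
slice, rather than mapping a sub-range as the proposed offset argument does.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Illustrative record layout and sizes (not taken from the original post).
record_dtype = np.dtype([("id", np.int64), ("value", np.float64)])
n_records, n_workers = 12, 3
path = os.path.join(tempfile.mkdtemp(), "big.npy")

# Preallocate the array on disk; the data is not built in memory first.
prealloc = open_memmap(path, mode="w+", dtype=record_dtype, shape=(n_records,))
del prealloc  # close the mapping, leaving a valid .npy file behind


def fill_segment(filename, start, stop):
    """Write records start..stop-1; meant to be run by one worker."""
    whole = open_memmap(filename, mode="r+")
    whole["id"][start:stop] = np.arange(start, stop)
    whole["value"][start:stop] = np.arange(start, stop) * 0.5
    whole.flush()
    del whole


# Non-overlapping segments, one per worker.
chunk = n_records // n_workers
for worker in range(n_workers):
    fill_segment(path, worker * chunk, (worker + 1) * chunk)

# The result is a single .npy file with shape and dtype in its header.
result = np.load(path)
```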

I've used this extensively on Numpy 1.4.0, but based my Git checkout on the 
current Numpy trunk. There have been some rearrangements in np.load since then 
(it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can 
see, my modifications carry over fine. I haven't had a chance to test with 
Numpy trunk, though. (What is the best way to set up a test version without 
affecting my working 1.4.0 setup?)

Hope this can be useful,
Jon Olav Vik