[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Tue Mar 1 18:06:51 EST 2011

Robert Kern <robert.kern <at> gmail.com> writes:
> >> It's up to the virtual memory manager, but usually, it will just load
> >> those pages (chunks the size of mmap.PAGESIZE) that are touched by
> >> your request and write them back.
> >
> > What if two processes touch adjacent chunks that are smaller than a page? Is
> > there a risk that writing back an entire page will overwrite the efforts of
> > another process?
> 
> I believe that there is only one page in main memory. Each process is
> simply pointed to the same page. As long as you don't write to the
> same specific byte, you'll be fine.

Within a single machine, that sounds fine. What about processes running on 
different nodes, with different main memories?

> > Pardon me if I misunderstand, but isn't that what np.load does already,
> > with or without my modifications?
> 
> With your modifications, the user does not get to see the header
> information before they pick the offset and shape. I contend that the
> user ought to read the shape information before deciding the shape to
> use.

Actually, that is what I've done for my own use (trivial parallelism, where I 
know that axis 0 is "long" and suitable for dividing the workload): Read the 
shape first, divide its first dimension into chunks with np.array_split(), then 
memmap the portion I need. I didn't submit that function for inclusion because 
it is rather specific to my own work. For process "ID" out of "NID", the code 
is roughly as follows:

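import numpy as np
from numpy.lib.format import open_memmap

def memmap_chunk(filename, ID, NID, mode="r"):
    # NOTE: offset and shape are the arguments proposed in this thread.
    r = open_memmap(filename, "r")  # read-only map, used only for shape and dtype
    n = r.shape[0]
    # Indices along axis 0 assigned to process ID out of NID.
    i = np.array_split(np.arange(n), NID)[ID]
    if len(i) == 0:
        # More processes than rows: this process gets an empty array.
        return np.empty(0, r.dtype)
    offset = i[0]
    shape = 1 + i[-1] - i[0]
    return open_memmap(filename, mode=mode, offset=offset, shape=shape)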
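
As a point of comparison, the same row-chunking can be sketched with functions
that already exist in NumPy, without the proposed offset and shape arguments:
read the header to find where the data starts, then pass a byte offset to
np.memmap. This is only a sketch under assumptions: the name memmap_rows is
made up for illustration, and it handles only C-ordered files in .npy format
version 1.0 with no object dtype.

```python
import numpy as np
from numpy.lib import format as npy_format


def memmap_rows(filename, start, stop, mode="r"):
    """Memory-map rows [start, stop) of a C-ordered .npy file (hypothetical helper)."""
    with open(filename, "rb") as fp:
        version = npy_format.read_magic(fp)
        if version != (1, 0):
            raise ValueError("this sketch only handles .npy format version 1.0")
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        data_start = fp.tell()  # array data begins right after the header

    if fortran_order:
        raise ValueError("rows are not contiguous in a Fortran-ordered file")
    if dtype.hasobject:
        raise ValueError("object arrays cannot be memory-mapped")
    if len(shape) == 0 or not 0 <= start <= stop <= shape[0]:
        raise ValueError("need a non-scalar array and 0 <= start <= stop <= shape[0]")

    tail = tuple(shape[1:])
    if start == stop:
        # A zero-length mapping is not possible, so return an ordinary empty array.
        return np.empty((0,) + tail, dtype)

    # Number of bytes occupied by one entry along axis 0.
    row_bytes = dtype.itemsize * int(np.prod(tail, dtype=np.int64))
    return np.memmap(filename, dtype=dtype, mode=mode,
                     offset=data_start + start * row_bytes,
                     shape=(stop - start,) + tail)
```

Process ID out of NID would call it with the first and one-past-last index of
its chunk, e.g. as obtained from np.array_split above.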

> I don't think that changing the np.load() API is the best way to
> solve this problem.

I can agree with that. What I actually use is open_memmap() as shown above, but 
couldn't have done it without offset and shape arguments.

In retrospect, changing np.load() was maybe a misstep in trying to generalize 
from my own hacks to something that might be useful to others. I kind of added 
offset and shape to np.load "for completeness", as it offers a mmap_mode 
argument but no way to memory-map just a portion of a file.

So, to attempt a summary: memory-mapping with np.load may be useful to conserve 
memory in a single process (with no need for offset and shape arguments), but 
splitting a workload across multiple processes is best done with open_memmap. 
For that use, I humbly suggest that offset and shape arguments to open_memmap 
are useful.
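
To illustrate the single-process case in the summary, here is a minimal sketch
of conserving memory with the existing mmap_mode argument of np.load. The
function name blockwise_sum and the block_rows parameter are hypothetical, and
the sketch assumes real-valued numeric data with at least one dimension.

```python
import numpy as np


def blockwise_sum(filename, block_rows=1024):
    """Sum a .npy array along axis 0 without loading it all at once (hypothetical helper)."""
    # mmap_mode="r" maps the file read-only; data is paged in only when sliced.
    a = np.load(filename, mmap_mode="r")
    total = np.zeros(a.shape[1:], dtype=np.float64)
    for start in range(0, a.shape[0], block_rows):
        # Only the pages backing this block of rows need to be resident.
        total += a[start:start + block_rows].sum(axis=0)
    return total
```

No offset or shape is needed here, since one process walks the whole file and
slicing the mapped array selects the rows to touch.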