[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Robert Kern robert.kern at gmail.com
Tue Mar 1 18:24:42 EST 2011


On Tue, Mar 1, 2011 at 17:06, Jon Olav Vik <jonovik at gmail.com> wrote:
> Robert Kern <robert.kern <at> gmail.com> writes:
>> >> It's up to the virtual memory manager, but usually, it will just load
>> >> those pages (chunks the size of mmap.PAGESIZE) that are touched by
>> >> your request and write them back.
>> >
>> > What if two processes touch adjacent chunks that are smaller than a page? Is
>> > there a risk that writing back an entire page will overwrite the efforts of
>> > another process?
>>
>> I believe that there is only one page in main memory. Each process is
>> simply pointed to the same page. As long as you don't write to the
>> same specific byte, you'll be fine.
>
> Within a single machine, that sounds fine. What about processes running on
> different nodes, with different main memories?

You mean mmapping a file on a shared file system? Then it's up to the
file system. I'm honestly not sure what would happen for your
particular file system. Try it and report back.

In any case, using the offset won't help. The virtual memory manager
always deals with whole pages of size mmap.ALLOCATIONGRANULARITY,
aligned with the start of the file. Under the covers, np.memmap()
rounds the offset down to the nearest allocation boundary and then
adjusts the array pointer forward.
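
Roughly, the alignment arithmetic is as follows (a sketch mirroring
what np.memmap() does internally; "offset" here is the byte offset
you pass in):

    import mmap

    # Round the requested offset down to an allocation boundary; the
    # file is mapped starting at `start`, and the returned array
    # points (offset - start) bytes into that mapping.
    start = offset - offset % mmap.ALLOCATIONGRANULARITY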

For performance reasons, I don't recommend doing it anyway. The
networked file system becomes the bottleneck, in my experience.

>> > Pardon me if I misunderstand, but isn't that what np.load does
>> > already, with or without my modifications?
>>
>> With your modifications, the user does not get to see the header
>> information before they pick the offset and shape. I contend that the
>> user ought to read the shape information before deciding the shape to
>> use.
>
> Actually, that is what I've done for my own use (trivial parallelism, where I
> know that axis 0 is "long" and suitable for dividing the workload): Read the
> shape first, divide its first dimension into chunks with np.array_split(), then
> memmap the portion I need. I didn't submit that function for inclusion because
> it is rather specific to my own work. For process "ID" out of "NID", the code
> is roughly as follows:
>
> def memmap_chunk(filename, ID, NID, mode="r"):
>    # Open read-only once just to learn the shape and dtype.
>    r = open_memmap(filename, "r")
>    n = r.shape[0]
>    # Split axis 0 into NID chunks and take the rows for this process.
>    i = np.array_split(np.arange(n), NID)[ID]
>    if len(i) > 0:
>        offset = i[0]
>        shape = 1 + i[-1] - i[0]
>        return open_memmap(filename, mode=mode, offset=offset, shape=shape)
>    else:
>        return np.empty(0, r.dtype)
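>
> A hypothetical call, e.g. for process 3 of 8 (do_work is just a
> placeholder for the real computation):
>
>    chunk = memmap_chunk("results.npy", 3, 8, mode="r+")
>    chunk[:] = do_work(chunk)  # each process writes only its own rows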
>
>> I don't think that changing the np.load() API is the best way to
>> solve this problem.
>
> I can agree with that. What I actually use is open_memmap() as shown above, but
> couldn't have done it without offset and shape arguments.
>
> In retrospect, changing np.load() was maybe a misstep in trying to generalize
> from my own hacks to something that might be useful to others. I kind of added
> offset and shape to np.load "for completeness", as it offers a mmap_mode
> argument but no way to memory-map just a portion of a file.
>
> So to attempt a summary: memory-mapping with np.load may be useful to conserve
> memory in a single process (with no need for offset and shape arguments), but
> splitting workload across multiple processes is best done with open_memmap.
> Then I humbly suggest that having offset and shape arguments to open_memmap is
> useful.
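>
> For the single-process case,
>
>    a = np.load("results.npy", mmap_mode="r")
>
> already suffices: it maps the whole array without reading it into
> memory.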

I disagree. The important bit is to get the header information and
the data offset out of the file without loading any data. Once you
have that, np.memmap() suffices. You don't need to alter
open_memmap() at all. In fact, if you do use open_memmap() to read
the information, then you can't implement your "64-bit-large file on
a 32-bit machine" use case.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


