[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Robert Kern robert.kern at gmail.com
Tue Mar 1 17:04:54 EST 2011

On Tue, Mar 1, 2011 at 15:36, Jon Olav Vik <jonovik at gmail.com> wrote:
> Robert Kern <robert.kern <at> gmail.com> writes:
>> >> >> You can have each of those processes memory-map the whole file and
>> >> >> just operate on their own slices. Your operating system's virtual
>> >> >> memory manager should handle all of the details for you.
>> >
>> > Wow, I didn't know that. So as long as the ranges touched by each process do
>> > not overlap, I'll be safe? If I modify only a few discontiguous chunks in a
>> > range, will the virtual memory manager decide whether it is most
>> > efficient to write just the chunks or the entire range back to disk?
>> It's up to the virtual memory manager, but usually, it will just load
>> those pages (chunks the size of mmap.PAGESIZE) that are touched by
>> your request and write them back.
> What if two processes touch adjacent chunks that are smaller than a page? Is
> there a risk that writing back an entire page will overwrite the efforts of
> another process?

I believe that there is only one copy of each page in main memory, and
each process is simply pointed to that same page. As long as you don't
write to the same specific byte, you'll be fine.
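
A minimal sketch of that pattern, assuming all workers run on one
machine against a local file. The helper names (create_output,
fill_slice, run_workers) are hypothetical and only for illustration;
np.lib.format.open_memmap() is the real function. Each worker maps the
whole file and writes only its own slice, even when adjacent slices
share a page:

```python
import multiprocessing as mp

import numpy as np
from numpy.lib.format import open_memmap


def create_output(path, n_items):
    """Preallocate a 1-D float64 .npy file (hypothetical helper)."""
    out = open_memmap(path, mode="w+", dtype=np.float64, shape=(n_items,))
    out.flush()
    del out  # drop the mapping; the file stays on disk


def fill_slice(path, start, stop, value):
    """Map the WHOLE file, but write only to [start, stop)."""
    arr = open_memmap(path, mode="r+")
    arr[start:stop] = value
    arr.flush()  # ask the OS to write dirty pages back
    del arr


def run_workers(path, n_items, n_workers):
    """Give each process its own non-overlapping range of the file."""
    create_output(path, n_items)
    bounds = [n_items * i // n_workers for i in range(n_workers + 1)]
    procs = [
        mp.Process(target=fill_slice,
                   args=(path, bounds[i], bounds[i + 1], float(i)))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Call run_workers() from under an `if __name__ == "__main__":` guard so
that it also works on platforms that spawn rather than fork worker
processes.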

>> > Use case: Generate "large" output for "many" parameter scenarios.
>> > 1. Preallocate "enormous" output file on disk.
>> > 2. Each process fills in part of the output.
>> > 3. Analyze, aggregate results, perhaps save to HDF or database, in a
>> > sliding-window fashion using a memory-mapped array. The aggregated
>> > results fit in memory, even though the raw output doesn't.
> [...]
>> Okay, in this case, I don't think that just adding an offset argument
>> to np.load() is very useful. You will want to read the dtype and shape
>> information from the header, *then* decide what offset and shape to
>> use for the memory-mapped segment. You will want to use the functions
>> read_magic() and read_array_header_1_0() from np.lib.format directly.
> Pardon me if I misunderstand, but isn't that what np.load does already, with or
> without my modifications?

With your modifications, the user does not get to see the header
information before they pick the offset and shape. I contend that the
user ought to read the shape information before deciding the shape to
use. I don't think that changing the np.load() API is the best way to
solve this problem.
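
Roughly, the header-first approach could look like the sketch below.
read_magic() and read_array_header_1_0() are the real functions from
np.lib.format; memmap_rows() and its arguments are a hypothetical
helper, and I am assuming a version 1.0 file holding a C-ordered array
that is sliced along its first axis:

```python
import numpy as np
from numpy.lib import format as npy_format


def memmap_rows(path, start, stop, mode="r"):
    """Memory-map rows [start, stop) of a .npy file (hypothetical helper)."""
    with open(path, "rb") as f:
        version = npy_format.read_magic(f)
        if version != (1, 0):
            raise ValueError("this sketch only handles format version 1.0")
        # The header tells us the full shape and dtype *before* we
        # commit to an offset and a shape for the mapped segment.
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        data_offset = f.tell()  # the array data starts right after the header

    if fortran_order:
        raise ValueError("this sketch assumes C-ordered data")
    if len(shape) == 0:
        raise ValueError("cannot slice a 0-d array")

    # Clip the requested range against the real first dimension.
    start, stop, _ = slice(start, stop).indices(shape[0])
    if stop <= start:
        raise ValueError("empty row range")

    row_shape = tuple(shape[1:])
    row_bytes = dtype.itemsize * int(np.prod(row_shape, dtype=np.int64))
    return np.memmap(
        path,
        dtype=dtype,
        mode=mode,
        offset=data_offset + start * row_bytes,
        shape=(stop - start,) + row_shape,
        order="C",
    )
```

For example, memmap_rows("output.npy", 1000, 2000, mode="r+") would map
only rows 1000 to 1999 for reading and writing, where "output.npy"
stands in for whatever file was preallocated.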

Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
