[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Tue Mar 1 11:07:00 EST 2011

On Tue, Mar 1, 2011 at 07:20, Jon Olav Vik <jonovik at gmail.com> wrote:
> Robert Kern <robert.kern <at> gmail.com> writes:
>
>> On Mon, Feb 28, 2011 at 18:50, Sturla Molden <sturla <at> molden.no> wrote:
>> > Den 01.03.2011 01:15, skrev Robert Kern:
>> >> You can have each of those processes memory-map the whole file and
>> >> just operate on their own slices. Your operating system's virtual
>> >> memory manager should handle all of the details for you.
>
> Wow, I didn't know that. So as long as the ranges touched by each process do
> not overlap, I'll be safe? If I modify only a few discontiguous chunks in a
> range, will the virtual memory manager decide whether it is most efficient to
> write just the chunks or the entire range back to disk?

It's up to the virtual memory manager, but usually it will load only
those pages (chunks the size of mmap.PAGESIZE) that are touched by
your request, and write back only the ones you modified.
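
A minimal sketch of the "map the whole file, touch only your slice"
pattern, assuming a 2-D float64 output array split by rows. The names
fill_slice() and demo(), the file name, and the array shape are
illustrative only, and the two fill_slice() calls stand in for what
would be separate worker processes:

```python
import os
import shutil
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def fill_slice(filename, start, stop, value):
    """Map the whole .npy file, but write only rows [start, stop)."""
    arr = open_memmap(filename, mode='r+')
    arr[start:stop] = value   # only the pages backing these rows are touched
    arr.flush()               # ask for the modified pages to be written back
    del arr                   # drop the mapping


def demo():
    tmpdir = tempfile.mkdtemp()
    try:
        filename = os.path.join(tmpdir, 'output.npy')
        # Preallocate the output file on disk.
        out = open_memmap(filename, mode='w+', dtype=np.float64,
                          shape=(1000, 4))
        out.flush()
        del out
        # Stand-ins for independent workers with non-overlapping ranges.
        fill_slice(filename, 0, 500, 1.0)
        fill_slice(filename, 500, 1000, 2.0)
        # Read everything back into memory to check the result.
        return np.load(filename)
    finally:
        shutil.rmtree(tmpdir)
```

Each worker holds a mapping of the full file, but only the rows it
assigns to are paged in and written out.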

>> > Mapping large files from the start will not always work on 32-bit
>> > systems. That is why mmap.mmap takes an offset argument now (Python 2.7
>> > and 3.1.)
>> >
>> > Making a view np.memmap with slices is useful on 64-bit but not 32-bit
>> > systems.
>>
>> I'm talking about the OP's stated use case where he generates the file
>> via memory-mapping the whole thing on the same machine. The whole file
>> does fit into the address space in his use case.
>>
>> I'd like to see a real use case where this does not hold. I suspect
>> that this is not the API we would want for such use cases.
>
> Use case: Generate "large" output for "many" parameter scenarios.
> 1. Preallocate "enormous" output file on disk.
> 2. Each process fills in part of the output.
> 3. Analyze, aggregate results, perhaps save to HDF or database, in a sliding-
> window fashion using a memory-mapped array. The aggregated results fit in
> memory, even though the raw output doesn't.
>
> My real work has been done on a 64-bit cluster running 64-bit Python, but I'd
> like to have the option of post-processing on my laptop's 32-bit Python (either
> spending a few hours copying the file to my laptop first, or mounting the
> remote disk using e.g. ExpanDrive).

Okay, in this case, I don't think that just adding an offset argument
to np.load() is very useful. You will want to read the dtype and shape
information from the header, *then* decide what offset and shape to
use for the memory-mapped segment. You will want to use the functions
read_magic() and read_array_header_1_0() from np.lib.format directly.
You can slightly modify the logic in open_memmap():

        # Read the header of the file first.
        fp = open(filename, 'rb')
        try:
            version = read_magic(fp)
            if version != (1, 0):
                msg = "only support version (1,0) of file format, not %r"
                raise ValueError(msg % (version,))
            shape, fortran_order, dtype = read_array_header_1_0(fp)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()
        finally:
            fp.close()

        chunk_offset, chunk_shape = decide_offset_shape(
            dtype, shape, fortran_order, offset)

        marray = np.memmap(filename, dtype=dtype, shape=chunk_shape,
            order=('F' if fortran_order else 'C'), mode='r+',
            offset=chunk_offset)
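
decide_offset_shape() is left for you to write. One possible sketch,
assuming you want a contiguous block of rows along the first axis of a
C-ordered array; the start and stop parameters are hypothetical
additions for illustration, not part of any NumPy API:

```python
import numpy as np


def decide_offset_shape(dtype, shape, fortran_order, offset,
                        start=0, stop=None):
    """Hypothetical helper: select rows [start, stop) of the first axis.

    Returns (chunk_offset, chunk_shape) suitable for np.memmap().
    Only C-ordered arrays are handled, because only then is a block of
    leading-axis rows contiguous in the file.
    """
    if fortran_order:
        raise ValueError("row chunks are not contiguous in Fortran order")
    if len(shape) == 0:
        raise ValueError("cannot take a row chunk of a 0-d array")
    nrows = shape[0]
    if stop is None:
        stop = nrows
    if not 0 <= start <= stop <= nrows:
        raise ValueError("invalid row range %r:%r" % (start, stop))
    # Bytes in one row: itemsize times the product of the trailing dims.
    row_nbytes = dtype.itemsize
    for n in shape[1:]:
        row_nbytes *= n
    # Skip the header (offset) and then the rows before the chunk.
    chunk_offset = offset + start * row_nbytes
    chunk_shape = (stop - start,) + tuple(shape[1:])
    return chunk_offset, chunk_shape
```

Called with only the four original arguments it selects the whole
array; a sliding-window pass would call it repeatedly with successive
start/stop pairs and map one window at a time.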

What might help is combining the first stanza of logic into a single
read_header() function that returns the usual information plus the
offset of the actual data. That lets you avoid replicating the
logic for handling different format versions.
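
A rough sketch of what such a read_header() could look like. This is
not an existing NumPy function, just an illustration. It assumes a
NumPy release that also provides read_array_header_2_0() (added after
this thread was written), and demo() is only a hypothetical round-trip
check:

```python
import os
import shutil
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def read_header(filename):
    """Return (shape, fortran_order, dtype, offset) for a .npy file.

    offset is the byte position where the array data starts.
    """
    with open(filename, 'rb') as fp:
        version = npy_format.read_magic(fp)
        if version == (1, 0):
            header = npy_format.read_array_header_1_0(fp)
        elif version == (2, 0):
            header = npy_format.read_array_header_2_0(fp)
        else:
            msg = "unsupported .npy format version %r"
            raise ValueError(msg % (version,))
        shape, fortran_order, dtype = header
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        offset = fp.tell()
    return shape, fortran_order, dtype, offset


def demo():
    """Save a small array, read its header, and memory-map the data."""
    tmpdir = tempfile.mkdtemp()
    try:
        filename = os.path.join(tmpdir, 'a.npy')
        np.save(filename, np.arange(12, dtype=np.float64).reshape(3, 4))
        shape, fortran_order, dtype, offset = read_header(filename)
        m = np.memmap(filename, dtype=dtype, shape=shape,
                      order=('F' if fortran_order else 'C'),
                      mode='r', offset=offset)
        result = np.array(m)   # copy out before the file is removed
        del m
        return shape, fortran_order, dtype, offset, result
    finally:
        shutil.rmtree(tmpdir)
```

With the version handling in one place, the caller only has to decide
which offset and shape to pass to np.memmap().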

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco