[Numpy-discussion] Views of memmaps and offset

Sun Sep 23 14:51:46 EDT 2012

2012/9/23 Nathaniel Smith <njs at pobox.com>:
> On Sat, Sep 22, 2012 at 4:46 PM, Olivier Grisel
> <olivier.grisel at ensta.org> wrote:
>> There is also a third use case that is problematic on numpy master:
>>
>> orig = np.memmap('tmp.mmap', dtype=np.float64, shape=100, mode='w+')
>> # negative markers to detect under / overflows
>> orig[:] = np.arange(orig.shape[0]) * -1.0
>>
>> a = np.memmap('tmp.mmap', dtype=np.float64, shape=50, mode='r+', offset=16)
>> a[:] = np.arange(50)
>> b = np.asarray(a[10:])
>>
>> Now b does not even have a 'filename' attribute anymore. `b.base` is a
>> python mmap instance but the latter is created with a file descriptor.
>>
>> It would still be possible to use:
>>
>> from _multiprocessing import address_of_buffer
>>
>> to find the memory address of the mmap buffer and use that to open new
>> buffer views on the same memory segment from subprocesses using
>> `numpy.frombuffer((ctypes.c_byte * n_byte).from_address(addr))` but in
>> case of failure (e.g. the file has been deleted on the HDD) one gets a
>> segmentation fault instead of a much more user-friendly, catchable
>> "file not found" exception.
>
> On Unix, if the processes are related in a way that lets this work,
> then this would actually be a far better solution... it will always
> refer to the same file that was opened in the parent, even if it has
> since been deleted or renamed or replaced by a different file. (And if
> they aren't related by fork(), then sending the fd would be better
> than sending the filename, for the same reason.) Of course that
> doesn't help for Windows; no idea what happens there.
>
> Numpy in general really does not provide any reliable way of tracking
> the relationship between different views of the same buffer.
> Introspecting on .base will work in many cases, but it's not
> guaranteed to, even in earlier versions. Maybe you don't care because
> it works well enough, but it's an inherently rickety design :-). Trying
> to think of the correct solution here, I think it would have to be
> something like... have the numpy mmap code keep a global scorecard of
> all extant memory mappings -- filename, offset, length, memory
> address. And then when you want to do an "mmap aware pickle", you
> check the address of the array you're trying to save to see if it
> falls into an mmap'ed region. That'd be simpler and more reliable than
> anything involving base tracking.
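
For illustration, the "scorecard" idea quoted above could look roughly
like the sketch below. It is not numpy code: the names `MmapRegistry`,
`register` and `locate` are hypothetical, addresses are plain integers,
and the sketch assumes the recorded address is that of the byte at the
recorded file offset (it ignores page-alignment details). With numpy,
the address of an array could be read from
`arr.__array_interface__['data'][0]`.

```python
import bisect
from collections import namedtuple

# One row of the hypothetical scorecard.
Mapping = namedtuple("Mapping", "filename offset length address")


class MmapRegistry:
    """Hypothetical registry of live memory mappings, sorted by address."""

    def __init__(self):
        self._starts = []    # start addresses, kept sorted for bisect
        self._mappings = []  # Mapping rows in the same order

    def register(self, filename, offset, length, address):
        i = bisect.bisect_left(self._starts, address)
        self._starts.insert(i, address)
        self._mappings.insert(i, Mapping(filename, offset, length, address))

    def locate(self, address, nbytes=1):
        """Return (filename, file_offset) if the byte range
        [address, address + nbytes) lies inside a registered mapping,
        else None."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        m = self._mappings[i]
        if address + nbytes > m.address + m.length:
            return None
        return m.filename, m.offset + (address - m.address)


# Mirrors the example in this thread: a 50 x float64 mapping at file
# offset 16, and a view starting 10 items (80 bytes) into it.
registry = MmapRegistry()
registry.register("tmp.mmap", offset=16, length=50 * 8, address=0x1000)
hit = registry.locate(0x1000 + 10 * 8, nbytes=40 * 8)
miss = registry.locate(0x1000 + 50 * 8)
```

Here `hit` carries enough information (filename and file offset) to
reopen the same region in another process, and `miss` is None because
the address falls just past the end of the mapping.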

Well, base tracking seems to work really well on 1.6.2. Here is the
code that does the introspection / reconstruction of shared memory
views from sub-processes using the python multiprocessing Pool API:

https://github.com/joblib/joblib/pull/44/files#L5R55
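
As an illustration only (this is not the code from the pull request
above), base introspection can be sketched as a walk along the `.base`
chain until an object exposing a `filename` attribute is found. The
function name `find_filename_owner` and the `Stub` class are
hypothetical; the stubs stand in for arrays so that the sketch runs
without numpy.

```python
class Stub:
    """Hypothetical stand-in for an array-like object with a .base link."""

    def __init__(self, base=None, filename=None):
        self.base = base
        if filename is not None:
            self.filename = filename


def find_filename_owner(obj):
    """Follow .base links and return the first object that has a filename.

    Returns None when the chain ends without finding one.
    """
    while obj is not None:
        if getattr(obj, "filename", None) is not None:
            return obj
        obj = getattr(obj, "base", None)
    return None


# Chain that keeps the memmap: view -> memmap -> raw mmap buffer.
raw_buffer = Stub()
memmap_like = Stub(base=raw_buffer, filename="tmp.mmap")
view = Stub(base=memmap_like)
owner = find_filename_owner(view)

# Collapsed chain: the view points straight at the raw buffer,
# so the filename can no longer be recovered this way.
collapsed_view = Stub(base=raw_buffer)
lost = find_filename_owner(collapsed_view)
```

The second case is the situation described for numpy master: once the
intermediate memmap drops out of the chain, the walk finds nothing.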

The only clean solution I see for the collapsed base of numpy 1.7
would be to have the numpy.memmap class hold a custom wrapper around
mmap.mmap instead of the raw mmap.mmap instance. The wrapper would
still implement the Python buffer API but would also store the
filename and offset as additional attributes. To me that sounds much
cleaner than a "global scorecard of all extant memory mappings".
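
A minimal sketch of such a wrapper, assuming it is acceptable to
subclass mmap.mmap (which keeps the buffer API for free). The class
name `FileBackedMmap` and its constructor signature are hypothetical,
not an existing numpy API, and the offset is assumed to be a multiple
of mmap.ALLOCATIONGRANULARITY, as mmap itself requires.

```python
import mmap


class FileBackedMmap(mmap.mmap):
    """Hypothetical mmap.mmap subclass that remembers filename and offset."""

    def __new__(cls, filename, length=0, offset=0):
        # The descriptor can be closed right after mapping: mmap keeps
        # its own handle on the file.
        with open(filename, "r+b") as f:
            self = super().__new__(
                cls, f.fileno(), length,
                access=mmap.ACCESS_WRITE, offset=offset,
            )
        self.filename = filename
        self.offset = offset
        return self

    def __reduce__(self):
        # Pickle as "reopen this file region". A missing file then
        # raises a catchable OSError on unpickling, not a segfault.
        return (type(self), (self.filename, len(self), self.offset))
```

Because the object still exposes the buffer interface, something like
`numpy.frombuffer(m, dtype=...)` should accept it in the same way as a
plain mmap.mmap instance, while `m.filename` and `m.offset` stay
available for reconstructing the view in a sub-process.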

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel