Re: [Numpy-discussion] .transpose() of memmap array fails to close()

17 Aug 2007

On 16/08/07, Glen W. Mabey wrote:
...
On Wed, Aug 15, 2007 at 08:50:28PM -0400, Anne Archibald wrote:
...
But to be pythonic, or numpythonic, when the original A is
garbage-collected, the garbage collection should certainly close the
mmap.
Humm, this would be less than ideal for my use case, when the data on
disk is organized in a different dimensional order than I want to refer
to it in my code.  For example:
p_data = numpy.memmap( datafilename, shape=( 10, 1024, 20 ), dtype=numpy.float32, mode='r')
u_data = p_data.transpose( [ 2, 0, 1 ] )
and I don't want to have to keep track of p_data because it's only u_data
that I care about and want to use.  And I promise, this is not a
contrived example.  I have data that I really do want to be ordered in a
certain way on disk, for I/O efficiency reasons, yet when I logically
index into it in my code, I want the dimensions to be in a different
order.
Perfectly reasonable. Note that p_data cannot be collected until
u_data goes away too, so the mmap is safe. And transpose()ing doesn't
copy any data, so even if you get an ndarray, you haven't lost the
ability to modify things on disk.
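
A minimal sketch of that claim, for anyone who wants to check it. The scratch file is a hypothetical stand-in for datafilename, and mode="r+" is used instead of the original mode='r' so the write-through is visible; both are assumptions, not part of the original example.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file standing in for `datafilename`.
path = os.path.join(tempfile.mkdtemp(), "data.f32")
np.zeros((10, 1024, 20), dtype=np.float32).tofile(path)

# mode="r+" (an assumption; the original used mode='r') allows writing.
p_data = np.memmap(path, shape=(10, 1024, 20), dtype=np.float32, mode="r+")
u_data = p_data.transpose([2, 0, 1])

# The transpose is a view onto the same mapped memory, not a copy.
shares = np.shares_memory(p_data, u_data)

# Dropping the original name does not unmap anything: u_data still
# holds a reference chain back to the mapping.
del p_data

# u_data[a, b, c] addresses the element that was p_data[b, c, a].
u_data[3, 1, 2] = 7.0
u_data.flush()

# Read the file back independently to confirm the write reached disk.
on_disk = np.fromfile(path, dtype=np.float32).reshape(10, 1024, 20)
```

After this runs, on_disk[1, 2, 3] holds the value written through u_data.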
...
...
Being able to apply flush() or whatever to slices is not necessarily
unpythonic, but it's probably a lot simpler to reliably implement
slices of mmap()s as simple slices of ordinary arrays.
I considered this approach, but what happens if you want to instantiate
a slice that is very large, e.g., larger than the size of your physical
RAM?  In that case, you can't afford to make simple slices be ordinary
arrays, besides the case where you want to change values.  Making them
functionally memmap-arrays, but without .sync() and .close() doesn't
seem right either.
I was a bit ambiguous. An ordinary numpy array is an ndarray object,
which contains some housekeeping data (dimension, shape, stride
lengths, some flags, what have you) and a pointer to a hunk of memory.
That hunk of memory can be pretty much any directly-addressable
memory, for example a contiguous block of malloc()ed RAM, the
beginning of a (possibly strided) subblock of an existing piece of
malloc()ed RAM, a pointer to an array statically allocated in some C
or Fortran library... or a piece of memory in an mmap()ed region.
Numpy doesn't care at all about the difference. In fact this is the
beauty of numpy: because all it cares about is where the elements
start, what they look like, how many there are, and how far apart they
are, it can cheaply create subarrays without copying any data.
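
To make the "start, shape, strides" picture concrete, here is a small sketch using an ordinary in-memory array; the array and the particular slices are only illustrative.

```python
import numpy as np

# A C-contiguous block of 24 float32 values (4 bytes each).
a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# A strided subblock: skips the first row of axis 1, every other column.
sub = a[:, 1:, ::2]

# A transpose: same memory, axes visited in a different order.
t = a.transpose(2, 0, 1)

# Only the housekeeping data differs; no element is copied.
strides = (a.strides, sub.strides, t.strides)

# Writing through the subarray changes the original block.
sub[0, 0, 0] = -1.0
```

Here sub[0, 0, 0] and a[0, 1, 0] are the same four bytes, so the assignment is visible through a and t as well.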

So naively, one might implement mmap()ed arrays with a factory
function that called mmap(), got back a pointer to the place in
virtual memory where the file's contents appear to live, and whipped
up a perfectly ordinary ndarray to point to the contents. It would
work, thanks to the magic of the OS's mmap() call. The only problem is
you would have to figure out when it was safe to close the mmap()
(invalidating the array's memory!) and you would have no convenient
way to flush() the mmap() out to disk.
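
A rough sketch of such a naive factory, assuming Python's standard mmap module and numpy.frombuffer as the building blocks. The function name naive_mapped_array and the scratch file are hypothetical, and this is not how numpy.memmap itself is implemented.

```python
import mmap
import os
import tempfile

import numpy as np


def naive_mapped_array(path, dtype, shape):
    """Map a file and wrap the mapping in a plain ndarray (hypothetical helper).

    Returns both the array and the mmap object, because the plain ndarray
    has no flush() of its own and the caller must decide when (if ever)
    closing the mapping is safe.
    """
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the mapping outlives the file object.
        m = mmap.mmap(f.fileno(), 0)
    arr = np.frombuffer(m, dtype=dtype).reshape(shape)
    return arr, m


# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "naive.f32")
np.zeros(12, dtype=np.float32).tofile(path)

arr, m = naive_mapped_array(path, np.float32, (3, 4))
arr[0, 0] = 42.0   # writes into the mapped pages
m.flush()          # flushing has to go through the mmap, not the array

readback = np.fromfile(path, dtype=np.float32)
```

The awkward part is exactly what the paragraph above describes: the caller has to carry m around separately. Recent CPython versions, as far as I know, refuse to close an mmap while a buffer such as arr is still exported from it, which guards against the dangling-memory case.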

So the mmap() objects exist. They are simply ndarrays that keep track
of how the mmap() was done and provide flush() and close() methods;
they also (I hope!) make sure close() gets called when they get
garbage-collected. Note that the safety-scissors way to do this would
be to *not* provide a close() method, since a close() leaves the
object's data unusable, just waiting for an unwise attempt to index
into the object. It's probably better not to ever close() an mmap()
object.

What should happen when you take a slice of an mmap() object? (This
includes transposes and other non-copying ways to get at its
contents.) You get a fresh new ndarray object that does all the numpy
magic. But should it also do the mmap() magic? It doesn't need the
mmap() creation magic, since the mmap() already exists. flush() would
be sort of nice, since that's meaningful (though it might take a long
time, if it flushes the whole mmap). close() is just asking to shoot
yourself in the foot, since it not only invalidates the slice you took
but the whole mmap()!

It seems to me - remember, I don't use mmap or develop numpy, so give
this opinion the corresponding weight - that the Right Answer for
mmap() is to provide flush(), but not to provide close() except on
finalization (you can ensure finalization happens by deleting all
references to the array). Finally, if you take a slice of an mmap(), I
think you should get a simple ndarray. This ensures you don't have to
thread type-duplication code into everywhere that might make a slice.
But if you do make slices themselves mmap()s, providing flush() to
slices too, great. Just don't provide close(), and particularly
*don't* invoke it on finalization of slices, or things will die
horribly.
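
For illustration only, here is one way a user could get the "slices are simple ndarrays" behaviour today without changing numpy: take the slice and view it as the base class. The helper name plain_slice and the scratch file are hypothetical.

```python
import os
import tempfile

import numpy as np


def plain_slice(arr, index):
    """Return arr[index] as a base-class ndarray view (hypothetical helper).

    No data is copied; the result simply lacks the memmap-specific methods.
    `index` is assumed to be a slice (or tuple of slices), not a scalar index.
    """
    return arr[index].view(np.ndarray)


# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "slices.f32")
np.arange(100, dtype=np.float32).tofile(path)

mm = np.memmap(path, dtype=np.float32, mode="r+", shape=(100,))
s = plain_slice(mm, slice(10, 20))

s[:] = 0.0    # still writes into the mapped file
mm.flush()    # flushing stays the job of the object that owns the mapping

readback = np.fromfile(path, dtype=np.float32)
```

The slice s keeps the mapping alive through its base reference, so nothing is unmapped while it exists, and there is no close() on it to misuse.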

Anne