On 16/08/07, Glen W. Mabey wrote:
On Wed, Aug 15, 2007 at 08:50:28PM -0400, Anne Archibald wrote:
But to be pythonic, or numpythonic, when the original A is garbage-collected, the garbage collection should certainly close the mmap.
Humm, this would be less than ideal for my use case, when the data on disk is organized in a different dimensional order than I want to refer to it in my code. For example:
    p_data = numpy.memmap(datafilename, shape=(10, 1024, 20), dtype=numpy.float32, mode='r')
    u_data = p_data.transpose([2, 0, 1])
and I don't want to have to keep track of p_data because it's only u_data that I care about and want to use. And I promise, this is not a contrived example. I have data that I really do want to be ordered in a certain way on disk, for I/O efficiency reasons, yet when I logically index into it in my code, I want the dimensions to be in a different order.
Perfectly reasonable. Note that p_data cannot be collected until u_data goes away too, so the mmap is safe. And transpose()ing doesn't copy any data, so even if you get an ndarray, you haven't lost the ability to modify things on disk.
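As a rough illustration of that point, here is a minimal sketch. It assumes a reasonably recent numpy; the scratch file name and its contents are made up for the example, and it opens the file with mode="r+" (rather than the mode='r' above) so that a write through the transposed array can be demonstrated.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file standing in for the real data file.
path = os.path.join(tempfile.mkdtemp(), "p_data.bin")
shape = (10, 1024, 20)

# Fill the file with known values so there is something to map.
np.arange(np.prod(shape), dtype=np.float32).reshape(shape).tofile(path)

# Assumption: mode="r+" instead of mode="r", so that writes are allowed.
p_data = np.memmap(path, shape=shape, dtype=np.float32, mode="r+")
u_data = p_data.transpose([2, 0, 1])

# The transpose is a view onto the same mapped memory, not a copy.
shares = np.shares_memory(p_data, u_data)

# Dropping the original name is fine: u_data keeps the mapping alive.
del p_data

# u_data[i, j, k] corresponds to p_data[j, k, i], so this modifies
# the first element stored in the file.
u_data[0, 0, 0] = -1.0
u_data.flush()
```

After the flush, re-reading the first float32 of the file with numpy.fromfile should show the modified value, even though the only remaining name is the transposed array.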
Being able to apply flush() or whatever to slices is not necessarily unpythonic, but it's probably a lot simpler to reliably implement slices of mmap()s as simple slices of ordinary arrays.
I considered this approach, but what happens if you want to instantiate a slice that is very large, e.g., larger than the size of your physical RAM? In that case, you can't afford to make simple slices be ordinary arrays, besides the case where you want to change values. Making them functionally memmap-arrays, but without .sync() and .close() doesn't seem right either.
I was a bit ambiguous. An ordinary numpy array is an ndarray object, which contains some housekeeping data (dimension, shape, stride lengths, some flags, what have you) and a pointer to a hunk of memory. That hunk of memory can be pretty much any directly-addressable memory, for example a contiguous block of malloc()ed RAM, the beginning of a (possibly strided) subblock of an existing piece of malloc()ed RAM, a pointer to an array statically allocated in some C or Fortran library... or a piece of memory in an mmap()ed region. Numpy doesn't care at all about the difference. In fact this is the beauty of numpy: because all it cares about is where the elements start, what they look like, how many there are, and how far apart they are, it can cheaply create subarrays without copying any data.
So naively, one might implement mmap()ed arrays with a factory function that called mmap(), got back a pointer to the place in virtual memory where the file's contents appear to live, and whipped up a perfectly ordinary ndarray to point to the contents. It would work, thanks to the magic of the OS's mmap() call. The only problem is you would have to figure out when it was safe to close the mmap() (invalidating the array's memory!) and you would have no convenient way to flush() the mmap() out to disk.
So the mmap() objects exist. All they are is ndarrays that keep track of how the mmap() was done and provide flush() and close() methods; they also (I hope!) make sure close() gets called when they get garbage-collected. Note that the safety-scissors way to do this would be to *not* provide a close() method, since a close() leaves the object's data unusable, just waiting for an unwise attempt to index into the object. It's probably better not to ever close() an mmap() object.
What should happen when you take a slice of an mmap() object? (this includes transposes and other non-copying ways to get at its contents). You get a fresh new ndarray object that does all the numpy magic. But should it also do the mmap() magic? It doesn't need the mmap() creation magic, since the mmap() already exists. flush() would be sort of nice, since that's meaningful (though it might take a long time, if it flushes the whole mmap). close() is just asking to shoot yourself in the foot, since it not only invalidates the slice you took but the whole mmap()!
It seems to me - remember, I don't use mmap or develop numpy, so give this opinion the corresponding weight - that the Right Answer for mmap() is to provide flush(), but not to provide close() except on finalization (you can ensure finalization happens by deleting all references to the array). Finally, if you take a slice of an mmap(), I think you should get a simple ndarray. This ensures you don't have to thread type-duplication code into everywhere that might make a slice. But if you do make slices themselves mmap()s, providing flush() to slices too, great. Just don't provide close(), and particularly *don't* invoke it on finalization of slices, or things will die horribly.
Anne
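The naive factory-function approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how numpy.memmap is implemented: it uses the standard-library mmap module together with numpy.frombuffer, and the scratch file and its size are invented for the example.

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical scratch file holding 16 float32 zeros.
path = os.path.join(tempfile.mkdtemp(), "raw.bin")
np.zeros(16, dtype=np.float32).tofile(path)

# Map the whole file read/write; the mapping outlives the file object.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)

# A perfectly ordinary ndarray whose data pointer lands in the mapped region.
arr = np.frombuffer(mm, dtype=np.float32)

# Slices are plain ndarrays too, and still write through to the mapping.
view = arr[::2]
view[0] = 3.5

# The array itself has no flush(); the caller has to keep the mmap
# object around to push changes out to disk.
mm.flush()
```

The sketch shows both halves of the trade-off: the arrays need no special type, but flushing (and deciding when closing is safe) becomes the caller's job, because only the separate mmap object knows about the mapping.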