Hi,

We are developing a large project for genome analysis (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data structure for storage. The stored data are accessed in slices and used as the basis for calculations. As the stored data may be large (up to 24 GB), the memory footprint is important.

We experienced a problem with 64-bit addressing for the function concatenate (using the quite old numpy version 1.5.1rc1), and have thus updated numpy to 1.7.0.dev-651ef74, where that problem has been fixed. We have, however, run into another problem, connected to a change in memmap behaviour. This change seems to have come with the 1.6 release.

Before (1.5.1rc1):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.5.1rc1'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x...>
>>> a.sum()
40
>>> a.sum()._mmap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '_mmap'
After (1.6.2):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.6.2'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x...>
>>> a.sum()
memmap(40)
>>> a.sum()._mmap
<mmap.mmap object at 0x...>
The problem, then, is that calculations on memmap objects which produce scalar results previously returned a numpy scalar, with no reference to the memmap object. We could then simply keep the result and leave the memmap itself to be garbage-collected. Now that this is no longer possible, the memory usage of the system has increased dramatically.

So the question is twofold:

1) What is the reason behind this change? It makes sense to keep the reference to the mmap when slicing, but carrying the mmap along with a scalar value does not seem very useful. Is there a possibility of returning to the old behaviour?

2) If not, do you have any advice on how we can recover the old behaviour without rewriting the system? We could cast the results of all functions called on the memmap, but these calls are scattered throughout the system, and changing them all would probably cause much headache. We would rather implement a general solution, for instance by wrapping the memmap object somehow. Do you have any ideas?

Connected to this is the rather puzzling fact that the 'new' memmap scalar object has an __iter__ method, but no length. Should not the __iter__ method be removed, as its presence signals that the object is iterable?

Before (1.5.1rc1):
>>> a[0:2].__iter__()
<iterator object at 0x...>
>>> len(a[0:2])
2
>>> a.sum().__iter__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '__iter__'
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'numpy.int64' has no len()
After (1.6.2):
>>> a[0:2].__iter__()
<iterator object at 0x...>
>>> len(a[0:2])
2
>>> a.sum().__iter__
<method-wrapper '__iter__' of memmap object at 0x...>
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: len() of unsized object
>>> [x for x in a.sum()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iteration over a 0-d array
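A short sketch (not from the original message) of what sum() now returns, assuming the same `a` as in the transcript above: the result is a 0-d array, which inherits ndarray's __iter__ slot -- hence the attribute lookup succeeds -- but which rejects both len() and actual iteration:

>>> s = a.sum()
>>> type(s)
<class 'numpy.core.memmap.memmap'>
>>> s.ndim
0
>>> s.item()    # pulls out a true scalar, with no _mmap reference
40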
Regards,
Sveinung Gundersen
On Mon, Jul 2, 2012 at 3:53 PM, Sveinung Gundersen wrote:
> [snip]
>
> The problem, then, is that calculations on memmap objects which produce
> scalar results previously returned a numpy scalar, with no reference to
> the memmap object. We could then simply keep the result and leave the
> memmap itself to be garbage-collected. Now that this is no longer
> possible, the memory usage of the system has increased dramatically.
Your actual memory usage may not have increased as much as you think, since memmap objects don't necessarily take much memory -- it sounds like you're leaking virtual memory, but your resident set size shouldn't go up as much. That said, this is clearly a bug, and it's even worse than you mention -- *all* operations on memmap arrays are holding onto references to the original mmap object, regardless of whether they share any memory:
a = np.memmap("/etc/passwd", np.uint8, "r") # arithmetic (a + 10)._mmap is a._mmap True # fancy indexing (doesn't return a view!) a[[1, 2, 3]]._mmap is a._mmap True a.sum()._mmap is a._mmap True Really, only slicing should be returning a np.memmap object at all. Unfortunately, it is currently impossible to create an ndarray subclass that returns base-class ndarrays from any operations -- __array_finalize__() has no way to do this. And this is the third ndarray subclass in a row that I've looked at that wanted to be able to do this, so I guess maybe it's something we should implement...
Upstream, the proper fix is to change numpy.core.memmap:memmap.__array_finalize__ so that it only copies over the ._mmap attribute of its parent if np.may_share_memory(self, parent) is True. Patches gratefully accepted ;-)

In the short term, you have a few options for hacky workarounds. You could monkeypatch the above fix into the memmap class (a sketch follows below). You could manually assign None to the _mmap attribute of offending arrays (being careful only to do this to arrays where you know it is safe!). And for reduction operations like sum() in particular, what you have right now is not actually a scalar object -- it is a 0-dimensional array that holds a single scalar. You can pull this scalar out by calling .item() on the array, and then throw away the array itself -- the scalar won't have any _mmap attribute:

def scalarify(scalar_or_0d_array):
    if isinstance(scalar_or_0d_array, np.ndarray):
        return scalar_or_0d_array.item()
    else:
        return scalar_or_0d_array

# works on both numpy 1.5 and numpy 1.6:
total = scalarify(a.sum())

-N
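A minimal sketch of the monkeypatch route mentioned above (not from the original message; it assumes numpy 1.6.x, where memmap.__array_finalize__ unconditionally copies ._mmap from its parent):

import numpy as np

_original_finalize = np.memmap.__array_finalize__

def _patched_finalize(self, obj):
    # Run the stock finalizer first (it copies ._mmap from the parent),
    # then drop the reference whenever the new array cannot actually
    # share memory with that parent.
    _original_finalize(self, obj)
    if (getattr(self, '_mmap', None) is not None
            and isinstance(obj, np.ndarray)
            and not np.may_share_memory(self, obj)):
        self._mmap = None

np.memmap.__array_finalize__ = _patched_finalize

With this in place, slices still carry ._mmap as before, but results like a.sum() or a + 10 no longer pin the underlying (potentially 24 GB) mapping.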