Hi,

We are developing a large project for genome analysis (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data structure for storage. The stored data are accessed in slices and used as the basis for calculations. As the stored data may be large (up to 24 GB), the memory footprint is important.

We experienced a problem with 64-bit addressing for the function concatenate (using the quite old numpy version 1.5.1rc1), and have thus updated numpy to 1.7.0.dev-651ef74, where that problem has been fixed. We have, however, run into another problem, connected to a change in memmap behaviour. This change seems to have come with the 1.6 release.

Before (1.5.1rc1):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.5.1rc1'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x...>
>>> a.sum()
40
>>> a.sum()._mmap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '_mmap'
After (1.6.2):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.6.2'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x...>
>>> a.sum()
memmap(40)
>>> a.sum()._mmap
<mmap.mmap object at 0x...>
The problem, then, is that calculations on memmap objects which produce scalar results previously returned a numpy scalar, with no reference to the memmap object. We could then simply keep the result and leave the memmap itself to be garbage-collected. Now that this is no longer possible, the memory usage of the system has increased dramatically.

So the question is twofold:

1) What is the reason behind this change? It makes sense to keep the reference to the mmap when slicing, but carrying the mmap along with a scalar value does not seem very useful. Is there a possibility of returning to the old behaviour?

2) If not, do you have any advice on how we can recover the old behaviour without rewriting the system? We could cast the results of all functions called on the memmap, but these calls are scattered throughout the system, and changing them all would probably cause much headache. We would rather implement a general solution, for instance by wrapping the memmap object somehow. Do you have any ideas?

Connected to this is the rather puzzling fact that the 'new' memmap scalar object has an __iter__ method, but no length. Should not the __iter__ method be removed, as its presence signals that the object is iterable?

Before (1.5.1rc1):
>>> a[0:2].__iter__()
<iterator object at 0x...>
>>> len(a[0:2])
2
>>> a.sum().__iter__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '__iter__'
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'numpy.int64' has no len()
After (1.6.2):
>>> a[0:2].__iter__()
<iterator object at 0x...>
>>> len(a[0:2])
2
>>> a.sum().__iter__
<method-wrapper '__iter__' of memmap object at 0x...>
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: len() of unsized object
>>> [x for x in a.sum()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iteration over a 0-d array
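A short sketch (not from the original message) of what sum() now returns, assuming the same `a` as in the transcript above: the result is a 0-d array, which inherits ndarray's __iter__ slot -- hence the attribute lookup succeeds -- but which rejects both len() and actual iteration:

>>> s = a.sum()
>>> type(s)
<class 'numpy.core.memmap.memmap'>
>>> s.ndim
0
>>> s.item()    # pulls out a true scalar, with no _mmap reference
40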
Regards,
Sveinung Gundersen
On Mon, Jul 2, 2012 at 3:53 PM, Sveinung Gundersen wrote:
> [snip]
>
> The problem, then, is that calculations on memmap objects which produce
> scalar results previously returned a numpy scalar, with no reference to
> the memmap object. We could then simply keep the result and leave the
> memmap itself to be garbage-collected. Now that this is no longer
> possible, the memory usage of the system has increased dramatically.
Your actual memory usage may not have increased as much as you think, since memmap objects don't necessarily take much memory -- it sounds like you're leaking virtual memory, but your resident set size shouldn't go up as much. That said, this is clearly a bug, and it's even worse than you mention -- *all* operations on memmap arrays are holding onto references to the original mmap object, regardless of whether they share any memory:
a = np.memmap("/etc/passwd", np.uint8, "r") # arithmetic (a + 10)._mmap is a._mmap True # fancy indexing (doesn't return a view!) a[[1, 2, 3]]._mmap is a._mmap True a.sum()._mmap is a._mmap True Really, only slicing should be returning a np.memmap object at all. Unfortunately, it is currently impossible to create an ndarray subclass that returns base-class ndarrays from any operations -- __array_finalize__() has no way to do this. And this is the third ndarray subclass in a row that I've looked at that wanted to be able to do this, so I guess maybe it's something we should implement...
Upstream, the proper fix is to change numpy.core.memmap:memmap.__array_finalize__ so that it only copies over the ._mmap attribute of its parent if np.may_share_memory(self, parent) is True. Patches gratefully accepted ;-)

In the short term, you have a few options for hacky workarounds. You could monkeypatch the above fix into the memmap class (a sketch follows below). You could manually assign None to the _mmap attribute of offending arrays (being careful only to do this to arrays where you know it is safe!). And for reduction operations like sum() in particular, what you have right now is not actually a scalar object -- it is a 0-dimensional array that holds a single scalar. You can pull this scalar out by calling .item() on the array, and then throw away the array itself -- the scalar won't have any _mmap attribute:

def scalarify(scalar_or_0d_array):
    if isinstance(scalar_or_0d_array, np.ndarray):
        return scalar_or_0d_array.item()
    else:
        return scalar_or_0d_array

# works on both numpy 1.5 and numpy 1.6:
total = scalarify(a.sum())

-N
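A minimal sketch of the monkeypatch route mentioned above (not from the original message; it assumes numpy 1.6.x, where memmap.__array_finalize__ unconditionally copies ._mmap from its parent):

import numpy as np

_original_finalize = np.memmap.__array_finalize__

def _patched_finalize(self, obj):
    # Run the stock finalizer first (it copies ._mmap from the parent),
    # then drop the reference whenever the new array cannot actually
    # share memory with that parent.
    _original_finalize(self, obj)
    if (getattr(self, '_mmap', None) is not None
            and isinstance(obj, np.ndarray)
            and not np.may_share_memory(self, obj)):
        self._mmap = None

np.memmap.__array_finalize__ = _patched_finalize

With this in place, slices still carry ._mmap as before, but results like a.sum() or a + 10 no longer pin the underlying (potentially 24 GB) mapping.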