[Numpy-discussion] Change in memmap behaviour

Mon Jul 2 10:53:43 EDT 2012

Hi,

We are developing a large project for genome analysis (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data structure for storage. The stored data are accessed in slices and used as the basis for calculations. As the stored data may be large (up to 24 GB), the memory footprint is important.

We experienced a problem with 64-bit addressing in the function concatenate (using the rather old numpy version 1.5.1rc1), and have therefore updated numpy to 1.7.0.dev-651ef74, where that problem has been fixed. We have, however, run into another problem connected to a change in memmap behaviour. This change seems to have come with the 1.6 release.

Before (1.5.1rc1):

>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.5.1rc1'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x3c246f8>
>>> a.sum()
40
>>> a.sum()._mmap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '_mmap'

After (1.6.2):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.6.2'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x1b82ed50>
>>> a.sum()
memmap(40)
>>> a.sum()._mmap
<mmap.mmap object at 0x1b82ed50>

The problem is that calculations on memmap objects that produce scalar results previously returned a numpy scalar, with no reference to the memmap object. We could then just keep the result and leave the memmap to be garbage collected. Now a 0-d memmap is returned instead, which keeps a reference to the underlying mmap. The memory usage of the system has increased dramatically, as we no longer have this option.

So, the question is twofold:

1) What is the reason behind this change? It makes sense to keep the reference to the mmap when slicing, but to go from a scalar value to the mmap does not seem very useful. Is there a possibility to return to the old solution?
2) If not, do you have any advice on how we can retain the old behaviour without rewriting the system? We could cast the results of all functions on the memmap, but these calls are scattered throughout the system, so this would probably cause much headache. We would rather implement a general solution, for instance by wrapping the memmap object somehow. Do you have any ideas?
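
To make the casting option concrete, here is a minimal sketch of a helper that could be applied wherever results leave the memmap layer. It is an illustration, not something we have tested in our system: the name detach_scalar is hypothetical, and the sketch assumes that copying a 0-d result into a numpy scalar is enough to drop the reference to the underlying mmap.

```python
import numpy as np


def detach_scalar(result):
    """Return a plain numpy scalar for 0-d array results.

    Hypothetical helper: anything with one or more dimensions
    (e.g. a slice of a memmap) is passed through unchanged.
    """
    if isinstance(result, np.ndarray) and result.ndim == 0:
        # View as a base ndarray, then index with an empty tuple.
        # This copies the single value into a numpy scalar, which
        # holds no reference to the memmap it was computed from.
        return np.asarray(result)[()]
    return result
```

With this, detach_scalar(a.sum()) would give a numpy scalar on both 1.5.x and 1.6.x, while detach_scalar(a[0:2]) would still return the memmap slice. It centralises the cast in one place, but every call site would still have to be routed through it.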

Connected to this is the rather puzzling fact that the 'new' memmap scalar object has an __iter__ method, but no length. Should not the __iter__ method be removed, as this signals that the object is iterable?

Before (1.5.1rc1):
>>> a[0:2].__iter__()
<iterator object at 0x3c22b10>
>>> len(a[0:2])
2
>>> a.sum().__iter__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '__iter__'
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'numpy.int64' has no len()

After (1.6.2):
>>> a[0:2].__iter__()
<iterator object at 0x1b7befd0>
>>> len(a[0:2])
2
>>> a.sum().__iter__
<method-wrapper '__iter__' of memmap object at 0x1b7cab18>
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: len() of unsized object
>>> [x for x in a.sum()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iteration over a 0-d array
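
In the meantime, code that needs to know whether a numpy result can be iterated could test the number of dimensions instead of looking for __iter__. A minimal sketch, where the helper name is_sized_array is hypothetical and only numpy arrays and scalars are considered:

```python
import numpy as np


def is_sized_array(obj):
    """True if obj has at least one dimension, so len() and iteration work.

    Hypothetical helper intended for numpy results only: numpy scalars
    and 0-d arrays (including 0-d memmaps) both report zero dimensions.
    """
    return np.ndim(obj) > 0
```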

Regards,
Sveinung Gundersen
