New subject: Change in memmap behaviour

2 Jul 2012

      [snip]
...
Your actual memory usage may not have increased as much as you think,
since memmap objects don't necessarily take much memory -- it sounds
like you're leaking virtual memory, but your resident set size
shouldn't go up as much.
As I understand it, memmap objects retain the contents of the memmap in memory after it has been read the first time (in a lazy manner). Thus, when reading a slice of a 24GB file, only that part recides in memory. Our system reads a slice of a memmap, calculates something (say, the sum), and then deletes the memmap. It then loops through this for consequitive slices, retaining a low memory usage. Consider the following code:

import numpy as np
res = []
vecLen = 3095677412
for i in xrange(vecLen/10**8+1): 
	x = i * 10**8
	y = min((i+1) * 10**8, vecLen)
	res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

The memory usage of this code on a 24GB file (one value for each nucleotide in the human DNA!) is 23g resident memory after the loop is finished (not 24g for some reason..).

Running the same code on 1.5.1rc1 gives a resident memory of 23m after the loop.
...
That said, this is clearly a bug, and it's even worse than you mention
-- *all* operations on memmap arrays are holding onto references to
the original mmap object, regardless of whether they share any memory:
...
...
...
a = np.memmap("/etc/passwd", np.uint8, "r")
 # arithmetic
(a + 10)._mmap is a._mmap
 True
 # fancy indexing (doesn't return a view!)
a[[1, 2, 3]]._mmap is a._mmap
 True
a.sum()._mmap is a._mmap
 True
Really, only slicing should be returning a np.memmap object at all.
Unfortunately, it is currently impossible to create an ndarray
subclass that returns base-class ndarrays from any operations --
__array_finalize__() has no way to do this. And this is the third
ndarray subclass in a row that I've looked at that wanted to be able
to do this, so I guess maybe it's something we should implement...
In the short term, the numpy-upstream fix is to change
numpy.core.memmap:memmap.__array_finalize__ so that it only copies
over the ._mmap attribute of its parent if np.may_share_memory(self,
parent) is True. Patches gratefully accepted ;-)
Great! Any idea on whether such a patch may be included in 1.7?
...
In the short term, you have a few options for hacky workarounds. You
could monkeypatch the above fix into the memmap class. You could
manually assign None to the _mmap attribute of offending arrays (being
careful only to do this to arrays where you know it is safe!). And for
reduction operations like sum() in particular, what you have right now
is not actually a scalar object -- it is a 0-dimensional array that
holds a single scalar. You can pull this scalar out by calling .item()
on the array, and then throw away the array itself -- the scalar won't
have any _mmap attribute.
 def scalarify(scalar_or_0d_array):
   if isinstance(scalar_or_0d_array, np.ndarray):
     return scalar_or_0d_array.item()
   else:
     return scalar_or_0d_array
 # works on both numpy 1.5 and numpy 1.6:
 total = scalarify(a.sum())
Thank you for this! However, such a solution would have to be scattered throughout the code (probably over 100 places), and I would rather not do that. I guess the abovementioned patch would be the best solution. I do not have experience in the numpy core code, so I am also eagerly awaiting such a patch!

Sveinung

--
Sveinung Gundersen
PhD Student, Bioinformatics, Dept. of Tumor Biology, Inst. for Cancer Research, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
E-mail: sveinung.gundersen@medisin.uio.no, Phone: +47 93 00 94 54

Re: [Numpy-discussion] Change in memmap behaviour

Sveinung Gundersen

Nathaniel Smith

Sveinung Gundersen

Thouis (Ray) Jones

Nathaniel Smith

Nathaniel Smith

tags

participants (3)