[Numpy-discussion] Change in memmap behaviour

Mon Jul 2 16:40:10 EDT 2012

On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen <sveinugu at gmail.com> wrote:
> [snip]
>
>
>
> Your actual memory usage may not have increased as much as you think,
> since memmap objects don't necessarily take much memory -- it sounds
> like you're leaking virtual memory, but your resident set size
> shouldn't go up as much.
>
>
> As I understand it, memmap objects retain the contents of the memmap in
> memory after it has been read the first time (in a lazy manner). Thus, when
> reading a slice of a 24GB file, only that part resides in memory. Our system
> reads a slice of a memmap, calculates something (say, the sum), and then
> deletes the memmap. It then loops through this for consecutive slices,
> retaining a low memory usage. Consider the following code:
>
> import numpy as np
> res = []
> vecLen = 3095677412
> for i in xrange(vecLen/10**8+1):
>     x = i * 10**8
>     y = min((i+1) * 10**8, vecLen)
>     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())
>
> The memory usage of this code on a 24GB file (one value for each nucleotide
> in the human DNA!) is 23g resident memory after the loop is finished (not
> 24g for some reason..).
>
> Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
> loop.

Your memory measurement tools are misleading you. The same memory is
resident in both cases, just in one case your tools say it is
operating system disk cache (and not attributed to your app), and in
the other case that same memory, treated in the same way by the OS, is
shown as part of your app's resident memory. Virtual memory is
confusing...
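
For the loop quoted above, here is a rough sketch of one way to keep the
per-chunk results from holding on to the mapping. It assumes, as discussed
in this thread, that results of operations on a memmap can keep a reference
to the underlying mmap; converting each result to a plain Python float
sidesteps that. The function name chunked_sums and the demo file are made
up for illustration, and the code is written for Python 3.

```python
import os
import tempfile

import numpy as np


def chunked_sums(path, chunk_len, dtype="float64"):
    """Return per-chunk sums of a flat binary file as plain Python floats."""
    data = np.memmap(path, dtype=dtype, mode="r")  # map the file once, read-only
    sums = []
    for start in range(0, data.shape[0], chunk_len):
        chunk = data[start:start + chunk_len]  # a view; the end index is clipped
        # float() copies the value out, so `sums` never holds an array object
        # that could keep a reference to the mapping.
        sums.append(float(chunk.sum()))
    return sums


# Tiny demonstration on a throwaway file (10 float64 values: 0.0 .. 9.0).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.float64")
    np.arange(10, dtype="float64").tofile(demo_path)
    demo = chunked_sums(demo_path, chunk_len=4)
    print(demo)  # [6.0, 22.0, 17.0]
```

Mapping the file once and slicing it repeatedly does the same work as
re-creating the memmap per iteration, but there is only one mapping to
release when the function returns.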

> That said, this is clearly a bug, and it's even worse than you mention
> -- *all* operations on memmap arrays are holding onto references to
> the original mmap object, regardless of whether they share any memory:
>
>   a = np.memmap("/etc/passwd", np.uint8, "r")
>
>   # arithmetic
>   (a + 10)._mmap is a._mmap        # True
>
>   # fancy indexing (doesn't return a view!)
>   a[[1, 2, 3]]._mmap is a._mmap    # True
>
>   # reduction
>   a.sum()._mmap is a._mmap         # True
>
> Really, only slicing should be returning a np.memmap object at all.
> Unfortunately, it is currently impossible to create an ndarray
> subclass that returns base-class ndarrays from any operations --
> __array_finalize__() has no way to do this. And this is the third
> ndarray subclass in a row that I've looked at that wanted to be able
> to do this, so I guess maybe it's something we should implement...
>
> In the short term, the numpy-upstream fix is to change
> numpy.core.memmap:memmap.__array_finalize__ so that it only copies
> over the ._mmap attribute of its parent if np.may_share_memory(self,
> parent) is True. Patches gratefully accepted ;-)
>
>
> Great! Any idea on whether such a patch may be included in 1.7?

Not really. If I or you or someone else gets inspired to take the time
to write a patch soon, then it will be; otherwise not...
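
For anyone inclined to try, here is a minimal sketch of the short-term fix
described above, shown on a toy ndarray subclass rather than on np.memmap
itself. TrackedArray and its `owner` argument are hypothetical stand-ins
(owner plays the role of the mmap object); the only real NumPy pieces are
the __array_finalize__ hook and np.may_share_memory. It is an illustration
of the idea, not the actual patch.

```python
import numpy as np


class TrackedArray(np.ndarray):
    """Toy subclass carrying a `_mmap`-style reference to a buffer owner."""

    def __new__(cls, owner, data):
        obj = np.asarray(data).view(cls)  # view-cast, no copy of the data
        obj._mmap = owner                 # hypothetical stand-in for the mmap
        return obj

    def __array_finalize__(self, parent):
        parent_mmap = getattr(parent, "_mmap", None)
        # Inherit the reference only if the new array may actually share
        # memory with its parent (e.g. a slice); otherwise drop it.
        if parent_mmap is not None and np.may_share_memory(self, parent):
            self._mmap = parent_mmap
        else:
            self._mmap = None


owner = object()
a = TrackedArray(owner, np.arange(10, dtype=np.uint8))

sliced = a[2:5]        # view onto a's buffer
added = a + 10         # freshly allocated result
fancy = a[[1, 2, 3]]   # fancy indexing copies

print(sliced._mmap is owner)  # True
print(added._mmap is None)    # True
print(fancy._mmap is None)    # True
```

Note that np.may_share_memory only compares memory bounds, so it can
report True for arrays that do not really overlap. For this purpose that
errs on the safe side: a reference is kept a little too often, never
dropped while the buffer is still in use.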

-N