Memory leak/fragmentation when using np.memmap

Hello,

I need to process several large (~40 GB) files. np.memmap seems ideal for this, but I have run into a problem that looks like a memory leak or memory fragmentation. The following code illustrates the problem:

    import numpy as np

    x = np.memmap('mybigfile.bin', mode='r', dtype='uint8')
    print x.shape  # prints (42940071360,) in my case
    ndat = x.shape[0]
    for k in range(1000):
        # The astype ensures that the data is read in from disk
        y = x[k*ndat/1000:(k+1)*ndat/1000].astype('float32')
        del y

One would expect such a program to have a roughly constant memory footprint, but in fact 'top' shows that the RES memory continually increases. I can see that the memory usage is real because the OS eventually starts to swap to disk. The memory usage does not seem to correspond with the total size of the file.

Has anyone seen this behavior? Is there a solution? I found this article: http://pushingtheweb.com/2010/06/python-and-tcmalloc/ which sounds similar, but it seems that the ~40 MB chunks I am loading would be using mmap anyway, so they could be returned to the OS.

I am using nearly the latest version of numpy from the git repository (np.__version__ returns 2.0.0.dev-Unknown), Python 2.7.1, and RHEL 5 on x86_64.

I appreciate any suggestions.

Thanks,
Glenn
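[Editorial note, not part of the original message: if only sequential chunked access is needed, one possible workaround is to skip the mapping entirely and read each chunk with np.fromfile at an explicit offset, so that only the current chunk is ever resident in the process. The file name and chunk count below simply mirror the example above; this is a sketch, not a fix proposed in the thread.]

    import numpy as np

    filename = 'mybigfile.bin'   # same example file as above
    nchunks = 1000

    with open(filename, 'rb') as f:
        f.seek(0, 2)             # seek to the end to find the file size
        ndat = f.tell()
        chunk = ndat // nchunks  # any remainder at the end is ignored here
        f.seek(0)
        for k in range(nchunks):
            # Read one ~40 MB chunk into an ordinary array and convert it;
            # the buffer is freed (or reused) once 'y' is deleted.
            y = np.fromfile(f, dtype='uint8', count=chunk).astype('float32')
            del y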

On Wed, 18 May 2011 15:09:31 -0700, G Jones wrote: [clip]
Your OS probably likes to keep the pages touched in memory and in swap, rather than dropping them. This happens at least on Linux. You can check that an equivalent simple C program displays the same behavior (use with file "data" with enough bytes):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        unsigned long size = 2000000000;
        unsigned long i;
        char *p;
        int fd;
        char sum;

        fd = open("data", O_RDONLY);
        p = (char*)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

        sum = 0;
        for (i = 0; i < size; ++i) {
            sum += *(p + i);
        }

        munmap(p, size);
        close(fd);
        return 0;
    }
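[Editorial note, not part of the original message: if the only concern is that touched pages stay resident, the kernel can be hinted to drop them explicitly. The sketch below calls madvise(MADV_DONTNEED) on the mapped region through ctypes; it assumes Linux (the MADV_DONTNEED value of 4 is Linux-specific) and a memmap that starts at offset 0, so its address is page-aligned.]

    import ctypes
    import ctypes.util
    import numpy as np

    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
    MADV_DONTNEED = 4   # value on Linux; not portable

    x = np.memmap('mybigfile.bin', mode='r', dtype='uint8')

    # Touch a slice so that its pages become resident.
    y = x[:40 * 1024 * 1024].astype('float32')
    del y

    # Tell the kernel the mapped pages are no longer needed; for a
    # read-only file mapping they are dropped from RAM and re-read
    # from disk on the next access.
    ret = libc.madvise(ctypes.c_void_p(x.ctypes.data),
                       ctypes.c_size_t(x.nbytes),
                       MADV_DONTNEED)
    if ret != 0:
        raise OSError(ctypes.get_errno(), 'madvise failed')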

Hello,

I have seen the effect you describe, and I had originally assumed this was the case, but in fact there seems to be more to the problem. If it were only the effect you mention, there should not be any memory error, because the OS would drop the pages when the memory is actually needed for something else. At least I would hope so; if not, this seems like a huge problem for Linux.

As a follow-up, I managed to install tcmalloc as described in the article I mentioned. Running the example I sent now shows a constant memory footprint, as expected. I am surprised such a solution was necessary. Certainly others must work with such large datasets using numpy/python?

Thanks,
Glenn

On Wed, May 18, 2011 at 4:21 PM, Pauli Virtanen <pav@iki.fi> wrote:

On Wed, 18 May 2011 16:36:31 -0700, G Jones wrote: [clip]
Well, your example Python code works for me without any other changes, and it shows behavior identical to the C code. Things might depend on the version of the C library and the kernel, so it is quite possible that many do not see these issues.

--
Pauli Virtanen