How to limit numpy.memmap's RAM usage?

Hi everyone,

I noticed that numpy.memmap uses RAM to buffer data from the memmapped file. If I have a 100 GB array in a memmap file and process it block by block, the RAM usage keeps increasing as the process runs until there is no available space in RAM (4 GB), even though the block size is only 1 MB. For example:

####
import numpy as npy
a = npy.memmap('a.bin', dtype='float64', mode='r')
blocklen = 100000
b = npy.zeros(len(a) // blocklen)
for i in range(len(a) // blocklen):
    b[i] = npy.mean(a[i * blocklen:(i + 1) * blocklen])
####

Is there any way to restrict the memory usage of numpy.memmap?

LittleBigBrain

2010/10/23 braingateway <braingateway@gmail.com>:
> Is there any way to restrict the memory usage of numpy.memmap?
The whole point of using memmap is to let the OS do the buffering for you (and it is likely to do a better job than you in many cases). Which OS are you using? And how do you measure how much memory is taken by numpy for your array?

David

David Cournapeau:
> Which OS are you using? And how do you measure how much memory is taken by numpy for your array?
Hi David,

I agree with you about the point of using memmap; that is why this behavior is so strange to me. I actually measured the size of the resident set (the pink trace in figure 2) of the Python process on Windows. I attached the result: you can see the RAM usage is definitely not file system cache.

LittleBigBrain
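A minimal sketch of one way to log the resident and virtual sizes from inside the processing loop, assuming the third-party psutil package and the same hypothetical a.bin file (illustration only, not the measurement used for the attached figure):

####
import numpy as npy
import psutil

proc = psutil.Process()                       # the running Python process
a = npy.memmap('a.bin', dtype='float64', mode='r')
blocklen = 100000
nblocks = len(a) // blocklen
b = npy.zeros(nblocks)
for i in range(nblocks):
    b[i] = npy.mean(a[i * blocklen:(i + 1) * blocklen])
    if i % 1000 == 0:
        mem = proc.memory_info()              # rss = resident set, vms = address space
        print('block %d: rss=%.1f MB vms=%.1f MB' % (i, mem.rss / 1e6, mem.vms / 1e6))
####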

On Sat, Oct 23, 2010 at 9:44 AM, braingateway <braingateway@gmail.com> wrote:
> I agree with you about the point of using memmap; that is why this behavior is so strange to me. I actually measured the size of the resident set (the pink trace in figure 2) of the Python process on Windows. You can see the RAM usage is definitely not file system cache.
Umm, a good operating system will use *all* of RAM for buffering, because RAM is fast and it assumes you are likely to reuse data you have already used once. If it needs some memory for something else, it just writes a page to disk (if dirty), reads in the new data from disk, and changes the address of the page. Where you get into trouble is if pages can't be evicted for some reason. Most modern OSes also have special options for reading streaming data from disk that can lead to significantly faster access for that sort of thing, but I don't think you can do that with memmapped files.

I'm not sure how Windows labels its memory. IIRC, memmapping a file leads to what is called file-backed memory; it is essentially virtual memory. Now, I won't bet my life that there isn't a problem, but I think a misunderstanding of the memory information is more likely.

Chuck
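As an aside, the streaming-read hint mentioned above can be passed from Python on some setups by mapping the file with the mmap module and wrapping the buffer in an array, rather than going through numpy.memmap. A minimal sketch, assuming Python 3.8+ on a POSIX system and the same hypothetical a.bin:

####
import mmap
import numpy as npy

f = open('a.bin', 'rb')
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
f.close()                                 # the mapping keeps its own handle

# Hint that access will be sequential, so the kernel can read ahead
# aggressively and free pages soon after they have been read.
mm.madvise(mmap.MADV_SEQUENTIAL)

a = npy.frombuffer(mm, dtype='float64')   # read-only view over the mapping
blocklen = 100000
nblocks = len(a) // blocklen
b = npy.zeros(nblocks)
for i in range(nblocks):
    b[i] = npy.mean(a[i * blocklen:(i + 1) * blocklen])
####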

On Sat, Oct 23, 2010 at 10:15 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
> Now, I won't bet my life that there isn't a problem, but I think a misunderstanding of the memory information is more likely.
It is also possible that something else in your program is hanging onto memory, but without knowing a lot more it is hard to tell. Are you seeing symptoms besides the memory graphs? It looks like you aren't running on Windows, actually, so what OS are you running on?

Chuck

Charles R Harris:
> Are you seeing symptoms besides the memory graphs? It looks like you aren't running on Windows, actually, so what OS are you running on?
Hi Chuck,

Thanks a lot for the quick response. I ran the following super simple script on Windows:

####
import numpy as npy
a = npy.memmap('a.bin', dtype='float64', mode='r')
blocklen = 100000
b = npy.zeros(len(a) // blocklen)
for i in range(len(a) // blocklen):
    b[i] = npy.mean(a[i * blocklen:(i + 1) * blocklen])
####

Everything became super slow after Python ate all the RAM. By the way, I also tried Qt's QFile::map(); there was no problem at all...

LittleBigBrain

On Sat, Oct 23, 2010 at 10:27 AM, braingateway <braingateway@gmail.com> wrote:
> Everything became super slow after Python ate all the RAM. By the way, I also tried Qt's QFile::map(); there was no problem at all...
Hmm, nothing looks suspicious. For reference, can you be specific about the OS version, Python version, and numpy version? What happens if you simply do

    for i in range(len(a) // blocklen):
        a[i * blocklen:(i + 1) * blocklen].copy()

Chuck

Charles R Harris:
> Hmm, nothing looks suspicious. For reference, can you be specific about the OS version, Python version, and numpy version?
Hi Chuck,

Here are the versions:

    >>> print sys.version
    2.6.5 (r265:79096, Mar 19 2010, 18:02:59) [MSC v.1500 64 bit (AMD64)]
    >>> print numpy.__version__
    1.4.1
    >>> print sys.getwindowsversion()
    (5, 2, 3790, 2, 'Service Pack 2')

Besides, a[i*blocklen:(i+1)*blocklen].copy() gave the same result.

LittleBigBrain

On Sun, Oct 24, 2010 at 12:44 AM, braingateway <braingateway@gmail.com> wrote:
> I agree with you about the point of using memmap. That is why this behavior is so strange to me.
I think it is expected. What kind of behavior were you expecting? To be clear, if I have a lot of available RAM, I expect memmap arrays to take almost all of it (virtual memory ~ resident memory). Now, if at the same time another process starts taking a lot of memory, I expect the OS to automatically lower the resident memory of the process using memmap.

I did a small experiment on Mac OS X, creating a giant mmap'd array in numpy and at the same time running a small C program using mlock (to lock pages into physical memory). As soon as I lock a big area (where big means most of my physical RAM), the Python process dealing with the mmap'd area sees its resident memory decrease. As soon as I kill the C program locking the memory, the resident memory starts increasing again.

cheers,

David
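A rough Python stand-in for the small locking program described above (the original was a C program; this sketch assumes Linux/glibc and a memlock limit, ulimit -l, large enough for the requested size):

####
import ctypes
import mmap
import time

size = 2 * 1024 ** 3                      # lock ~2 GB; pick "most of physical RAM"

libc = ctypes.CDLL('libc.so.6', use_errno=True)
buf = mmap.mmap(-1, size)                 # anonymous mapping, not file-backed
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

# mlock() faults the pages in and pins them so the OS cannot evict them,
# which forces other processes (e.g. the one using the memmapped array)
# to give up resident memory.
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(size)) != 0:
    raise OSError(ctypes.get_errno(), 'mlock failed (check ulimit -l)')

print('locked %d bytes; watch the memmap process resident set shrink' % size)
time.sleep(600)                           # hold the lock; killing this process releases it
####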

Hi List,

I had similar problems on Windows. I tried to use memmaps to buffer a large amount of data and process it in chunks, but I found that whenever I did this, I always ended up filling RAM completely, which led to crashes of my Python script with a MemoryError. This led me to consider, actually on advice from this list, the h5py module, which has a nice numpy interface to the HDF5 file format. With h5py it seemed clearer to me what was being buffered on disk and what was stored in RAM.

Cheers,
Simon
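A minimal sketch of that chunked h5py approach, assuming a hypothetical file data.h5 containing a one-dimensional float64 dataset named 'signal' (both names are illustrative only):

####
import h5py
import numpy as npy

blocklen = 100000
with h5py.File('data.h5', 'r') as f:
    dset = f['signal']                    # stays on disk; nothing is read yet
    nblocks = len(dset) // blocklen
    b = npy.zeros(nblocks)
    for i in range(nblocks):
        # Slicing an h5py dataset reads only the requested block into RAM.
        b[i] = npy.mean(dset[i * blocklen:(i + 1) * blocklen])
####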
participants (4)

- braingateway
- Charles R Harris
- David Cournapeau
- Simon Lyngby Kokkendorff