
On Thursday 11 March 2010 10:36:42, Gael Varoquaux wrote:
On Thu, Mar 11, 2010 at 10:04:36AM +0100, Francesc Alted wrote:
As far as I know, memmap files (or rather, the underlying OS) *use* all available RAM for loading data until RAM is exhausted, and only then start to use SWAP, so the "memory pressure" is still there. But I may be wrong...
I believe that your above assertion is 'half' right. First, I think that it is not SWAP that the memmapped file uses, but the original disk space, so you avoid running out of SWAP. Second, if you open the same data several times without memmapping, I believe it will be duplicated in memory. On the other hand, when you memmap it, it is not duplicated, so if you are running several processing jobs on the same data, you save memory. I am very much in this case.
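
A minimal sketch of the sharing behaviour described above (illustrative only; the file name 'data.bin' and the array size are made up): read-only numpy.memmap views of the same file are all backed by the same pages in the OS page cache, so the data is not duplicated per mapping or per process.

# Illustrative sketch, not from the original thread; file name and size are made up.
import numpy as np

N = 10 * 1000 * 1000                               # ~80 MB of float64
np.arange(N, dtype=np.float64).tofile('data.bin')  # scratch data file

# Two read-only mappings of the same file (they could just as well live
# in two different processes).
m1 = np.memmap('data.bin', dtype=np.float64, mode='r', shape=(N,))
m2 = np.memmap('data.bin', dtype=np.float64, mode='r', shape=(N,))

# Nothing is copied into the process until pages are touched, and the
# touched pages live in the shared page cache, backed by the file on
# disk rather than by swap.
print(m1[0], m2[N - 1])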
Mmh, this is not my experience. During the past month, in a course I was asking the students to compare the memory consumption of numpy.memmap and tables.Expr (a module for performing out-of-memory computations in PyTables). The idea was precisely to show that, contrary to tables.Expr, numpy.memmap computations do take a lot of memory while the data is being accessed. I'm attaching a slightly modified version of that exercise; in it, one has to evaluate a polynomial over a certain range.

Here is the output of the script for the numpy.memmap case on a machine with 8 GB of RAM and 6 GB of swap:

Total size for datasets: 7629.4 MB
Populating x using numpy.memmap with 500000000 points...
Total file sizes: 4000000000 -- (3814.7 MB)
*** Time elapsed populating: 70.982
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using numpy.memmap
Total file sizes: 8000000000 -- (7629.4 MB)
**** Time elapsed computing: 81.727
10.08user 13.37system 2:33.26elapsed 15%CPU (0avgtext+0avgdata 0maxresident)k
7808inputs+15625008outputs (39major+5750196minor)pagefaults 0swaps

While the computation was going on, I watched the process with the top utility, which told me that the total virtual size consumed by the Python process was 7.9 GB, with *resident* memory of 6.7 GB (!). And this cannot just be a top malfunction, because I checked that, by the end of the computation, my machine had started to swap some processes out (i.e. the working set above was too large for the OS to keep everything in memory).

Now, just for the sake of comparison, I tried running the same script using tables.Expr. Here is the output:

Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 4000631280 -- (3815.3 MB)
*** Time elapsed populating: 78.817
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 8001261168 -- (7630.6 MB)
**** Time elapsed computing: 155.836
13.11user 18.59system 3:58.61elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
7842784inputs+15632208outputs (28major+940347minor)pagefaults 0swaps

and top told me that memory consumption was 148 MB of total virtual size and just 44 MB resident (as expected, because the computation really is done with an out-of-core algorithm).

Interestingly, when using compression (Blosc level 4, in this case), the time to do the computation with tables.Expr is reduced a lot:

Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 1080130765 -- (1030.1 MB)
*** Time elapsed populating: 30.005
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 2415761895 -- (2303.9 MB)
**** Time elapsed computing: 40.048
37.11user 6.98system 1:12.88elapsed 60%CPU (0avgtext+0avgdata 0maxresident)k
45312inputs+4720568outputs (4major+989323minor)pagefaults 0swaps

while memory consumption stays roughly the same as above: 148 MB virtual / 45 MB resident.

So, in my experience, numpy.memmap really is using that large chunk of memory (unless my testbed is badly programmed, in which case I'd be grateful if you could point out what's wrong).

-- 
Francesc Alted
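
For reference, a minimal sketch of the kind of comparison the exercise above makes (this is not the attached script; the file names 'x.bin', 'r.bin' and 'poly.h5', the number of points N and the chunk size are made up for illustration). Both variants evaluate the polynomial ((.25*x + .75)*x - 1.5)*x - 2.

# Illustrative sketch only; parameters and file names are invented.
import numpy as np
import tables as tb

N = 100 * 1000 * 1000        # number of points (the run above used 500000000)
CHUNK = 1000 * 1000          # chunk size for filling / out-of-core work

# --- numpy.memmap variant ---------------------------------------------
x = np.memmap('x.bin', dtype=np.float64, mode='w+', shape=(N,))
for i in range(0, N, CHUNK):
    x[i:i + CHUNK] = np.arange(i, min(i + CHUNK, N), dtype=np.float64)
x.flush()

r = np.memmap('r.bin', dtype=np.float64, mode='w+', shape=(N,))
# Each arithmetic step below materializes an in-memory temporary the size
# of x, and the touched memmap pages accumulate in the page cache, which
# is where the memory pressure shows up.
r[:] = ((.25 * x + .75) * x - 1.5) * x - 2
r.flush()

# --- tables.Expr variant (out of core) --------------------------------
f = tb.open_file('poly.h5', mode='w')
# For the compressed run, pass filters=tb.Filters(complevel=4, complib='blosc')
# to create_carray.
xt = f.create_carray(f.root, 'x', tb.Float64Atom(), shape=(N,))
for i in range(0, N, CHUNK):
    xt[i:i + CHUNK] = np.arange(i, min(i + CHUNK, N), dtype=np.float64)
rt = f.create_carray(f.root, 'r', tb.Float64Atom(), shape=(N,))

expr = tb.Expr('((.25*x + .75)*x - 1.5)*x - 2', uservars={'x': xt})
expr.set_output(rt)
expr.eval()                  # evaluated blockwise; only small buffers in RAM
f.close()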