
On Thu, Mar 11, 2010 at 02:26:49PM +0100, Francesc Alted wrote:
I believe that your above assertion is 'half' right. First, I think that it is not SWAP that the memmapped file uses, but the original disk space, so you avoid running out of SWAP. Second, if you open the same data several times without memmapping, I believe it will be duplicated in memory. On the other hand, when you memmap, it is not duplicated, so if you are running several processing jobs on the same data, you save memory. I am very much in this case.
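A minimal sketch of the kind of setup meant here (the file name, dtype, helper function and number of workers are all made up for the example): each worker opens the same file with numpy.memmap in read-only mode, so the pages it touches are served from the OS page cache and shared between the processes instead of being copied once per worker.

import numpy as np
from multiprocessing import Pool

FNAME = 'data.bin'  # hypothetical file of float64 values written beforehand

def partial_sum(bounds):
    start, stop = bounds
    # mode='r' maps the file read-only: the pages read here come from the
    # kernel page cache, which is shared between processes, so the data is
    # not duplicated once per worker.
    data = np.memmap(FNAME, dtype=np.float64, mode='r')
    return data[start:stop].sum()

if __name__ == '__main__':
    n = np.memmap(FNAME, dtype=np.float64, mode='r').shape[0]
    bounds = [(i * n // 4, (i + 1) * n // 4) for i in range(4)]
    pool = Pool(4)
    total = sum(pool.map(partial_sum, bounds))
    pool.close()
    pool.join()

Without the memmap (say, np.fromfile in each worker), every process would hold its own full copy of the array in private memory, which is presumably what makes the non-memmapped runs blow up.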
Mmh, this is not my experience. During the past month, in a course I was teaching, I asked the students to compare the memory consumption of numpy.memmap and tables.Expr (a module for performing out-of-core computations in PyTables).
[snip]
So, in my experience, numpy.memmap really does use that large chunk of memory (unless my testbed is badly programmed, in which case I'd be grateful if you could point out what's wrong).
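The testbed itself is snipped above, but the effect is roughly what you get from something like the following (the file name and the expression are placeholders, not the actual benchmark): evaluating a whole-array expression on the memmap reads every page of the file and allocates full-size in-memory temporaries for the intermediate results, so the resident memory of the process ends up on the order of the data size or larger.

import numpy as np

FNAME = 'expr_input.bin'  # hypothetical file of float64 values on disk
a = np.memmap(FNAME, dtype=np.float64, mode='r')

# Every term below touches all the pages of the memmap and produces a
# full-size in-memory temporary, so the resident size of the process
# grows to several times the size of the file on disk.
result = 0.5 * a ** 3 + 2.0 * a ** 2 + 3.0 * a

tables.Expr (like numexpr, on which it builds) evaluates the same kind of expression block by block, which is why its memory footprint stays small in the comparison.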
OK, so what you are saying is that my assertion #1 was wrong. Fair enough; as I was writing it, I was thinking that I had no hard facts to back it up. How about assertion #2? This 'story' is the only one I can think of to explain why parallel computations that run fine with memmap blow up when I don't use it. Also, could it be that the memmap mode changes things? I only use the 'r' mode, which is read-only.

This is all very interesting, and you have much more insight into these problems than I do. Would you be interested in coming to EuroSciPy in Paris to give a one- or two-hour tutorial on memory and I/O problems and how you address them with PyTables? It would be absolutely thrilling. I must warn you, though, that I am afraid we won't be able to pay for your trip, as I want to keep the price of the conference low.

Best,

Gaël