[Numpy-discussion] Possible roadmap addendum: building better text file readers

Ralf Gommers ralf.gommers at googlemail.com
Wed Feb 29 14:39:18 EST 2012


On Wed, Feb 29, 2012 at 7:57 PM, Erin Sheldon <erin.sheldon at gmail.com> wrote:

> Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> > On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon
> > <erin.sheldon at gmail.com> wrote:
> > > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500
> > > 2012:
> > >> > Even for binary, there are pathological cases, e.g. 1) reading a
> > >> > random subset of nearly all rows.  2) reading a single column when
> > >> > rows are small.  In case 2 you will only go this route in the
> > >> > first place if you need to save memory.  The user should be aware
> > >> > of these issues.
> > >>
> > >> FWIW, this route actually doesn't save any memory as compared to
> > >> np.memmap.
> > >
> > > Actually, for numpy.memmap you will read the whole file if you try to
> > > grab a single column and read a large fraction of the rows.  Here is an
> > > example that will end up pulling the entire file into memory
> > >
> > >    import numpy
> > >
> > >    # fname and dtype describe the binary file in question
> > >    mm = numpy.memmap(fname, dtype=dtype)
> > >    rows = numpy.arange(mm.size)  # nearly all rows
> > >    x = mm['x'][rows]             # fancy-index a single column
> > >
> > > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > > usage.  I believe this is because numpy.memmap only understands rows.
> > > I don't fully understand the reason for that, but I suspect it is
> > > related to the fact that the ndarray really only has a concept of
> > > itemsize, and the fields are really just a reinterpretation of those
> > > bytes.  It may be that one could tweak the ndarray code to get around
> > > this.  But I would appreciate enlightenment on this subject.
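> > >
> > > You can see the "reinterpretation" in the strides of a field view; a
> > > quick sketch (the dtype and filename here are made up):
> > >
> > >    import numpy
> > >
> > >    dt = numpy.dtype([('x', 'f8'), ('y', 'f8')])  # 16-byte records
> > >    mm = numpy.memmap('data.bin', dtype=dt)       # hypothetical file
> > >    x = mm['x']
> > >    print(x.strides)  # (16,) -- the stride is the full record size,
> > >                      # so reading all of 'x' touches every page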
> >
> > Ahh, that makes sense. But, the tool you are using to measure memory
> > usage is misleading you -- you haven't mentioned what platform you're
> > on, but AFAICT none of them have very good tools for describing memory
> > usage when mmap is in use. (There isn't a very good way to handle it.)
> >
> > What's happening is this: numpy read out just that column from the
> > mmap'ed memory region. The OS saw this and decided to read the entire
> > file, for reasons discussed previously. Then, since it had read the
> > entire file, it decided to keep it around in memory for now, just in
> > case some program wanted it again in the near future.
> >
> > Now, if you instead fetched just those bytes from the file using
> > seek+read or whatever, the OS would treat that request in the exact
> > same way: it'd still read the entire file, and it would still keep the
> > whole thing around in memory. On Linux, you could test this by
> > dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> > memory is listed as "free" in top, and then using your code to read
> > the same file -- you'll see that the 'free' memory drops by 3
> > gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
> > gigabytes.
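> >
> > A rough sketch of that experiment in Python (Linux-only; the filename
> > and dtype below are made up, and dropping the caches needs root):
> >
> >    import os
> >    import numpy
> >
> >    def cached_kb():
> >        # page cache size, from the "Cached:" line of /proc/meminfo
> >        with open('/proc/meminfo') as f:
> >            for line in f:
> >                if line.startswith('Cached:'):
> >                    return int(line.split()[1])
> >
> >    # first, as root:  echo 1 > /proc/sys/vm/drop_caches
> >    fname = 'data.bin'  # hypothetical 3G file of 16-byte records
> >    dt = numpy.dtype([('x', 'f8'), ('y', 'f8')])
> >    before = cached_kb()
> >    with open(fname, 'rb') as f:
> >        for i in range(os.path.getsize(fname) // dt.itemsize):
> >            f.seek(i * dt.itemsize)  # jump to row i
> >            x = numpy.frombuffer(f.read(8), dtype='f8')  # 'x' bytes only
> >    print("page cache grew by %d kB" % (cached_kb() - before))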
> >
> > [Note: if you try this experiment, make sure that you don't have the
> > same file opened with np.memmap -- for some reason Linux seems to
> > ignore the request to drop_caches for files that are mmap'ed.]
> >
> > The difference between mmap and reading is that in the former case,
> > this cache memory will be "counted against" your process's
> > resident set size. The same memory is used either way -- it's just
> > that it gets reported differently by your tool. And in fact, this
> > memory is not really "used" at all, in the way we usually mean that
> > term -- it's just a cache that the OS keeps, and it will immediately
> > throw it away if there's a better use for that memory. The only reason
> > it's loading the whole 3 gigabytes into memory in the first place is
> > that you have >3 gigabytes of memory to spare.
> >
> > You might even be able to tell the OS that you *won't* be reading that
> > file again, so there's no point in keeping it all cached -- on Unix
> > this is done via the madvise() or posix_fadvise() syscalls. (No
> > guarantee the OS will actually listen, though.)
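> >
> > Newer Pythons (3.3+) expose the latter as os.posix_fadvise; a minimal
> > sketch (the filename here is made up):
> >
> >    import os
> >
> >    fd = os.open('data.bin', os.O_RDONLY)  # hypothetical file
> >    # ... read the data ...
> >    # offset=0, len=0 means "the whole file"; the kernel may ignore it
> >    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
> >    os.close(fd)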
>
> This is interesting, and on my machine I think I've verified that what
> you say is actually true.
>
> This all makes theoretical sense, but goes against some experiments my
> colleagues and I have done.  For example, a colleague of mine was able
> to read in a couple of large files using my code but not using memmap.
> The combined files were larger than memory.  With memmap the code
> started swapping.  This was on 32-bit OS X.  But as I said, I just
> tested this on my Linux box and it works fine with numpy.memmap.  I
> don't have an OS X box to test on.
>

I've seen this on OS X too. Here's another example on Linux:
http://thread.gmane.org/gmane.comp.python.numeric.general/43965. Using
tcmalloc was reported by a couple of people to solve that particular issue.

Ralf