[Numpy-discussion] Possible roadmap addendum: building better text file readers

Wed Feb 29 10:11:51 EST 2012

Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> > Even for binary, there are pathological cases, e.g. 1) reading a random
> > subset of nearly all rows.  2) reading a single column when rows are
> > small.  In case 2 you will only go this route in the first place if you
> > need to save memory.  The user should be aware of these issues.
> 
> FWIW, this route actually doesn't save any memory as compared to np.memmap.

Actually, with numpy.memmap you will read the whole file if you try to
grab a single column and read a large fraction of the rows.  Here is an
example that will end up pulling the entire file into memory:

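    import numpy
    # fname and dtype describe an existing file of fixed-size records
    mm = numpy.memmap(fname, dtype=dtype)
    rows = numpy.arange(mm.size)   # index of every row
    x = mm['x'][rows]              # fancy indexing copies the column out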

I just tested this on a 3 GB binary file and I'm sitting at 3 GB of memory
usage.  I believe this is because numpy.memmap only understands rows.  I
don't fully understand the reason for that, but I suspect it is related
to the fact that the ndarray really only has a concept of itemsize, and
the fields are really just a reinterpretation of those bytes.  It may be
that one could tweak the ndarray code to get around this.  But I would
appreciate enlightenment on this subject.
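
One way to look at the behaviour described above, offered as a sketch
rather than an authoritative account of numpy internals: a field of a
structured array is a strided view whose stride equals the full record
size, and the operating system maps files in whole pages.  When records
are much smaller than a page, every page holds some bytes of the field,
so touching every row touches every page.  The record layout, the
4096-byte page size, and the helper name pages_touched are assumptions
made for illustration.

```python
import numpy as np

# Hypothetical record layout, for illustration only.
rec_dtype = np.dtype([("x", "f8"), ("y", "f8"), ("id", "i4")])
data = np.zeros(10000, dtype=rec_dtype)

# Selecting a field gives a view into the same bytes; consecutive
# elements of the view are a full record apart.
col = data["x"]
assert np.shares_memory(col, data)
assert col.strides == (rec_dtype.itemsize,)


def pages_touched(n_rows, record_size, field_offset, field_size, page_size=4096):
    """Count distinct pages that contain at least one byte of the field."""
    pages = set()
    for i in range(n_rows):
        start = i * record_size + field_offset
        end = start + field_size - 1
        pages.update(range(start // page_size, end // page_size + 1))
    return len(pages)


total_bytes = data.size * rec_dtype.itemsize
total_pages = -(-total_bytes // 4096)  # ceiling division
offset = rec_dtype.fields["x"][1]      # byte offset of x within a record
touched = pages_touched(data.size, rec_dtype.itemsize, offset, 8)
print(touched, "of", total_pages, "pages hold part of column x")
```

With records much larger than a page the count would fall below the
total, which is the case where a memory map can actually skip data.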

This fact was the original motivator for writing my code; the text
reading ability came later.
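
For readers who want the idea in concrete form, here is a minimal
pure-Python sketch of reading one field by seeking to each requested
row.  It is not the Recfile implementation (that is C code, and its
details are not shown in this thread); the helper name read_column, the
record layout, and the temporary file are all hypothetical.  The only
array it allocates is the output column.

```python
import os
import tempfile

import numpy as np


def read_column(fname, dtype, field, rows):
    """Read one scalar field for the given rows using seek + small reads."""
    dtype = np.dtype(dtype)
    field_dtype, offset = dtype.fields[field][:2]
    out = np.empty(len(rows), dtype=field_dtype)
    with open(fname, "rb") as f:
        for i, row in enumerate(rows):
            # Jump straight to the field inside the requested record.
            f.seek(int(row) * dtype.itemsize + offset)
            raw = f.read(field_dtype.itemsize)
            out[i] = np.frombuffer(raw, dtype=field_dtype)[0]
    return out


# Usage on a small file written just for this example.
rec_dtype = np.dtype([("x", "f8"), ("y", "f8"), ("id", "i4")])
data = np.zeros(100, dtype=rec_dtype)
data["x"] = np.arange(100)

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "records.bin")
    data.tofile(fname)
    x = read_column(fname, rec_dtype, "x", [0, 5, 99])

print(x)
```

A per-row seek like this is slow in Python and, as noted earlier in the
thread, is a poor fit when nearly all rows are wanted; it is meant only
to show the access pattern.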

> Cool. I'm just a little concerned that, since we seem to have like...
> 5 different implementations of this stuff all being worked on at the
> same time, we need to get some consensus on which features actually
> matter, so they can be melded together into the Single Best File
> Reader Evar. An interface where indexing and file-reading are combined
> is significantly more complicated than one where the core file-reading
> inner-loop can ignore indexing. So far I'm not sure why this
> complexity would be worthwhile, so that's what I'm trying to
> understand.

I think I've addressed the reason why the low level C code was written.
And I think a unified, high level interface to binary and text files,
which the Recfile class provides, is worthwhile.

Can you please say more about "...one where the core file-reading
inner-loop can ignore indexing"?  I didn't catch the meaning.

-e

> 
> Cheers,
> -- Nathaniel
> 
> > Also, for some crazy ascii files we may want to revert to pure python
> > anyway, but I think these should be special cases that can be flagged
> > at runtime through keyword arguments to the python functions.
> >
> > BTW, did you mean to go off-list?
> >
> > cheers,
> >
> > -e
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
-- 
Erin Scott Sheldon
Brookhaven National Laboratory