[Numpy-discussion] memory-efficient loadtxt

Wed Oct 3 11:48:53 EDT 2012

On Monday, October 1, 2012, Chris Barker wrote:

> Paul,
>
> Nice to see someone working on these issues, but:
>
> I'm not sure what problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot of overhead.
>
> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
>
> I also have a Cython version that is not quite done (darn regular job
> getting in the way) that is both faster and more memory efficient.
>
> Also, frankly, just writing the array pre-allocation and resizing
> code into loadtxt would not be a whole lot of code either, and would
> be both fast and memory efficient.
>
> Let me know if you want any of my code to play with.
>
> >  However, I got the impression that someone was
> > working on a More Advanced (TM) C-based file reader, which will
> > replace loadtxt;
>
> yes -- I wonder what happened with that? Anyone?
>
> -CHB
>
>
>
> this patch is intended as a useful thing to have
> > while we're waiting for that to appear.
> >
> > The patch passes all tests in the test suite, and documentation for
> > the kwarg has been added. I've modified all tests to include the
> > seekable kwarg, but that was mostly to check that all tests are passed
> > also with this kwarg. I guess it's a bit too late for 1.7.0 though?
> >
> > Should I make a pull request? I'm happy to take any and all
> > suggestions before I do.
> >
> > Cheers
> > Paul
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>
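
For anyone following along, the pre-allocation and resizing approach Chris
mentions above could be sketched roughly as below. This is only an
illustration, not Chris's Accumulator code and not what loadtxt does: the
function name load_rows, the starting capacity of 16 rows, and the doubling
growth factor are all assumptions made for the example.

```python
import numpy as np


def load_rows(lines, ncols, dtype=float):
    """Parse whitespace-separated numeric rows into a 2-D array.

    Hypothetical helper: rows are written into a pre-allocated NumPy
    buffer that doubles in size whenever it fills up, so no per-value
    Python objects are kept around in a list.
    """
    buf = np.empty((16, ncols), dtype=dtype)  # assumed initial capacity
    n = 0  # number of rows filled so far
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if n == buf.shape[0]:
            # Buffer is full: allocate one twice as large and copy over.
            bigger = np.empty((2 * buf.shape[0], ncols), dtype=dtype)
            bigger[:n] = buf
            buf = bigger
        # Raises ValueError if the row does not have exactly ncols fields.
        buf[n] = [float(f) for f in fields]
        n += 1
    # Return only the filled part, as its own array.
    return buf[:n].copy()
```

Because the buffer doubles, the number of reallocations grows only
logarithmically with the number of rows, and peak temporary memory stays
within a small constant factor of the final array.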

Working sporadically over the last month, I've finally built a new, very
fast C-based tokenizer/parser for pandas, with type inference, NA-handling,
etc. It's almost ready to ship. It's roughly an order of magnitude faster
than loadtxt and uses very little temporary space. It should be easy to push
upstream into NumPy to replace the innards of np.loadtxt if I can get a bit
of help with the plumbing (it already yields structured arrays in addition
to pandas DataFrames, so there isn't a great deal that needs doing).
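
As a point of reference for what "structured arrays" means here, the existing
np.loadtxt can already return one when given a structured dtype. The sketch
below uses only that existing API, not the new parser; the field names "id"
and "value" and the sample data are made up for illustration.

```python
import io

import numpy as np

# Two comma-separated columns: an integer id and a float value (sample data).
text = io.StringIO("1,2.5\n2,3.5\n")

# A structured dtype gives each column its own name and type.
records = np.loadtxt(
    text,
    delimiter=",",
    dtype=[("id", "i4"), ("value", "f8")],
)

ids = records["id"]        # integer column
values = records["value"]  # float column
```

Each field is then addressable by name, which is the kind of output a
replacement for the loadtxt innards would need to keep producing.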

A blog post with CPU and memory benchmarks will follow; I will post a link here.

- Wes