[Numpy-discussion] memory-efficient loadtxt
Paul Anton Letnes
paul.anton.letnes at gmail.com
Sun Sep 30 10:14:45 EDT 2012
I've modified loadtxt to make it (potentially) more memory efficient.
The idea is that if a user passes a seekable file, (s)he can also pass
the 'seekable=True' kwarg. Then, loadtxt will count the number of
lines (containing data) and allocate an array of exactly the right
size to hold the loaded data. The downside is that the line counting
more than doubles the runtime, as it loops over the file twice, and
there's a sort-of unnecessary np.array function call in the loop. The
branch is called faster-loadtxt, which is silly due to the runtime
doubling, but I'm hoping that the false advertising is acceptable :)
(I naively expected a speedup by removing some needless list
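
To make the two-pass idea concrete, here is a minimal sketch of the approach described above. It is not the code from the branch: the function name load_two_pass, its arguments, and the float-only conversion are assumptions made for illustration, and it handles far fewer cases than loadtxt does.

```python
import io

import numpy as np


def load_two_pass(fh, comments="#", delimiter=None):
    """Hypothetical helper (not the actual patch): read a seekable text
    file into a 2-D float array, allocating the array only once."""

    def data_fields(line):
        # Drop any trailing comment, then split what is left into fields.
        # Returns an empty list for blank or comment-only lines.
        line = line.split(comments, 1)[0].strip()
        return line.split(delimiter) if line else []

    start = fh.tell()

    # Pass 1: count the lines that contain data, and take the number of
    # columns from the first such line.
    nrows = 0
    ncols = 0
    for line in fh:
        fields = data_fields(line)
        if fields:
            if nrows == 0:
                ncols = len(fields)
            nrows += 1

    # Allocate an array of exactly the right size.
    out = np.empty((nrows, ncols), dtype=float)

    # Pass 2: rewind (this is why the file must be seekable) and fill
    # the array row by row.
    fh.seek(start)
    row = 0
    for line in fh:
        fields = data_fields(line)
        if fields:
            out[row] = [float(f) for f in fields]
            row += 1
    return out


if __name__ == "__main__":
    text = "# header\n1 2 3\n\n4 5 6  # trailing comment\n"
    print(load_two_pass(io.StringIO(text)))
```

The trade-off is the one mentioned above: the file is read twice, so the runtime goes up, but no intermediate list of rows has to be held in memory next to the final array.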
I'm pretty sure that the function can be micro-optimized quite a bit
here and there, and in particular, the main for loop is a bit
duplicated right now. However, I got the impression that someone was
working on a More Advanced (TM) C-based file reader, which will
replace loadtxt; this patch is intended as a useful thing to have
while we're waiting for that to appear.
The patch passes all tests in the test suite, and documentation for
the kwarg has been added. I've modified all tests to include the
seekable kwarg, but that was mostly to check that all tests also pass
with this kwarg. I guess it's a bit too late for 1.7.0 though?
Should I make a pull request? I'm happy to take any and all
suggestions before I do.