[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt
derek at astro.physik.uni-goettingen.de
Sun Oct 26 09:43:44 EDT 2014
On 26 Oct 2014, at 02:21 pm, Eelco Hoogendoorn <hoogendoorn.eelco at gmail.com> wrote:
> I'm not sure why the memory doubling is necessary. Isn't it possible to preallocate the arrays and write to them? I suppose this might be inefficient, though, in case you end up reading only a small subset of rows out of a mostly corrupt file? But that seems to be a rather uncommon corner case.
> Either way, I'd say a doubling of memory use is fair game for NumPy. Generality is more important than absolute performance. The most important thing is that temporary Python data structures are avoided. That shouldn't be too hard to accomplish, and would realize most of the performance and memory gains, I imagine.
Preallocation is not straightforward because the parser needs to be able in general to work with streamed input.
I think I even still have a branch on github bypassing this on request (by keyword argument).
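To make the streamed-input constraint concrete, here is a minimal sketch (not the actual np.loadtxt/np.genfromtxt implementation, and the function name is hypothetical) of parsing a text stream of unknown length directly into a single ndarray, growing the buffer by doubling. This is where the factor-of-2 transient memory overhead comes from: preallocation is impossible because the row count is unknown until the stream is exhausted.

```python
import io
import numpy as np

def read_floats(stream, n_cols, chunk_rows=1024):
    """Parse whitespace-separated floats from a text stream into one
    ndarray, growing the backing buffer geometrically (doubling).
    Hypothetical sketch, not NumPy's actual parser."""
    buf = np.empty((chunk_rows, n_cols), dtype=float)
    n = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if n == len(buf):
            # Doubling keeps appends amortized O(1) per row, at the
            # price of up to ~2x transient memory during the resize.
            buf = np.concatenate([buf, np.empty_like(buf)])
        buf[n] = [float(tok) for tok in line.split()]
        n += 1
    # Trim the unused tail; .copy() releases the oversized buffer.
    return buf[:n].copy()
```

Compare this with buffering the whole file as a list of lists of Python strings, where each short string object alone costs tens of bytes, which is roughly where the factor ~6 mentioned below comes from.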
But a factor of 2 is already a huge improvement over the factor of ~6 coming from the current text readers buffering
the entire input as a list of lists of Python strings, not to speak of the vast performance gain from using a parser
implemented in C like pandas' - in fact, one of the last times this subject came up, one suggestion was to steal
pandas.read_csv and adapt it as required.
Someone also posted some code, or a draft thereof, for using resizable arrays quite a while ago, which would
reduce the memory overhead for very large arrays.
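A resizable array along those lines might look like the following sketch (hypothetical; not the code that was posted). It relies on ndarray.resize with refcheck=False to grow the backing buffer in place where possible, so the peak overhead stays bounded near 2x instead of holding both the old and new copies plus Python-object intermediaries.

```python
import numpy as np

class GrowableArray:
    """Minimal sketch of a resizable 1-D array: appends are amortized
    O(1) by doubling the backing buffer on demand."""

    def __init__(self, dtype=float, capacity=16):
        self._data = np.empty(capacity, dtype=dtype)
        self._size = 0

    def append(self, value):
        if self._size == len(self._data):
            # refcheck=False skips the reference check so the buffer
            # can be resized even while we hold a reference to it.
            self._data.resize(2 * len(self._data), refcheck=False)
        self._data[self._size] = value
        self._size += 1

    @property
    def view(self):
        # Window onto the filled portion only; no copy is made.
        return self._data[:self._size]
```

The caller would append rows as they are parsed and take .view (or a final copy) at the end, avoiding any list-of-strings intermediate.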