[Numpy-discussion] memory-efficient loadtxt
Paul Anton Letnes
paul.anton.letnes at gmail.com
Wed Oct 3 12:05:49 EDT 2012
On 1. okt. 2012, at 21:07, Chris Barker wrote:
> Nice to see someone working on these issues, but:
> I'm not sure the problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot overhead.
Oh, there's significant overhead, since we're not talking of a list - we're talking of a list-of-lists. My guesstimate from my hacking session (off the top of my head - I don't have my benchmarks in working memory :) is around 3-5 times more memory with the list-of-lists approach for a single column / 1D array, which presumably is the worst case (a length 1 list for each line of input). Hence, if you want to load a 2 GB file into RAM on a machine with 4 GB of the stuff, you're screwed with the old approach and a happy camper with mine.
> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I"ve
> written, and posted here, code that provides an Acumulator that uses
> numpy internally, so not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor of 2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missong something here?
More information about the NumPy-Discussion