[Numpy-discussion] memory-efficient loadtxt

Paul Anton Letnes paul.anton.letnes at gmail.com
Wed Oct 3 12:36:28 EDT 2012


On 3 Oct 2012, at 18:22, Chris Barker wrote:

> On Wed, Oct 3, 2012 at 9:05 AM, Paul Anton Letnes
> <paul.anton.letnes at gmail.com> wrote:
> 
>>> I'm not sure what problem you are trying to solve -- accumulating in a
>>> list is pretty efficient anyway -- not a whole lot of overhead.
>> 
>> Oh, there's significant overhead, since we're not talking of a list - we're talking of a list-of-lists.
> 
> hmm, a list of numpy scalars (custom dtype) would be a better option,
> though maybe not all that much better -- still an extra pointer and
> python object for each row.
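
For the archives, here's a rough, untested sketch of the two accumulation
patterns being compared - the dtype and the sample lines are made up purely
for illustration, not taken from loadtxt itself:

import numpy as np

row_dtype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
lines = ["1.0 2.0 3.0", "4.0 5.0 6.0"]

# list of lists: every field becomes a separate Python float object
rows_as_lists = [[float(v) for v in line.split()] for line in lines]
arr1 = np.array(rows_as_lists, dtype='f8')

# list of numpy scalars (custom dtype): fields are packed per row, but
# there is still one Python object and one list pointer per row
rows_as_records = [np.array(tuple(float(v) for v in line.split()),
                            dtype=row_dtype)[()] for line in lines]
arr2 = np.array(rows_as_records, dtype=row_dtype)
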
> 
> 
>> I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor of 2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missing something here?
> 
> Indeed, I think that's how my current accumulator works -- the
> __array__() method returns a copy of the data buffer, so that you
> won't accidentally re-allocate it under the hood later and screw up
> the returned version.
> 
> But it is indeed accumulating in a numpy array, so it should be
> possible, maybe even easy, to turn it into a regular array without a
> data copy. You'd just have to destroy the original somehow (or mark it
> as never-resize) so you wouldn't have the clash. Messing with the
> OWNDATA flag might take care of that.
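
For reference, a stripped-down sketch of the kind of accumulator being
described - the class and method names here are mine, not the actual code,
and the no-copy hand-off at the end is just the idea above, untested:

import numpy as np

class Accumulator(object):
    def __init__(self, dtype, initial=1024):
        self._buf = np.empty(initial, dtype=dtype)
        self._n = 0

    def append(self, row):
        if self._n == len(self._buf):
            # reallocation on growth is exactly why handing out a live
            # view of the buffer would be unsafe
            self._buf = np.resize(self._buf, 2 * len(self._buf))
        self._buf[self._n] = row
        self._n += 1

    def __array__(self):
        # return a copy, so later growth can't pull the data out from
        # under the caller
        return self._buf[:self._n].copy()

    def finalize(self):
        # no-copy hand-off: return a view and drop our own reference,
        # so the buffer is never resized again
        out = self._buf[:self._n]
        self._buf = None
        return out
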
> 
> But it seems Wes has a better solution.

Indeed.

> One other note, though -- if you have arrays that are that close to
> max system memory, you are very likely to have other trouble anyway --
> numpy does make a lot of copies!

That's true. Now, I'm not worried about this myself, but several people have complained about it on the mailing list, and it seemed like an easy fix. Oh well, it's too late for that now, anyway.

Paul
