[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt
njs at pobox.com
Sun Oct 26 10:16:11 EDT 2014
On 26 Oct 2014 11:54, "Jeff Reback" <jeffreback at gmail.com> wrote:
> you should have a read here:
> going below the 2x memory usage on read-in is non-trivial and costly in
> terms of performance
On Linux you can probably go below 2x overhead fairly easily, by exploiting
the fact that realloc on large memory blocks is basically O(1) (yes, really).
Sadly OS X does not provide anything similar, and I can't tell for sure about
other platforms.
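To make the realloc idea concrete, here is a minimal sketch of a reader that grows its output array in place with ndarray.resize (which reallocates the buffer) instead of accumulating Python lists. The function name and the one-value-per-line format are illustrative assumptions, not anything from NumPy's actual readers:

```python
# Illustrative sketch (hypothetical reader, not NumPy's implementation):
# grow the output buffer geometrically with ndarray.resize, which
# reallocates in place. On Linux, realloc on large blocks can remap
# pages rather than copy, so growth stays cheap.
import numpy as np

def read_floats(lines, initial=1024):
    out = np.empty(initial, dtype=np.float64)
    n = 0
    for line in lines:
        if n == out.size:
            # double the buffer; refcheck=False since we hold the
            # only reference to this array
            out.resize(out.size * 2, refcheck=False)
        out[n] = float(line)
        n += 1
    out.resize(n, refcheck=False)  # trim to the actual length
    return out
```

Peak memory here is the output buffer plus one line of text, rather than a full list of Python float objects alongside the array.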
Though on further thought, the numbers Wes quotes there aren't actually the
most informative: Massif reports how much virtual memory you have allocated,
but a lot of that is going to be a pure VM accounting trick. The
output array memory will actually be allocated incrementally one block at a
time as you fill it in. This means that if you can free each temporary
chunk immediately after you copy it into the output array, then even simple
approaches can have very low overhead. It's possible pandas's actual
overhead is already closer to 1x than 2x, and this is just hidden by the
tools Wes is using to measure it.
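The "free each temporary chunk right after copying it" approach can be sketched with pandas's chunked reader. The function below is a hypothetical illustration, and it assumes the row and column counts are known up front, which a real reader would not require:

```python
# Illustrative sketch: parse a CSV in chunks, copy each chunk into a
# preallocated output array, and drop the chunk immediately, so peak
# overhead is one chunk rather than a second full copy of the data.
# Assumes n_rows and n_cols are known in advance (a simplification).
import io
import numpy as np
import pandas as pd

def read_csv_lowmem(path_or_buf, n_rows, n_cols, chunksize=10_000):
    out = np.empty((n_rows, n_cols), dtype=np.float64)
    start = 0
    for chunk in pd.read_csv(path_or_buf, header=None, chunksize=chunksize):
        stop = start + len(chunk)
        out[start:stop] = chunk.to_numpy()
        start = stop
        del chunk  # release the temporary before parsing the next one
    return out[:start]
```

Because each chunk is freed before the next one is parsed, the transient overhead stays near one chunk's worth of memory, consistent with the point that the true overhead may be much closer to 1x than the Massif numbers suggest.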
> On Oct 26, 2014, at 4:46 AM, Saullo Castro <saullogiovani at gmail.com>
wrote:
>> I would like to start working on a memory-efficient alternative for
>> np.loadtxt and np.genfromtxt that uses arrays instead of lists to store
>> the data while the file iterator is exhausted.
>> The motivation came from this SO question:
>> where for huge arrays the current NumPy ASCII readers are really slow
>> and require ~6 times more memory. For this case I also tested Pandas'
>> read_csv(), which required 2 times more memory.
>> I would be glad if you could share your experience on this matter.
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org