[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

Sun Oct 26 09:41:07 EDT 2014

On 26 October 2014 12:54, Jeff Reback <jeffreback at gmail.com> wrote:

> you should have a read here/
> http://wesmckinney.com/blog/?p=543
>
> going below the 2x memory usage on read in is non trivial and costly in
> terms of performance
>

If you know in advance the number of rows (because it is in the header,
counted with wc -l, or any other prior information) you can preallocate the
array and fill in the numbers as you read, with virtually no overhead.

If the number of rows is unknown, an alternative is to use a chunked data
container like Bcolz [1] (former carray) instead of Python structures. It
may be used as such, or copied back to a ndarray if we want the memory to
be aligned. Including a bit of compression we can get the memory overhead
to somewhere under 2x (depending on the dataset), at the cost of not so
much CPU time, and this could be very useful for large data and slow
filesystems.

/David.

[1] http://bcolz.blosc.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20141026/c9986af1/attachment.html>