[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt
jeffreback at gmail.com
Sun Oct 26 10:09:39 EDT 2014
You are describing a special case: you know the data size a priori (i.e. not streaming), the dtypes are readily apparent from a small sample, and in general your data is not messy.
I would agree that if these conditions are satisfied, you can get closer to a 1x memory overhead.
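For illustration, here is a minimal sketch of that special case, assuming whitespace-delimited, purely numeric text and a row and column count known up front. The helper name `load_known_rows` and its parameters are made up for this example and are not part of numpy.

```python
import io
import numpy as np

def load_known_rows(lines, n_rows, n_cols, delimiter=None):
    """Hypothetical helper: parse numeric text into a preallocated array."""
    # Allocate the final array once; only one line is held as Python
    # objects at any time, so the overhead stays close to 1x.
    out = np.empty((n_rows, n_cols), dtype=np.float64)
    i = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        out[i, :] = [float(field) for field in line.split(delimiter)]
        i += 1
    if i != n_rows:
        raise ValueError(f"expected {n_rows} rows, got {i}")
    return out

# Usage with an in-memory stand-in for a file
arr = load_known_rows(io.StringIO("1 2 3\n4 5 6\n"), n_rows=2, n_cols=3)
```

This breaks down as soon as the row count is wrong or a column turns out not to be numeric, which is the "messy data" caveat above.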
Using bcolz is great, but it is probably not a realistic dependency for numpy (you should probably just memory map the data directly instead), though this has a big performance impact, so you need to weigh these things.
Not all cases deserve the same treatment. Chunking is often the best option IMHO: it provides constant memory usage (though ultimately still 2x), and combined with memory mapping it can provide fixed resource utilization.
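A rough sketch of chunking combined with memory mapping, assuming numeric whitespace-delimited text with a fixed, known number of columns. The function `text_to_memmap` and its `chunk_rows` parameter are hypothetical names for this example. Each chunk is parsed, appended to a raw binary file, and dropped; the result is then opened with `np.memmap`, so peak memory is bounded by the chunk size rather than the file size.

```python
import io
import itertools
import os
import tempfile
import numpy as np

def text_to_memmap(lines, bin_path, n_cols, chunk_rows=100_000):
    """Hypothetical helper: parse text in chunks into a memory-mapped array."""
    lines = iter(lines)
    total = 0
    with open(bin_path, "wb") as out:
        while True:
            chunk = list(itertools.islice(lines, chunk_rows))
            if not chunk:
                break
            rows = [[float(f) for f in ln.split()] for ln in chunk if ln.strip()]
            if not rows:
                continue  # chunk contained only blank lines
            block = np.array(rows, dtype=np.float64)
            if block.shape[1] != n_cols:
                raise ValueError("unexpected number of columns")
            block.tofile(out)  # append raw float64 bytes in C order
            total += block.shape[0]
    if total == 0:
        raise ValueError("no data rows found")
    # Map the finished file read-only; pages are loaded on demand.
    return np.memmap(bin_path, dtype=np.float64, mode="r", shape=(total, n_cols))

# Usage: 10 rows parsed 3 at a time
text = "\n".join(f"{i} {i * 2}" for i in range(10))
path = os.path.join(tempfile.mkdtemp(), "data.bin")
mm = text_to_memmap(io.StringIO(text), path, n_cols=2, chunk_rows=3)
```

The row count does not need to be known in advance here, at the cost of writing the data to disk once before it can be used.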
> On Oct 26, 2014, at 9:41 AM, Daπid <davidmenhur at gmail.com> wrote:
>> On 26 October 2014 12:54, Jeff Reback <jeffreback at gmail.com> wrote:
>> you should have a read here/
>> going below the 2x memory usage on read in is non trivial and costly in terms of performance
> If you know in advance the number of rows (because it is in the header, counted with wc -l, or any other prior information) you can preallocate the array and fill in the numbers as you read, with virtually no overhead.
> If the number of rows is unknown, an alternative is to use a chunked data container like Bcolz (formerly carray) instead of Python structures. It may be used as such, or copied back to an ndarray if we want the memory to be aligned. With a bit of compression we can get the memory overhead to somewhere under 2x (depending on the dataset), at the cost of a modest amount of CPU time, and this could be very useful for large data and slow filesystems.
>  http://bcolz.blosc.org/