[Numpy-discussion] Efficient way to load a 1Gb file?

Tue Aug 23 12:07:08 EDT 2011

On 11.08.2011, at 8:50PM, Russell E. Owen wrote:

> It seems a shame that loadtxt has no argument for predicted length, 
> which would allow preallocation and less appending/copying data.
> 
> And yes...reading the whole file first to figure out how many elements 
> it has seems sensible to me -- at least as a switchable behavior, and 
> preferably the default. 1Gb isn't that large in modern systems, but 
> loadtxt is filling up all 6Gb of RAM reading it!

1 GB is indeed not much in terms of disk space these days, but storing that 
much data in text files is nonetheless far from state of the art ;-)
That said, there is of course no justification for using excessive amounts of 
memory where it can be avoided! 
Implementing the above scheme for npyio is not quite as straightforward 
as in the example I gave before, mainly for the following reasons: 

loadtxt also has to deal with more complex data like structured arrays, 
plus comments, empty lines etc., meaning it has to find and count the 
actual valid data lines. 
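
To illustrate what counting the "actual valid data lines" involves, here is a 
minimal sketch of such a counting pass. The helper name count_data_lines and 
its comments argument are made up for this example; this is not the code from 
npyio, and it ignores options such as skiprows or converters.

```python
import io

def count_data_lines(fh, comments="#"):
    """Count lines that still hold data once comments and whitespace are removed.

    Hypothetical helper for illustration only, not part of numpy.
    """
    n = 0
    for line in fh:
        # Keep only the text before the comment marker, then strip whitespace.
        if line.split(comments, 1)[0].strip():
            n += 1
    return n

sample = io.StringIO(
    "# header\n"
    "1.0 2.0\n"
    "\n"
    "3.0 4.0  # trailing comment\n"
    "   \n"
    "5.0 6.0\n"
)
nrows = count_data_lines(sample)
print(nrows)  # 3
```

The count can then serve as the first dimension of a preallocated array 
before the file is read a second time.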

Ideally, genfromtxt, which offers yet more functionality to deal with missing 
data, should offer the same options, but they would certainly be more 
difficult to implement there. 

More than 6 GB is still remarkable - from what I found on the web, lists 
seem to consume ~24 Bytes/element, i.e. 3 times more than a final float64 
array. The text representation would typically take 10-20 characters for one 
float (though with <12 digits, they could usually be read as float32 without 
loss of precision). Thus a factor >6 seems quite extreme, unless the file 
is full of (relatively) short integers...
But this also means copying of the final array would still have a relatively 
low memory footprint compared to the buffer list, thus using some kind of 
mutable array type for reading should be a reasonable solution as well. 
Unfortunately fromiter is not of much use here since it only reads 
1D-arrays. I haven't tried to use Chris' accumulator class yet, so for now 
I went with the 2x read approach in loadtxt, which turned out to add only 
~10% to the read-in time. For compressed files this goes up to 30-50%, but 
once physical memory is exhausted the two-pass read should probably become 
the faster option. 
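
The memory estimate at the start of the paragraph above can be checked 
roughly with the sketch below. It assumes CPython and relies on 
sys.getsizeof, so the numbers depend on the interpreter and platform; it 
also counts the pointer each list slot holds in addition to the float 
object itself.

```python
import sys
import numpy as np

n = 100_000
# Distinct float objects, as a line-by-line reader would accumulate them.
values = [float(i) + 0.5 for i in range(n)]
arr = np.array(values, dtype=np.float64)

# The list object holds one pointer per element; every float is a
# separate Python object on top of that.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes  # 8 bytes per float64 element

print("list :", list_bytes / n, "bytes/element")
print("array:", array_bytes / n, "bytes/element")
print("ratio:", list_bytes / array_bytes)
```

Whatever the exact ratio on a given system, the final array is the small 
part, which is why one extra copy of it matters little compared to the 
buffer list.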

I've made a pull request 
https://github.com/numpy/numpy/pull/144
implementing that option as a switch 'prescan'; could you review it in 
particular regarding the following:

Is the option reasonably named and documented?

In case the allocated array does not match the input data (which 
really should never happen), right now only a warning is issued, and 
any excess buffer is filled with zeros or any remaining input data is 
discarded - should this raise an IndexError instead?
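
To make the second question concrete, here is a sketch of the two 
behaviours under discussion. It is not the code from the pull request: the 
function fill_preallocated and its strict switch are hypothetical, and it 
assumes plain whitespace-separated floats with '#' comments and the same 
number of fields on every row.

```python
import warnings
import numpy as np

def fill_preallocated(lines, nrows, ncols, strict=False):
    """Parse rows of floats into an array allocated before reading.

    nrows is the predicted row count (e.g. from a prescan). strict is a
    hypothetical switch: raise IndexError on a mismatch instead of warning.
    """
    out = np.zeros((nrows, ncols), dtype=np.float64)
    filled = 0   # rows written into the buffer
    extra = 0    # data rows found after the buffer was full
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue  # comment-only or empty line
        if filled < nrows:
            out[filled] = [float(f) for f in fields]
            filled += 1
        else:
            extra += 1  # more data than predicted; the row is discarded
    if filled < nrows or extra:
        msg = f"expected {nrows} data rows, found {filled + extra}"
        if strict:
            raise IndexError(msg)
        warnings.warn(msg)  # unfilled rows stay zero
    return out

data = ["# x y", "1 2", "3 4", "", "5 6"]
print(fill_preallocated(data, nrows=3, ncols=2))
```

With strict=False a wrong prediction yields a zero-padded or truncated 
array plus a warning; with strict=True the caller cannot silently end up 
with incomplete data.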

No prediction if/when I might be able to provide this for genfromtxt, sorry!

Cheers,
							Derek