[Numpy-discussion] Efficient way to load a 1Gb file?

Russell E. Owen rowen at uw.edu
Thu Sep 1 19:47:36 EDT 2011

In article 
<781AF0C6-B761-4ABB-9798-9385582536E5 at astro.physik.uni-goettingen.de>,
 Derek Homeier <derek at astro.physik.uni-goettingen.de> wrote:

> On 11.08.2011, at 8:50PM, Russell E. Owen wrote:
> > It seems a shame that loadtxt has no argument for predicted length, 
> > which would allow preallocation and less appending/copying data.
> > 
> > And yes...reading the whole file first to figure out how many elements 
> > it has seems sensible to me -- at least as a switchable behavior, and 
> > preferably the default. 1Gb isn't that large in modern systems, but 
> > loadtxt is filling up all 6Gb of RAM reading it!
> 1 GB is indeed not much in terms of disk space these days, but using text 
> files for such data amounts is nonetheless very much non-state-of-the-art ;-)
> That said, of course there is no justification to use excessive amounts of 
> memory where it could be avoided! 
> Implementing the above scheme for npyio is not quite as straightforward 
> as in the example I gave before, mainly for the following reasons: 
> loadtxt also has to deal with more complex data like structured arrays, 
> plus comments, empty lines etc., meaning it has to find and count the 
> actual valid data lines. 
> Ideally, genfromtxt, which offers yet more functionality to deal with missing 
> data, should offer the same options, but they would be certainly more 
> difficult to implement there. 
> More than 6 GB is still remarkable - from what info I found on the web, lists
> seem to consume ~24 Bytes/element, i.e. 3 times more than a final float64 
> array. The text representation would typically take 10-20 characters for one
> float (though with <12 digits, they could usually be read as float32 without 
> loss of precision). Thus a factor >6 seems quite extreme, unless the file 
> is full of (relatively) short integers...
> But this also means copying of the final array would still have a relatively 
> low memory footprint compared to the buffer list, thus using some kind of 
> mutable array type for reading should be a reasonable solution as well. 
> Unfortunately fromiter is not of that much use here since it only reads 
> 1D-arrays. I haven't tried to use Chris' accumulator class yet, so for now
> I went with the 2x read approach with loadtxt; it turned out to add only ~10%
> to the read-in time. For compressed files this goes up to 30-50%, but 
> once physical memory is exhausted it should probably actually become 
> faster. 
> I've made a pull request 
> https://github.com/numpy/numpy/pull/144
> implementing that option as a switch 'prescan'; could you review it in 
> particular regarding the following:
> Is the option reasonably named and documented?
> In case the allocated array does not match the input data (which
> really should never happen), right now just a warning is issued, 
> filling any excess buffer with zeros or discarding remaining input data - 
> should this rather raise an IndexError?
> No prediction if/when I might be able to provide this for genfromtxt, sorry!
> Cheers,
>                                                         Derek

This looks like a great improvement to me! I think the name is well 
chosen and the help is very clear.

A few comments:
- Might you rename the variable "l"? It is easily confused with the 
digit 1.
- I don't understand the l < n_valid test, so this may be off base, but 
I'm surprised that you first massage the data and then raise an 
exception. Is the massaged data any use after the exception is raised? 
Naively I would expect you to issue a warning instead of raising an 
exception if you are going to handle the error by massaging the data.

(It is a pity that your patch duplicates so much parsing code, but I 
don't see a better way to do it. Putting conditionals in the parsing 
loop to decide how to handle each line based on prescan would presumably 
slow things down too much.)
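
A minimal sketch of the two-pass idea discussed above, for illustration
only. This is not the code from the pull request: the names
count_data_lines and load_two_pass are hypothetical, and it assumes
whitespace-separated float columns, "#" comments, and a seekable file
object. It counts the valid data lines first, preallocates the array,
then fills it on a second read, warning (rather than raising) if the
two passes disagree.

```python
import io
import warnings

import numpy as np


def count_data_lines(fh, comments="#"):
    """First pass: count lines that still hold data after stripping comments."""
    n_valid = 0
    for line in fh:
        if line.split(comments, 1)[0].strip():
            n_valid += 1
    return n_valid


def load_two_pass(fh, comments="#", dtype=float):
    """Hypothetical two-pass reader: count rows, preallocate, then fill."""
    n_valid = count_data_lines(fh, comments)
    fh.seek(0)  # rewind for the second pass; needs a seekable file object

    data = None
    row = 0
    for line in fh:
        fields = line.split(comments, 1)[0].split()
        if not fields:
            continue  # skip blank and comment-only lines
        if data is None:
            # The column count is taken from the first data line.
            data = np.zeros((n_valid, len(fields)), dtype=dtype)
        if row >= n_valid:
            warnings.warn("more data lines than counted; discarding the rest")
            break
        data[row] = [float(f) for f in fields]
        row += 1

    if data is None:
        return np.empty((0, 0), dtype=dtype)
    if row < n_valid:
        warnings.warn("fewer data lines than counted; trailing rows stay zero")
    return data


if __name__ == "__main__":
    text = "# header\n1 2 3\n\n4 5 6  # trailing comment\n"
    print(load_two_pass(io.StringIO(text)))
```

The counting pass repeats the comment and blank-line handling of the
fill pass, which is the same duplication noted above; the payoff is
that only one array of the final size is ever allocated.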


-- Russell
