
On 02.09.2011, at 5:50PM, Chris.Barker wrote:
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
So there is little cost, and for the common use case, it would be faster and cleaner.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
I still haven't studied your class in detail, but one could probably just create a copy of the array read in so far, e.g. changing it from dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')] as required -- or even implement it first as a list or dict of arrays that can be changed individually, and only create a record array from them at the end. The required copying and extra memory use would definitely pale compared to the text parsing or the current memory usage for the input list. In my loadtxt version [https://github.com/numpy/numpy/pull/144], just scanning the text for comment lines adds ca. 10% to the run time, while any of the array allocation and copying operations should at most be at the 1% level.
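For illustration, a minimal sketch of that promotion step (the sample values are made up for the example, not taken from the pull request):

    import numpy as np

    # Rows parsed so far, stored with an integer first column:
    parsed = np.array([(1, 2.5), (3, 4.5)],
                      dtype=[('f0', '<i8'), ('f1', '<f8')])

    # A later row turns out to carry a float in the first column, so
    # promote the data read in so far with a single copy and carry on:
    parsed = parsed.astype([('f0', '<f8'), ('f1', '<f8')])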
enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)
I _think_ this would benefit from a one-pass solution as well -- so you don't need to de-compress twice.
Absolutely; on compressed data the time for the extra pass jumps up to +30-50%.

Cheers,
			Derek
--
----------------------------------------------------------------
Derek Homeier
Centre de Recherche Astrophysique de Lyon
ENS Lyon
46, Allée d'Italie
69364 Lyon Cedex 07, France
+33 1133 47272-8894
----------------------------------------------------------------
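P.S.: For reference, a minimal sketch of the _datasource route suggested above (the filename is hypothetical; np.lib._datasource.open() is the same helper genfromtxt uses, and it decompresses gzip/bz2 input transparently):

    import numpy as np

    # _datasource.open() returns an ordinary file-like object and
    # decompresses .gz/.bz2 input on the fly, so the parsing loop
    # never needs to know whether the file was compressed:
    fh = np.lib._datasource.open('data.txt.gz', 'r')
    try:
        for line in fh:
            pass  # tokenize and convert the line here
    finally:
        fh.close()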