
Sorry I'm only now getting around to thinking more about this; I've been sidetracked by stats stuff. On Fri, Sep 2, 2011 at 10:50 AM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/2/11 8:22 AM, Derek Homeier wrote:
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt -- however, there I've also been made aware that this approach breaks streamed input, etc., so the buffer.resize(…) methods in accumulator would be the better way to go.
I'll read more about this soon. I haven't thought about it, and I didn't realize it was breaking anything.
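For concreteness, here is roughly what I understand the resize-based approach to look like -- a minimal hypothetical sketch (the class name and API are made up, not Derek's actual code):

    import numpy as np

    class Accumulator:
        """Growable typed buffer: appends stay amortized O(1)."""

        def __init__(self, dtype=np.float64):
            self._buf = np.empty(1024, dtype=dtype)
            self._n = 0

        def append(self, value):
            if self._n == len(self._buf):
                # Grow in place. refcheck=False because the attribute
                # access itself holds a temporary reference that would
                # otherwise trip NumPy's reference check.
                self._buf.resize(2 * len(self._buf), refcheck=False)
            self._buf[self._n] = value
            self._n += 1

        def toarray(self):
            # Trim the unused tail; copy so later appends can't mutate it.
            return self._buf[:self._n].copy()

Because the buffer only ever grows, this works row-by-row on streamed input, which is exactly what pre-allocation can't do.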
Good point, that would be nice.
For loadtable this is not quite as straightforward, though, because the type auto-detection, strictly done, requires scanning the entire input: a column full of ints could still produce a float in the last row…
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
I hadn't thought of that. Interesting idea. I'm surprised that completely resetting the array could be faster.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium-sized csv file (about 3000 rows by 30 columns), about 10% of the time was spent determining the types in the first read through and 90% was spent sticking the data in the array. However, that particular test took longer to read in because the data was quoted (converting '"3.25"' to a float took between 1.5x and 2x as long as '3.25') and the datetime conversion is costly. Regardless, that suggests making the data loading faster is more important than avoiding reading through the file twice. I guess that intuition probably breaks down if the data doesn't fit into memory, though. But I haven't worked with extremely large data files before, so I'd appreciate refutation/confirmation of my priors.
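As a rough illustration of the per-field parsing cost (a toy timing, not the benchmark I described above -- numbers will vary by machine):

    import timeit

    # Stripping quotes before float() adds measurable per-field overhead.
    plain = timeit.timeit("float('3.25')", number=1000000)
    quoted = timeit.timeit("float(s.strip('\"'))",
                           setup="s = '\"3.25\"'", number=1000000)
    print("plain: %.2fs  quoted: %.2fs  ratio: %.2fx"
          % (plain, quoted, quoted / plain))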
So there is little cost, and for the common use case, it would be faster and cleaner.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
Perhaps. I know that in the 'really annoying dataset whose quick and easy loading should be your use case' example I was given, about halfway through the data one of the columns got its first observation. (It was time series data where one of the columns wasn't observed until halfway through the observation period.) So I'm not sure it'd be as rare as we'd like.
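To make the restart idea concrete, here's a rough sketch of the control flow I have in mind (the function name, promotion chain, and parsing are all made up for illustration; real type detection would be per-column):

    import numpy as np

    def load_with_restart(path, ncols, delimiter=','):
        # Start with the most optimistic dtype and promote on failure:
        # int64 -> float64 -> 64-char strings.
        dtype = np.dtype(np.int64)
        promote = {np.dtype(np.int64): np.dtype(np.float64),
                   np.dtype(np.float64): np.dtype('U64')}
        while True:
            buf = np.empty((1024, ncols), dtype=dtype)
            n = 0
            try:
                with open(path) as f:
                    for line in f:
                        if n == len(buf):
                            # Grow in place; the row length is unchanged,
                            # so existing rows keep their positions.
                            buf.resize((2 * len(buf), ncols),
                                       refcheck=False)
                        # Assignment casts each string field to the
                        # current dtype, raising ValueError if it can't.
                        buf[n] = line.strip().split(delimiter)
                        n += 1
                return buf[:n].copy()
            except ValueError:
                # A field didn't fit (e.g. '3.25' in an int column):
                # promote the dtype, rewind, and start over.
                dtype = promote[dtype]

The worst case is one full re-read per promotion step, which matches Chris's point that re-winding more than once should be rare.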
For better consistency with what people have likely gotten used to from npyio, I'd recommend some minor changes:
make spaces the default delimiter
+1
Sure.
enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)
I _think_ this would benefit from a one-pass solution as well -- so you don't need to decompress twice.
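For reference, this is roughly how I'd expect the hookup to look ('data.csv.gz' is a placeholder, and _datasource is a private module, so no API guarantees):

    import numpy as np

    # np.lib._datasource.open() transparently handles .gz/.bz2 files
    # (and URLs), which is how genfromtxt gets decompression for free.
    fhd = np.lib._datasource.open('data.csv.gz', 'r')
    try:
        for line in fhd:
            fields = line.strip().split(',')  # already-decompressed text
    finally:
        fhd.close()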
-Chris