Re: [Numpy-discussion] load from text files Pull Request Review

2 Sep 2011

On 9/2/11 9:16 AM, Christopher Jordan-Squire wrote:
...
...
...
I agree it would make a very nice addition, and could complement my
pre-allocation option for loadtxt - however there I've also been made
aware that this approach breaks streamed input etc., so the buffer.resize(…)
methods in accumulator would be the better way to go.
I'll read more about this soon. I haven't thought about it, and I
didn't realize it was breaking anything.
you could call it a missing feature, rather than breaking...
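For illustration, here is a minimal sketch of the resize-as-you-go idea. GrowableBuffer and load_stream are hypothetical names, and this is not the accumulator class discussed in this thread; it only shows how ndarray.resize() lets you fill an array from input that cannot be re-read or counted in advance.

```python
import numpy as np


class GrowableBuffer:
    """Hypothetical 1-d accumulator built on ndarray.resize()."""

    def __init__(self, dtype=float, capacity=16):
        self._data = np.empty(max(1, capacity), dtype=dtype)
        self._length = 0

    def append(self, value):
        if self._length == self._data.shape[0]:
            # Double the capacity so appends are amortized O(1).
            # refcheck=False skips the reference-count check; that is
            # acceptable here because no views of the buffer are handed
            # out while it is still growing.
            self._data.resize(2 * self._data.shape[0], refcheck=False)
        self._data[self._length] = value
        self._length += 1

    def to_array(self):
        # Return a trimmed copy so later growth cannot affect the result.
        return self._data[:self._length].copy()


def load_stream(lines):
    """Consume any iterable of text lines once, one float per line."""
    buf = GrowableBuffer(dtype=float, capacity=4)
    for line in lines:
        text = line.strip()
        if text:
            buf.append(float(text))
    return buf.to_array()
```

Because the input is only iterated once, this works with a pipe or a generator as well as with a file.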
...
...
hmmm -- it seems you could just as well be building the array as you go,
and if you hit a change in the input, re-set and start again.
I hadn't thought of that. Interesting idea. I'm surprised that
completely resetting the array could be faster.
releasing memory and re-allocating doesn't take long at all.
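A rough sketch of what "re-set and start again" could look like for a single column, assuming the input is seekable. load_column and its int-then-float candidate list are made up for this example; they are not taken from any of the loadtxt patches.

```python
import io

import numpy as np


def load_column(stream):
    """Read one number per line; start as int and restart as float if needed.

    Returns the array and the number of restarts that were required.
    """
    candidates = [(np.int64, int), (np.float64, float)]
    restarts = 0
    for dtype, convert in candidates:
        stream.seek(0)  # re-wind; this is why the input must be seekable
        values = np.empty(16, dtype=dtype)
        n = 0
        try:
            for line in stream:
                text = line.strip()
                if not text:
                    continue
                if n == len(values):
                    # Out of room: double the allocation.
                    values = np.concatenate([values, np.empty_like(values)])
                values[n] = convert(text)
                n += 1
        except ValueError:
            # The data changed type part-way through: drop the array
            # and start over with the next, wider candidate.
            restarts += 1
            continue
        return values[:n].copy(), restarts
    raise ValueError("no candidate dtype could parse the input")
```

For example, load_column(io.StringIO("1\n2\n3.5\n")) throws away the int64 array when it reaches "3.5" and re-reads everything as float64.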
...
...
In my tests, I'm pretty sure that the time spent on file I/O and string
parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium sized csv file (about 3000 rows by
30 columns), about 10% of the time was spent determining the types in the
first read through and 90% of the time was sticking the data in the
array.
I don't know how that can even be possible:

Don't you have to load and parse the entire file to determine the data 
types?

Once you've allocated, then all you are doing is setting a value in the 
array -- that has got to be fast.

Also, the second time around, you may be taking advantage of disk cache, 
so that should be faster for that reason.

Even so -- you may be able to save much of that 10%.
...
However, that particular test took more time for reading in because
the data was quoted (so converting '"3,25"' to a float took between
1.5x and 2x as long as '3.25') and the datetime conversion is costly.
Didn't you have to do all that on the first pass as well? Or are you 
only checking for gross info -- length of rows, etc?
...
Regardless, that suggests making the data loading faster is more
important than avoiding reading through the file twice. I guess that
intuition probably breaks if the data doesn't fit into memory,
though.
if the data don't fit into memory, then you need to go to memmapped 
arrays or something -- a whole new ball of wax.
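Just to make the memmapped option concrete, a minimal sketch, assuming a purely numeric comma-separated file with a known number of columns. text_to_memmap is a hypothetical helper; it still needs a first pass to count rows, and it will fail on a file with no data rows because an empty file cannot be mapped.

```python
import numpy as np


def text_to_memmap(text_path, out_path, ncols):
    """Parse a numeric CSV into a disk-backed float64 array."""
    # First pass: count the non-blank rows so the file can be sized.
    with open(text_path) as f:
        nrows = sum(1 for line in f if line.strip())

    # mode="w+" creates (or overwrites) out_path at the required size.
    mm = np.memmap(out_path, dtype=np.float64, mode="w+", shape=(nrows, ncols))

    # Second pass: parse and store one row at a time.
    with open(text_path) as f:
        row = 0
        for line in f:
            if not line.strip():
                continue
            mm[row] = [float(tok) for tok in line.split(",")]
            row += 1

    mm.flush()  # push pending changes out to disk
    return mm
```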
...
...
There is a chance, of course, that you might have to re-wind and start
over more than once, but I suspect that that is the rare case.
Perhaps. I know that in the 'really annoying dataset that loading
quickly and easily should be your use case' example I was given, about
half-way through the data one of the columns got its first
observation.
OK -- but isn't that just one re-wind?

On 9/2/11 9:17 AM, Derek Homeier wrote:
...
...
There is a chance, of course, that you might have to re-wind and start
over more than once, but I suspect that that is the rare case.
I still haven't studied your class in detail, but one could probably actually
just create a copy of the array read in so far, e.g. changing it from a
dtype=[('f0', '
good point -- that would be a better way to do it, and only a tiny bit 
harder.
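Something along these lines, as a sketch only. promote_field is a hypothetical helper, and the field names and types in the usage below are my own choice for illustration, not the ones from the truncated example above.

```python
import numpy as np


def promote_field(arr, field, new_type):
    """Return a copy of structured array `arr` with one field's type changed."""
    new_descr = [
        (name, new_type if name == field else arr.dtype[name])
        for name in arr.dtype.names
    ]
    out = np.empty(arr.shape, dtype=new_descr)
    # Copy field by field; the promoted field is cast on assignment.
    for name in arr.dtype.names:
        out[name] = arr[name]
    return out


# Rows read so far, with both columns assumed to be integers.
rows = np.array([(1, 2), (3, 4)], dtype=[("f0", "i8"), ("f1", "i8")])

# The next input row has a float in the second column: widen and carry on.
rows = promote_field(rows, "f1", "f8")
```

Nothing gets re-parsed this way; the only cost is one copy of the rows already read.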
...
or even first implement it as a list or dict of arrays, that could be individually
changed and only create a record array from that at the end.
I think that's a method that the OP is specifically trying to avoid -- a 
list of arrays uses substantially more storage than an array, though 
less than a list of lists. In fact, if each row is long, the list 
overhead would be small.
...
The required copying and extra memory use would definitely pale compared
to the text parsing or the current memory usage for the input list.
That's what I expected -- the OP's timing seems to indicate otherwise, 
but I'm still skeptical as to what has been timed.
...
In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing
the text for comment lines adds ca. 10% time, while any of the array allocation
and copying operations should at most be at the 1% level.
much more what I'd expect.
...
I had experimented a bit with the fromiter function, which also increases
the output array as needed, and this creates negligible overhead compared
to parsing the text input (it is implemented in C, though, I don't know how
the .resize() calls would compare to that;
it's probably using pretty much the same code as .resize() internally 
anyway.
...
and unfortunately it's for 1D-arrays only).
That's not bad for this use -- make a row a struct dtype, and you've got 
a 1-d array anyway -- you can optionally convert to a 2-d array after 
the fact.

I don't know why I didn't think of using fromiter() when I built 
accumulator.  Though what I did is a bit more flexible -- you can add 
stuff later on, too, you don't need to do it all at once.
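A small sketch of the struct-dtype trick with fromiter(), assuming whitespace-separated, all-float rows. rows_from_text is a hypothetical name, and structured_to_unstructured needs a reasonably recent NumPy.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured


def rows_from_text(lines, ncols):
    """Build a 1-d structured array with fromiter(), plus a 2-d float view of it."""
    # One float64 field per column, so each row is a single struct element.
    row_dtype = np.dtype([("f%d" % i, np.float64) for i in range(ncols)])

    def parsed():
        for line in lines:
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tuple(float(tok) for tok in tokens)

    # fromiter() grows the 1-d output as needed while consuming the iterator.
    records = np.fromiter(parsed(), dtype=row_dtype)

    # Optional: convert to a plain (nrows, ncols) array after the fact.
    table = structured_to_unstructured(records)
    return records, table
```

With mixed column types you would keep the structured array and skip the 2-d conversion.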

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov