[Numpy-discussion] load from text files Pull Request Review

Christopher Jordan-Squire cjordan1 at uw.edu
Fri Sep 2 17:45:17 EDT 2011


On Fri, Sep 2, 2011 at 3:54 PM, Chris.Barker <Chris.Barker at noaa.gov> wrote:
> On 9/2/11 9:16 AM, Christopher Jordan-Squire wrote:
>>>> I agree it would make a very nice addition, and could complement my
>>>> pre-allocation option for loadtxt - however there I've also been made
>>>> aware that this approach breaks streamed input etc., so the buffer.resize(…)
>>>> methods in accumulator would be the better way to go.
>>>
>> I'll read more about this soon. I haven't thought about it, and I
>> didn't realize it was breaking anything.
>
> you could call it a missing feature, rather than breaking...
>
>>> hmmm -- it seems you could just as well be building the array as you go,
>>> and if you hit a change in the input, re-set and start again.
>>>
>>
>> I hadn't thought of that. Interesting idea. I'm surprised that
>> completely resetting the array could be faster.
>
> releasing memory and re-allocating doesn't take long at all.
>
>>> In my tests, I'm pretty sure that the time spent on file I/O and string
>>> parsing swamps the time it takes to allocate memory and set the values.
>>
>> In my tests, at least for a medium-sized csv file (about 3000 rows by
>> 30 columns), about 10% of the time was spent determining the types in the
>> first read-through and 90% of the time was spent sticking the data into
>> the array.
>
> I don't know how that can even be possible:
>
> Don't you have to load and parse the entire file to determine the data
> types?
>
> Once you've allocated, then all you are doing is setting a value in the
> array -- that has got to be fast.
>

It doesn't have to parse the entire file to determine the dtypes. It
builds up a regular expression describing what it expects each line to
look like, in terms of the dtypes inferred so far. Then it just loops
over the lines, only re-parsing a line if the regular expression
doesn't match it. A regex match is fast, but a regex failure is
expensive. The failures should be fairly rare, though, and are
generally simple to catch.
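
The fast-path/slow-path split looks roughly like this (an untested
sketch with made-up column patterns and promotion rule, not the actual
loadtable code):

import re

# Hypothetical sketch of the "match first, only re-parse on failure" idea;
# the column patterns and promotion rule are made up, not loadtable's code.
INT = r'[+-]?\d+'
FLOAT = r'[+-]?\d*\.?\d+(?:[eE][+-]?\d+)?'

def row_regex(col_patterns):
    return re.compile(r'\s*,\s*'.join(col_patterns) + r'\s*$')

def scan(lines, col_patterns):
    pattern = row_regex(col_patterns)
    for line in lines:
        if pattern.match(line):
            continue  # fast path: the line still fits the inferred dtypes
        # slow path (rare): some field no longer matches, e.g. an int
        # column saw '5.5'; crudely widen all int columns and rebuild
        # (the real code would also have to recast the rows already read)
        col_patterns = [FLOAT if p == INT else p for p in col_patterns]
        pattern = row_regex(col_patterns)
    return col_patterns

scan(["1, 2, 3", "4, 5.5, 6"], [INT, INT, INT])  # promotes the int columns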

It was more expensive to keep track of the sizes for each line, as the
loadtable docstring describes. I couldn't find one solution that covers
all cases, so there's a combination of options that lets the user pick
what works best for their data.
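
For string columns, if I've described it right, the per-line
bookkeeping amounts to something like this (a hypothetical sketch, not
the actual update_sizes code):

import csv

# Hypothetical sketch of per-line size tracking for string columns:
# every row has to be split and measured so the final string dtype
# (e.g. 'S12') can be made wide enough.
def track_sizes(path, str_columns):
    sizes = {col: 0 for col in str_columns}
    with open(path, newline='') as f:
        for row in csv.reader(f):
            for col in str_columns:
                sizes[col] = max(sizes[col], len(row[col]))
    return sizes  # e.g. {2: 7, 5: 19} -> dtypes 'S7' and 'S19'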

Setting array elements is not as fast for masked record arrays. You
must set an entire row at a time, so I have to build up each row as a
list, convert it to a tuple, and then stuff it into the array. And it's
even slower for record arrays with missing data, because I must branch
between adding missing data and adding real data. Might that be why
performance is slower than you'd expect?
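
For concreteness, the per-row work has roughly this shape (a made-up
sketch; the column names, converters, and missing-value handling are
just illustrative, not loadtable's actual code):

import numpy as np
import numpy.ma as ma

# Sketch of row-at-a-time assignment into a masked record array.
dtype = [('date', 'U10'), ('price', '<f8'), ('volume', '<i8')]
converters = [str, lambda s: float(s.replace(',', '.')), int]

raw_rows = [('2011-08-01', '3,25', '100'),
            ('2011-08-02', '', '120'),       # missing price
            ('2011-08-03', '3,40', '')]      # missing volume

data = ma.masked_all(len(raw_rows), dtype=dtype)
fill = ('', 0.0, 0)  # placeholders for masked entries

for i, fields in enumerate(raw_rows):
    row, mask = [], []
    for value, conv, placeholder in zip(fields, converters, fill):
        if value == '':
            row.append(placeholder)   # branch for missing data
            mask.append(True)
        else:
            row.append(conv(value))   # branch for real data
            mask.append(False)
    # the whole row has to go in at once, as a tuple
    data[i] = tuple(row)
    data.mask[i] = tuple(mask)

print(data)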

> Also, the second time around, you may be taking advantage of disk cache,
> so that should be faster for that reason.
>
> Even so -- you may be able to save much of that 10%.

I don't understand your meaning.

>
>> However, that particular test took more time for reading in because
>> the data was quoted (so converting '"3,25"' to a float took between
>> 1.5x and 2x as long as '3.25') and the datetime conversion is costly.
>
> Didn't you have to do all that on the first pass as well? Or are you
> only checking for gross info -- length of rows, etc?
>
>> Regardless, that suggests making the data loading faster is more
>> important than avoiding reading through the file twice. I guess that
>> intuition probably breaks if the data doesn't fit into memory,
>> though.
>
> if the data don't fit into memory, then you need to go to memmapped
> arrays or something -- a whole new ball of wax.
>
>>> There is a chance, of course, that you might have to re-wind and start
>>> over more than once, but I suspect that that is the rare case.
>>>
>> Perhaps. I know that in the 'really annoying dataset that loading
>> quickly and easily should be your use case' example I was given, about
>> half-way through the data one of the columns got its first
>> observation.
>
> OK -- but isn't that just one re-wind?
>

Sure, if it only happens for one column. But suppose your data is a
bunch of time series, one per column, each starting at a different
date. You'd have a restart for each column. That point is pedantic,
though, since the number of columns should in any case be far smaller
than the number of rows.

I wonder if there are any really important cases where you'd actually
lose something by simply recasting an entry to another dtype, as Derek
suggested. Converting the data read so far in place would avoid having
to go back to the start.
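
If I follow Derek's suggestion, it's essentially this (a minimal
sketch using astype; the field names just follow his example quoted
below):

import numpy as np

# Sketch of the recast idea: rather than rewinding the file, copy the
# rows read so far into an array with the widened dtype and keep going.
partial = np.array([(1, 2.0), (3, 4.5)],
                   dtype=[('f0', '<i8'), ('f1', '<f8')])

# say a float shows up in column 'f0' halfway through the file:
widened = partial.astype([('f0', '<f8'), ('f1', '<f8')])
# widened is [(1., 2.) (3., 4.5)]; continue filling from where we left off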

> On 9/2/11 9:17 AM, Derek Homeier wrote:
>
>>> There is a chance, of course, that you might have to re-wind and start
>>> over more than once, but I suspect that that is the rare case.
>>>
>> I still haven't studied your class in detail, but one could probably actually
>> just create a copy of the array read in so far, e.g. changing it from a
>> dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')]  as required -
>
> good point -- that would be a better way to do it, and only a tiny bit
> harder.
>
>> or even first implement it as a list or dict of arrays, that could be individually
>> changed and only create a record array from that at the end.
>
> I think that's a method that the OP is specifically trying to avoid -- a
> list of arrays uses substantially more storage than an array, though
> less than a list of lists. In fact, if each row is long, the list
> overhead would be small.
>
>> The required copying and extra memory use would definitely pale compared
>> to the text parsing or the current memory usage for the input list.
>
> That's what I expected -- the OP's timing seems to indicate otherwise,
> but I'm still skeptical as to what has been timed.
>

So here's some of the important output from prun (in ipython) on the
following call:

np.loadtable('biggun.csv', quoted=True, comma_decimals=True,
NA_re=r'#N/A N/A|', date_re='\d{1,2}/\d{1,2}/\d{2}',
date_strp='%m/%d/%y', header=True)

where biggun.csv was the 2857 by 25 csv file with datetimes and quoted
data I'd mentioned earlier. (It's proprietary data, so I can't share
the csv file itself.)

----------------------------------------------------------------------------------------------------------
         377584 function calls (377532 primitive calls) in 2.242 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.540    0.540    1.866    1.866 loadtable.py:859(get_data_missing)
    91615    0.460    0.000    0.460    0.000 {built-in method match}
    58835    0.286    0.000    0.405    0.000 loadtable.py:301(comma_float_conv)
   126267    0.270    0.000    0.270    0.000 {method 'replace' of 'str' objects}
     2857    0.125    0.000    0.151    0.000 core.py:2975(__setitem__)
     2857    0.113    0.000    0.295    0.000 _strptime.py:295(_strptime)
     2862    0.071    0.000    0.071    0.000 {numpy.core.multiarray.array}
     2857    0.053    0.000    0.200    0.000 loadtable.py:1304(update_sizes)
     2857    0.039    0.000    0.066    0.000 locale.py:316(normalize)
        1    0.028    0.028    0.373    0.373 loadtable.py:1165(get_nrows_sizes_coltypes)
     2857    0.024    0.000    0.319    0.000 {built-in method strptime}
     5796    0.021    0.000    0.021    0.000 {built-in method groups}
     2857    0.018    0.000    0.342    0.000 loadtable.py:784(<lambda>)
     2857    0.017    0.000    0.102    0.000 locale.py:481(getlocale)
     8637    0.016    0.000    0.016    0.000 {method 'get' of 'dict' objects}
     2857    0.016    0.000    0.016    0.000 {map}
     8631    0.015    0.000    0.015    0.000 {len}
---------------------------------------------------------------------------------------------------------------------

It goes on, but those seem to be the important calls. So I wasn't
quite right on the 90-10 split, but 99.9% of the time is in two
methods: getting the data (get_data_missing) and determining the
sizes and dtypes (get_nrows_sizes_coltypes). Between those two the
split is about 17-83: roughly 17% for the sizes/dtypes pass and 83%
for getting the data.


>> In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing
>> the text for comment lines adds ca. 10% time, while any of the array allocation
>> and copying operations should at most be at the 1% level.
>
> much more what I'd expect.
>
>> I had experimented a bit with the fromiter function, which also increases
>> the output array as needed, and this creates negligible overhead compared
>> to parsing the text input (it is implemented in C, though, I don't know how
>> the .resize() calls would compare to that;
>
> it's probably using pretty much the same code as .resize() internally
> anyway.
>
>>and unfortunately it's for 1D-arrays only).
>
> That's not bad for this use -- make a row a struct dtype, and you've got
> a 1-d array anyway -- you can optionally convert to a 2-d array after
> the fact.
>
> I don't know why I didn't think of using fromiter() when I built
> accumulator.  Though what I did is a bit more flexible -- you can add
> stuff later on, too; you don't need to do it all at once.
>

I'm unsure how to use fromiter for missing data. It sounds like a
potential solution when no data is missing, though.
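
For the no-missing-data case, I'm picturing something like this (just
a sketch, assuming fromiter accepts a struct dtype the way Chris
suggests):

import numpy as np

# Sketch: build a 1-D structured array straight from a parsing generator.
# Only workable when nothing is missing, since every yielded tuple has to
# convert cleanly to the struct dtype.
dtype = np.dtype([('f0', '<i8'), ('f1', '<f8')])

def parse(lines):
    for line in lines:
        a, b = line.split(',')
        yield int(a), float(b)

arr = np.fromiter(parse(["1,2.5", "3,4.0", "5,6.25"]), dtype=dtype)
# arr -> [(1, 2.5) (3, 4.) (5, 6.25)] with the struct dtype above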

-Chris

> -Chris
>
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>


