[Numpy-discussion] load from text files Pull Request Review

Thu Sep 8 16:43:43 EDT 2011

On Wed, Sep 7, 2011 at 2:52 PM, Chris.Barker <Chris.Barker at noaa.gov> wrote:
> On 9/2/11 2:45 PM, Christopher Jordan-Squire wrote:
>> It doesn't have to parse the entire file to determine the dtypes. It
>> builds up a regular expression for what it expects to see, in terms of
>> dtypes. Then it just loops over the lines, only parsing if the regular
>> expression doesn't match. It seems that a regex match is fast, but a
>> regex fail is expensive.
>
> interesting -- I wouldn't have expected a regex to be faster than simple
> parsing, but that's why you profile!
>
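
For anyone following along, the regex fast path described above can be
sketched roughly as follows. This is only an illustration and not the code
in the pull request: the comma delimiter, the float/int/float column
layout, and the names FLOAT, INT, ROW and parse_lines are all assumptions
made up for the example.

```python
import re

# Hypothetical patterns for one float column and one int column.
FLOAT = r'[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?'
INT = r'[-+]?\d+'

# One regex for a whole row with the expected dtypes: float, int, float.
ROW = re.compile(r'\s*(%s)\s*,\s*(%s)\s*,\s*(%s)\s*$' % (FLOAT, INT, FLOAT))


def parse_lines(lines):
    """Convert rows that match the expected dtypes; collect the rest.

    Returns (rows, slow), where slow holds the indices of lines that did
    not match and would need the expensive dtype re-detection.
    """
    rows, slow = [], []
    for i, line in enumerate(lines):
        m = ROW.match(line)
        if m is not None:
            # Fast path: the line has the expected shape, convert directly.
            x, y, z = m.groups()
            rows.append((float(x), int(y), float(z)))
        else:
            # Slow path: remember the line so it can be re-examined.
            slow.append(i)
    return rows, slow


rows, slow = parse_lines(["1.5,2,3.0", "4,5,6e2", "7.0,oops,9"])
```

Here the first two lines take the fast path and the third is flagged for
the slow path, which is where the cost of a failed match shows up.
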
>> Setting array elements is not as fast for the masked record arrays.
>> You must set entire rows at a time, so I have to build up each row as
>> a list, and convert to a tuple, and then stuff it in the array.
>
> hmmm -- that is a lot -- I was thinking of simple "set a value in an
> array". I've also done a bunch of this in C, where it's really fast.
>
> However, rather than:
>
>   build a row as a list
>   build a row as a tuple
>   stuff into array
>
> could you create an empty array scalar, and fill that, then put that in
> your array:
>
> In [4]: dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
>
> In [5]: dt
> Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
>
> In [6]: temp = np.empty((), dtype=dt)
>
> In [9]: temp['x'] = 3
>
> In [10]: temp['y'] = 4
>
> In [11]: temp['z'] = 5
>
> In [13]: a = np.zeros((4,), dtype = dt)
>
> In [14]: a[0] = temp
>
> In [15]: a
> Out[15]:
> array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)],
>       dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
>
>
> (and you could pass the array scalar into accumulator as well)
>
> maybe it wouldn't be any faster, but with re-using temp, and one less
> list-tuple conversion, and fewer python type to numpy type conversions,
> maybe it would.
>

I just ran a quick test on my machine of this idea. With

import numpy as np

dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])
temp = np.empty((), dtype=dt)    # 0-d structured array, reused for each row
temp2 = np.zeros(1, dtype=dt)    # destination array

In [96]: def f():
    ...:     l=[0]*3
    ...:     l[0] = 2.54
    ...:     l[1] = 4
    ...:     l[2] = 2.3645
    ...:     j = tuple(l)
    ...:     temp2[0] = j

vs

In [97]: def g():
    ...:     temp['x'] = 2.54
    ...:     temp['y'] = 4
    ...:     temp['z'] = 2.3645
    ...:     temp2[0] = temp
    ...:

The timing results were 2.73 us per call for f and 3.43 us per call for g.
So good idea, but it doesn't appear to be faster. (Though the difference
wasn't nearly as dramatic as I thought it would be, based on Pauli's
comment.)
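
In case anyone wants to repeat the comparison outside of IPython, here is
a self-contained sketch using the standard timeit module. The loop count n
and the best-of-3 reporting are my own choices for the script, and the
numbers will of course vary with machine and numpy version.

```python
import timeit

import numpy as np

dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])
temp = np.empty((), dtype=dt)    # reusable 0-d structured array
temp2 = np.zeros(1, dtype=dt)    # destination array


def f():
    # Build the row as a list, convert to a tuple, assign the tuple.
    l = [0] * 3
    l[0] = 2.54
    l[1] = 4
    l[2] = 2.3645
    temp2[0] = tuple(l)


def g():
    # Fill the reusable 0-d array field by field, then assign it.
    temp['x'] = 2.54
    temp['y'] = 4
    temp['z'] = 2.3645
    temp2[0] = temp


n = 10000  # calls per timing run (arbitrary)
for func in (f, g):
    best = min(timeit.repeat(func, number=n, repeat=3))
    print("%s: %.2f us per call" % (func.__name__, best / n * 1e6))
```
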

-Chris JS

>> it's even slower for the record arrays with missing data because I
>> must branch between adding missing data versus adding real data. Might
>> that be the reason for the slower performance than you'd expect?
>
> could be -- I haven't thought about the missing data part much.
>
>> I wonder if there are any really important cases where you'd actually
>> lose something by simply recasting an entry to another dtype, as Derek
>> suggested.
>
> In general, it won't be a simple re-cast -- it will be a copy to a
> subset -- which may be hard to code, but would save having to
> re-parse the data.
>
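
To make the copy idea concrete, here is a minimal sketch of widening one
field of an already-parsed structured array without going back to the
text. It is only my reading of the suggestion: the two-field dtype, the
int-to-float promotion of 'y', and the names parsed and widened are all
hypothetical.

```python
import numpy as np

# Rows parsed so far, under the dtype guessed from the early lines.
old_dt = np.dtype([('x', np.float32), ('y', np.int32)])
parsed = np.zeros(4, dtype=old_dt)
parsed['x'] = [0.5, 1.5, 2.5, 3.5]
parsed['y'] = [1, 2, 3, 4]

# A later line shows that 'y' really needs to be a float column.
new_dt = np.dtype([('x', np.float32), ('y', np.float64)])
widened = np.empty(parsed.shape, dtype=new_dt)

# Copy field by field; numpy casts each column to its new type,
# so the text that was already parsed is not parsed again.
for name in old_dt.names:
    widened[name] = parsed[name]

# The widened column can now hold the value that forced the change.
widened['y'][3] = 2.5
```

The cost is one extra pass over the data in memory for each promotion,
which should still be far cheaper than re-reading the file.
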
>
> Anyway, you know the issues, this is good stuff either way.
>
> -Chris
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov