[Numpy-discussion] String manipulation summary

Mon Jul 27 19:47:28 EDT 2009

What machine spec are you using?

Using your last function, line2array5, with float conversion, I get the
following timings on a mobile quad-core Extreme:

In [24]: a = np.arange(100).astype(str).tostring()

In [25]: a
Out[25]: '0123456789111111111122222222223333333333444444444455555555556666666666777777777788888888889999999999'

In [26]: %timeit line2array(a, 1)
10000 loops, best of 3: 37.1 µs per loop

In [27]: a = np.arange(1000).astype(str).tostring()

In [28]: %timeit line2array(a, 10)
10000 loops, best of 3: 45.2 µs per loop

Cheers,

Chris

On Mon, Jul 27, 2009 at 7:29 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> Hi all,
>
> When I first saw this problem (reading in a fixed-width text file as
> numbers), it struck me that you really should be able to do it, and do it
> well, with numpy by slicing character arrays.
>
> I got carried away and worked out a number of ways to do it. The last was a
> method inspired by a recent thread, "String to integer array of ASCII
> values", which did indeed turn out to be the fastest way. Here's what I have:
>
> # my naive first attempt:
> def line2array0(line, field_len):
>    nums = []
>    i = 0
>    while i < len(line):
>        nums.append(float(line[i:i+field_len]))
>        i += field_len
>    return np.array(nums)
>
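> # list comprehension (float() applied inside the comprehension, and //
> # for the field count, so it behaves the same on Python 2 and 3)
> def line2array1(line, field_len):
>    return np.array([float(line[i*field_len:(i+1)*field_len])
>                     for i in range(len(line)//field_len)])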
>
> # convert to a tuple, then to an 'S1' array -- no real reason to do
> # this, as I figured out the next way.
> def line2array2(line, field_len):
>    return np.array(tuple(line),
>                    dtype='S1').view(dtype='S%i'%field_len).astype(np.float)
>
> # convert directly to a string array, then break into fields.
> def line2array3(line, field_len):
>    return np.array((line,)).view(dtype='S%i'%field_len).astype(np.float)
>
> # use dtype='c' instead of 'S1' -- better.
> def line2array4(line, field_len):
>    return np.array(line,
>                    dtype='c').view(dtype='S%i'%field_len).astype(np.float)
>
> # and the winner is: use fromstring to go straight to a 'c' array:
> def line2array5(line, field_len):
>    return np.fromstring(line,
>                         dtype='c').view(dtype='S%i'%field_len).astype(np.float)
>
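For anyone trying the fromstring approach on Python 3 with a current NumPy, here is a minimal sketch of the same idea. It is an adaptation, not the code attached to the original message: the name `line2array_frombuffer` and the length check are assumptions added for illustration. It uses `np.frombuffer` on ASCII-encoded bytes because the binary mode of `np.fromstring` is deprecated, and the builtin `float` because the `np.float` alias was removed in NumPy 1.24.

```python
import numpy as np

def line2array_frombuffer(line, field_len):
    """Parse a fixed-width line of numbers into a float array."""
    data = line.encode("ascii")  # frombuffer needs bytes, not str
    if len(data) % field_len != 0:
        raise ValueError("line length is not a multiple of field_len")
    # Interpret the raw bytes as fixed-width byte strings, one per field,
    # then let NumPy convert each field to a float.
    fields = np.frombuffer(data, dtype="S%d" % field_len)
    return fields.astype(float)

# Three fields, four characters each: "1.50", "2.25", "3.00"
result = line2array_frombuffer("1.502.253.00", 4)
print(result)
```

Reading the bytes directly with an `S<field_len>` dtype does the slicing in one step, so the intermediate 'c' array and the `.view()` call are not needed in this variant.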
> Here are some timings:
>
> Timing with a 10-number string:
> List comp: 36.8073430061
> convert to tuple: 57.9741871357
> auto convert: 43.4103589058
> char type: 46.0047719479
> fromstring: 23.998103857
> without float conversion: 11.4827179909
>
> So list comprehension is pretty fast, but using fromstring and then slicing
> is much better. The last one is the same thing but without the conversion
> from strings to floats, showing that the conversion is a big chunk of the
> time no matter how you slice it.
>
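The with/without float conversion comparison can be reproduced with something like the following. This is a rough sketch, not the attached timing code: the helper names `to_fields` and `to_floats`, the 8-character field width, and the loop count are all assumptions chosen for illustration, and the numbers it prints will depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def to_fields(line, field_len):
    # Slice the line into fixed-width byte strings; no numeric parsing.
    return np.frombuffer(line.encode("ascii"), dtype="S%d" % field_len)

def to_floats(line, field_len):
    # Same slicing, followed by the string-to-float conversion.
    return to_fields(line, field_len).astype(float)

# 100 zero-padded fields of 8 characters each, e.g. "0042.000"
line = "".join("%08.3f" % x for x in range(100))

n = 1000
t_fields = timeit.timeit(lambda: to_fields(line, 8), number=n)
t_floats = timeit.timeit(lambda: to_floats(line, 8), number=n)
print("slicing only:          %.4f s for %d calls" % (t_fields, n))
print("with float conversion: %.4f s for %d calls" % (t_floats, n))
```

Timing the two helpers separately isolates the cost of parsing strings into floats from the cost of splitting the line into fields.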
> Timing with a 100-number string:
> List comp: 163.281736135
> convert to tuple: 333.081432104
> auto convert: 138.934411049
> char type: 279.897207975
> fromstring: 121.395509005
> without float conversion: 12.8342208862
>
>
> Interesting: I thought a longer string would give a greater advantage to
> the fromstring approach, but I was wrong. Now the time to parse strings
> into floats is really washing everything else out, so it doesn't matter
> much how you do it, though I'd go with either list comprehension (which is
> what I think is used in np.genfromtxt) or the fromstring method, which I
> kind of like 'cause it's numpy.
>
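Since np.genfromtxt comes up here, it may be worth noting that it can read fixed-width fields directly: passing an integer (or a sequence of integers) as `delimiter` is treated as the field width(s). A small sketch follows, with made-up sample text; whether it is fast enough for a given file is a separate question that this example does not address.

```python
from io import StringIO
import numpy as np

# Three lines, each holding three fields of width 3.
text = "  1  2  3\n  4  5 67\n890123  4"

# An integer delimiter means "every field is this many characters wide".
arr = np.genfromtxt(StringIO(text), delimiter=3)
print(arr)
```

This handles a whole file at once rather than one line at a time, which can be more convenient than calling a per-line function in a loop.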
> Test and timing code attached.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov