[Numpy-discussion] String manipulation summary

Christopher Barker Chris.Barker at noaa.gov
Mon Jul 27 19:29:16 EDT 2009


Hi all,

When I first saw this problem (reading in a fixed-width text file as 
numbers), it struck me that you really should be able to do it, and do it 
well, with numpy by slicing character arrays.

I got carried away and worked out a number of ways to do it. The last was 
a method inspired by a recent thread, "String to integer array of ASCII 
values", and it did indeed turn out to be the fastest. Here's what I have:

import numpy as np

# my naive first attempt:
def line2array0(line, field_len):
     nums = []
     i = 0
     while i < len(line):
         nums.append(float(line[i:i+field_len]))
         i += field_len
     return np.array(nums)

# list comprehension
def line2array1(line, field_len):
     return np.array([float(line[i*field_len:(i+1)*field_len])
                      for i in range(len(line)//field_len)])

# convert to a tuple, then to an 'S1' array -- no real reason to do
# this, as I figured out the next way.
def line2array2(line, field_len):
     chars = np.array(tuple(line), dtype='S1')
     return chars.view(dtype='S%i' % field_len).astype(float)

# convert directly to a string array, then break into fields.
def line2array3(line, field_len):
     return np.array((line,)).view(dtype='S%i' % field_len).astype(float)

# use dtype-'c' instead of 'S1' -- better.
def line2array4(line, field_len):
     chars = np.array(line, dtype='c')
     return chars.view(dtype='S%i' % field_len).astype(float)

# and the winner is: use fromstring to go straight to a 'c' array:
def line2array5(line, field_len):
     chars = np.fromstring(line, dtype='c')
     return chars.view(dtype='S%i' % field_len).astype(float)
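
For anyone reading this on Python 3 with a recent NumPy, where the np.float alias is gone and the binary mode of np.fromstring has been deprecated since NumPy 1.14, the same idea can be sketched with np.frombuffer. This is an adaptation under stated assumptions, not the code that was timed in this post: the name line2array_frombuffer is hypothetical, the line is assumed to be pure ASCII, and its length is assumed to be an exact multiple of field_len.

```python
import numpy as np

def line2array_frombuffer(line, field_len):
    # Hypothetical modern variant of line2array5 (not the code timed in the post).
    # Assumes `line` is ASCII and len(line) is a multiple of field_len.
    raw = line.encode("ascii")  # str -> bytes, one byte per character
    # Reinterpret the bytes as an array of fixed-width byte-string fields.
    fields = np.frombuffer(raw, dtype="S%d" % field_len)
    # Parse every field as a float; this returns a new, writable array.
    return fields.astype(float)

line = "   1.5   2.0 -3.25"  # three fields, six characters each
print(line2array_frombuffer(line, 6))
```

Going straight to the 'S%d' dtype skips the intermediate one-character array and the view() call, since frombuffer can interpret the buffer as fixed-width fields directly.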

Here are some timings:

Timing with a 10 number string:
List comp: 36.8073430061
convert to tuple: 57.9741871357
auto convert: 43.4103589058
char type: 46.0047719479
fromstring: 23.998103857
without float conversion: 11.4827179909

So list comprehension is pretty fast, but using fromstring and then 
slicing is much better, taking about a third less time (24.0 vs. 36.8). 
The last entry is the same fromstring approach without the conversion 
from strings to floats, showing that the conversion is a big chunk of 
the time no matter how you slice it.
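
The attached test.py did not survive in the archive, so the following is not the original timing code. It is only a minimal sketch of how such a comparison could be set up with the standard timeit module on Python 3. All function names, the 8-character field width, and the repeat count are assumptions made for illustration, and the numbers it prints will differ from the ones quoted in this post.

```python
import timeit
import numpy as np

def list_comp(line, field_len):
    # Slice the line in pure Python and convert each piece with float().
    return np.array([float(line[i:i + field_len])
                     for i in range(0, len(line), field_len)])

def split_only(line, field_len):
    # Split into fixed-width fields but skip the float conversion.
    return np.frombuffer(line.encode("ascii"), dtype="S%d" % field_len)

def split_and_parse(line, field_len):
    # Same split, followed by the string-to-float conversion.
    return split_only(line, field_len).astype(float)

def time_all(n_fields, field_len=8, repeats=1000):
    # Build a test line of n_fields right-justified numbers.
    line = "".join("%*.3f" % (field_len, i) for i in range(n_fields))
    results = {}
    for func in (list_comp, split_only, split_and_parse):
        # Total seconds for `repeats` calls of this function.
        results[func.__name__] = timeit.timeit(
            lambda: func(line, field_len), number=repeats)
    return results

for n in (10, 100):
    print("Timing with a %d number string:" % n)
    for name, seconds in time_all(n).items():
        print("  %s: %g" % (name, seconds))
```

Comparing split_only with split_and_parse isolates the cost of the float conversion, which is the comparison the "without float conversion" row makes.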

Timing with a 100 number string:
List comp: 163.281736135
convert to tuple: 333.081432104
auto convert: 138.934411049
char type: 279.897207975
fromstring: 121.395509005
without float conversion: 12.8342208862


Interesting. I thought a longer string would give a greater advantage to 
the fromstring approach, but I was wrong: now the time to parse strings 
into floats washes everything else out, so it doesn't matter much how 
you do it. I'd go with either list comprehension (which is what I think 
is used in np.genfromtxt) or the fromstring method, which I kind of like 
'cause it's numpy.
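
If the goal is simply to read a whole fixed-width file, np.genfromtxt can also do the splitting itself when delimiter is given as an integer field width. A small sketch, with made-up sample data and a field width of 5 chosen only for illustration:

```python
import io
import numpy as np

# Two lines of three numbers, each in a 5-character field (sample data).
data = "  1.0  2.0  3.0\n  4.0  5.0  6.0\n"

# An integer delimiter is interpreted as a fixed field width.
arr = np.genfromtxt(io.StringIO(data), delimiter=5)
print(arr)
```

This handles one row per line, so the result is a 2-D array, whereas the line2array functions each convert a single line.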

Test and timing code attached.

-Chris







-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: application/x-python
Size: 3584 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090727/ab86dd69/attachment.bin>

