speeding up string.split()

Fri May 25 08:01:14 EDT 2001

I've lost the original message, so I'm replying through Duncan's message.

At 09:44 25/05/01 +0000, Duncan Booth wrote:
>Chris Green <cmg at uab.edu> wrote in
>news:m2n182cs9c.fsf at phosphorus.tucc.uab.edu:
> > #!/usr/bin/python
> > from string import split
> >
> > for i in range(300000):
> >     array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +
> >                   '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
>Speedups to the above code:
>1. The variable array is not used after it is assigned, and the assignment
>is constant. factor the assignment out of the loop.
>2. After 1, the loop is empty, remove the loop.
>1 and 2 together provide a massive speed improvement with no loss of
>functionality to the code as given.

It seems to me that this is an "artificial" loop, written only to make the 
timing measurable. The loop itself is not what slows things down; what 
matters is the time per iteration.
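
To make "time per iteration" concrete, here is a minimal sketch using the standard timeit module. The names LINE, split_line and per_call_seconds are mine, not from the original post, and the sample line is the one from the regular expression example later in this message (with a space between "6" and "1064").

```python
import timeit

# Sample line; assumed to be representative of the real input.
LINE = ('000.000.000.000 000.000.000.000 6 1064 80  54 54 1 1 '
        '14:00:00.8094 14:00:00.8908 1 2')

def split_line():
    # The operation being measured: one split of one line.
    return LINE.split()

def per_call_seconds(func, number=100000):
    """Return the average time of one call to func, in seconds."""
    total = timeit.timeit(func, number=number)
    return total / number

if __name__ == '__main__':
    print('%.3g seconds per split' % per_call_seconds(split_line))
```

The absolute number depends on the machine and the Python version, so it is only useful for comparing variants of the same code on the same machine.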

>Alternatively:
>3. Put the code inside a function.

This one deserves some explanation. Python applies some small optimizations 
inside functions when accessing local variables. However, it seems that 
this is not the issue here.

>4. Use the split method on the string instead of the split function

Hummm. Again the same; it does not seem to make much difference.

>5. Use string concatenation instead of '+'
>3, 4 and 5 together knock about 25% off the running time.

I think that (5) alone causes most of the difference, because + is 
evaluated at runtime, while implicit string concatenation is evaluated at 
compile time. Just write the two parts of your string on consecutive lines, 
without the concatenation operator. It is *much* faster. The way it is 
written now, every iteration performs the following operations:

a) create a new string object for the first part of the string;
b) create a new string object for the second part of the string;
c) call the string concatenation operator, which returns a third string 
object. This involves memory copies that are repeated on every iteration.
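
A small sketch of the two forms, for illustration only. The names IMPLICIT, HEAD, TAIL and joined_at_runtime are made up for this example. One caveat: some Python versions may also fold a + between two literals at compile time, so the sketch passes the two halves as variables to make sure the concatenation really happens at runtime.

```python
# Adjacent string literals: the compiler joins them into one constant,
# so no concatenation happens while the program runs.
IMPLICIT = ('000.000.000.000 000.000.000.000 6 '
            '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')

HEAD = '000.000.000.000 000.000.000.000 6 '
TAIL = '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'

def joined_at_runtime(head, tail):
    # The operands are not literals here, so '+' has to build a new
    # string object (and copy both halves) on every call.
    return head + tail

if __name__ == '__main__':
    # Both forms produce the same text; only the cost differs.
    print(joined_at_runtime(HEAD, TAIL) == IMPLICIT)
```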

>6. If whatever you intend to do with the data involves filtering it on the
>first field or two, then using "xxx...".split(' ', 1) is very much faster
>than splitting up all the fields. This can reduce the time by two thirds
>easily.

Good hint. This also makes a difference: don't do all the work if you 
really need only part of it.
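
As a rough illustration of Duncan's point, the sketch below splits off only the first field and leaves the rest of the line untouched. The helper names first_field and is_from are hypothetical, and filtering on the source address is just an assumed use case.

```python
LINE = ('000.000.000.000 000.000.000.000 6 1064 80  54 54 1 1 '
        '14:00:00.8094 14:00:00.8908 1 2')

def first_field(line):
    # maxsplit=1: stop after the first space, so the remainder of the
    # line is never broken into fields.
    return line.split(' ', 1)[0]

def is_from(line, address):
    """Return True if the first field of line equals address."""
    return first_field(line) == address

if __name__ == '__main__':
    print(first_field(LINE))
    print(is_from(LINE, '000.000.000.000'))
```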

>7. Use Perl, or C, or whatever else takes your fancy if speed is that
>critical.

You could also try to use the re module. This has *several* advantages. The 
code is highly optimized and is Unicode aware. You can in a single step 
*both* break the string and check if the parts are valid, so anything 
invalid is automatically detected. Something like this should do the trick:

 >>> import re
 >>> r = re.compile(r'(\d+\.\d+\.\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s+' \
...                r'(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+' \
...                r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+' \
...                r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+' \
...                r'(\d+)\s+(\d+)')
 >>> s = '000.000.000.000 000.000.000.000 6 1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'
 >>> re.match(r, s).groups()
('000.000.000.000', '000.000.000.000', '6', '1064', '80', '54', '54', '1', '1', '14:00:00.8094', '14:00:00.8908', '1', '2')
 >>>

Remember that:

- there are several ways to write a regular expression for the same 
input. Some may be faster, some will be easier to read; some will be 
safer (catching more mistakes), and others more forgiving.
- keep the re.compile out of any loop. It takes some time to compile the 
expression.
- if you want to get only part of the line you can make a simpler expression.
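
The last two points can be sketched together as follows. This is only an illustration under my own assumptions: the names ADDRESSES and parse_addresses are invented, and I assume that only the two address fields are needed and that lines which do not match should simply be skipped.

```python
import re

# Compiled once, outside the loop. A simpler expression than the full
# one: it captures only the first two fields and ignores the rest.
ADDRESSES = re.compile(r'(\d+\.\d+\.\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s')

def parse_addresses(lines):
    """Return (source, destination) pairs; skip lines that do not match."""
    pairs = []
    for line in lines:
        m = ADDRESSES.match(line)
        if m is not None:
            pairs.append(m.groups())
    return pairs

if __name__ == '__main__':
    sample = ['000.000.000.000 000.000.000.000 6 1064 80  54 54 1 1 '
              '14:00:00.8094 14:00:00.8908 1 2',
              'this line is not valid']
    print(parse_addresses(sample))
```

Whether an invalid line should be skipped, logged or treated as an error depends on the application; skipping is just the simplest choice for a sketch.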

Carlos Ribeiro