[Python-ideas] str.split with multiple individual split characters

Stefan Behnel stefan_ml at behnel.de
Mon Feb 28 11:57:36 CET 2011


Steven D'Aprano, 28.02.2011 11:23:
> Guido van Rossum wrote:
>> It's so easy to do this using re.split() that it's not worth the added
>> complexity in str.split().
>
> Easy, but slow. If performance is important, it looks to me like re.split
> is the wrong solution. Using Python 3.1:
>
>
>  >>> from re import split
>  >>> def split_str(s, *args): # quick, dirty and inefficient multi-split
>  ...     for a in args[1:]:
>  ...         s = s.replace(a, args[0])
>  ...     return s.split(args[0])
>  ...
>  >>> text = "abc.d-ef_g:h;ijklmn+opqrstu|vw-x_y.z"*1000
>  >>> assert split(r'[.\-_:;+|]', text) == split_str(text, *'.-_:;+|')
>  >>>
>  >>> from timeit import Timer
>  >>> t1 = Timer("split(r'[.\-_:;+|]', text)",
>  ... "from re import split; from __main__ import text")
>  >>> t2 = Timer("split_str(text, *'.-_:;+|')",
>  ... "from __main__ import split_str, text")
>  >>>
>  >>> min(t1.repeat(number=10000, repeat=5))
>  72.31230521202087
>  >>> min(t2.repeat(number=10000, repeat=5))
>  17.375113010406494

You forgot to precompile the regex. Here's what I get with a compiled pattern:

    >>> t1 = Timer("split(text)", "import re; from __main__ import text; \
    ... split=re.compile(r'[.\-_:;+|]').split")
    >>> min(t1.repeat(number=1000, repeat=3))
    3.9842870235443115
    >>> min(t2.repeat(number=1000, repeat=3))
    0.9261999130249023

That's still a factor of about 4, using Py3.2. Does anyone want to try it
with the alternative regex packages?
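
For anyone who wants to rerun the comparison, here is a self-contained
sketch of the two approaches timed above. The names SEPARATORS, regex_split
and bench are made up for this example, and it passes callables to Timer
instead of the setup strings used in the sessions above. It prints no
expected timings, since those depend on the machine and the Python version.

```python
import re
from timeit import Timer

# The separator characters used in the thread's example.
SEPARATORS = ".-_:;+|"

# Compile the character class once, so that the timed call does not
# include the pattern-cache lookup done by the module-level re.split().
regex_split = re.compile("[" + re.escape(SEPARATORS) + "]").split


def split_str(s, *seps):
    """Multi-split via str.replace(): rewrite every separator to the
    first one, then do a single plain str.split()."""
    first = seps[0]
    for sep in seps[1:]:
        s = s.replace(sep, first)
    return s.split(first)


text = "abc.d-ef_g:h;ijklmn+opqrstu|vw-x_y.z" * 1000


def bench(number=50, repeat=3):
    """Return (regex_seconds, replace_seconds), best of `repeat` runs."""
    t_re = Timer(lambda: regex_split(text))
    t_str = Timer(lambda: split_str(text, *SEPARATORS))
    return (min(t_re.repeat(repeat=repeat, number=number)),
            min(t_str.repeat(repeat=repeat, number=number)))


if __name__ == "__main__":
    # Both approaches must agree before the timings mean anything.
    assert regex_split(text) == split_str(text, *SEPARATORS)
    re_time, str_time = bench()
    print("precompiled regex: %.4f s" % re_time)
    print("replace + split:   %.4f s" % str_time)
```

Note that the replace-based version only works for literal separator
strings; anything that needs real pattern matching still requires the regex.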

Stefan



