[Python-ideas] list as parameter for the split function

Steven D'Aprano steve at pearwood.info
Tue Sep 29 05:43:18 CEST 2015

On Tue, Sep 29, 2015 at 12:10:34AM +0200, Niilos wrote:
> Hello everyone,
> I was wondering how to split a string with multiple separators.
> For instance, if I edit some subtitle file and I want the string 
> '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452', 
> '00', '02', '37', '927'] I have to use split too much time and I didn't 
> find a "clean" way to do it.
> I imagined the split function with an iterator as parameter. The string 
> would be split each time its substring is in the iterator.
> Here is the syntax I considered for this :
> >>> '00:02:34,452 --> 00:02:37,927'.split([ ':', ' --> ', ',' ])
> ['00', '02', '34', '452', '00', '02', '37', '927']
> Is it a relevant idea ? What do you think about it ?

Some string methods, such as startswith and endswith, already accept a 
tuple of alternative substrings, e.g.:

py> "spam".startswith(("a", "+", "sp"))
True

and I've often wished that split would be one of them. The substring 
argument could accept a string (as it does now) or a tuple of strings.

There are other solutions, but they have issues:

(1) Writing your own custom string mini-parser and getting it right is 
harder than it sounds. Certainly it's not simple enough to reinvent this 
particular tool each time you want it.

(2) Using replace to change all the substrings to one:

py> text = "aaa,bbb ccc;ddd,eee fff"
py> text.replace(",", " ").replace(";", " ").split()
['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']

works well enough for simple cases, but each replace call scans and 
copies the whole string, so with a lot of text and several separators 
it can be expensive.

(3) Using a regular expression is probably the "right" answer, at least 
from a comp sci theoretical perspective. This is precisely the sort of 
thing that regexes are designed for. Unfortunately, regex syntax is 
itself a programming language[1], and a particularly cryptic and 
unforgiving one, so even quite experienced coders can have trouble.

At first it seems easy:

py> re.split(r";|-|~", "aaa~bbb-ccc;ddd;eee")
['aaa', 'bbb', 'ccc', 'ddd', 'eee']

but then seemingly minor changes make it misbehave:

py> re.split(r";|-|^", "aaa^bbb-ccc^ddd;eee")
['aaa^bbb', 'ccc^ddd', 'eee']

py> re.split(r";|-|.", "aaa.bbb-ccc;ddd;eee")
['', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '']

The solution is to escape the metacharacters, but people who aren't 
familiar with regexes won't necessarily know which they are.
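
For what it's worth, re.escape takes the guesswork out of deciding which 
characters need escaping. Here is a minimal sketch that reuses the two 
misbehaving examples above; the variable names are only for 
illustration:

```python
import re

# Escape each separator so that "^" and "." are matched literally
# instead of being treated as regex metacharacters.
pattern_caret = "|".join(re.escape(s) for s in (";", "-", "^"))
pattern_dot = "|".join(re.escape(s) for s in (";", "-", "."))

caret_parts = re.split(pattern_caret, "aaa^bbb-ccc^ddd;eee")
dot_parts = re.split(pattern_dot, "aaa.bbb-ccc;ddd;eee")

print(caret_parts)  # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
print(dot_parts)    # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```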

So really, in my opinion, there is no good built-in solution to the 
*general* problem of splitting a string on multiple arbitrary 
substrings. Perhaps str.split can act as an interface to the re module, 
automatically escaping the substrings:

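# Pseudo-implementation
import re

def split(self, substrings, maxsplit=-1):
    if isinstance(substrings, str):
        # use the current implementation
        return str.split(self, substrings, maxsplit)
    elif isinstance(substrings, tuple):
        if maxsplit == 0:
            return [self]
        regex = '|'.join(re.escape(s) for s in substrings)
        # re.split uses maxsplit=0, not -1, to mean "no limit"
        return re.split(regex, self, maxsplit=max(maxsplit, 0))
    raise TypeError("substrings must be a str or a tuple of str")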
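
In the meantime, the same idea can be tried out as a standalone helper. 
The sketch below is only one possible shape for it: multi_split is a 
made-up name, not a proposed API, and it makes one choice the 
pseudo-implementation above does not, namely trying longer separators 
first so that a separator which is a prefix of another (say "-" and 
"-->") does not win by accident.

```python
import re

def multi_split(text, separators, maxsplit=0):
    """Split text on any of the given literal separators.

    Hypothetical helper. maxsplit follows the re.split convention,
    where 0 means "no limit".
    """
    if not separators or "" in separators:
        raise ValueError("separators must be non-empty strings")
    # Try longer separators first so that "-->" wins over "-".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return re.split(pattern, text, maxsplit=maxsplit)

# The subtitle timestamp from the original question.
timestamps = multi_split("00:02:34,452 --> 00:02:37,927", (":", " --> ", ","))
print(timestamps)
# ['00', '02', '34', '452', '00', '02', '37', '927']

# Overlapping separators: the longer one is matched first.
overlapping = multi_split("a-->b-c", ("-", "-->"))
print(overlapping)
# ['a', 'b', 'c']
```

Without the length sort, the second call would split on each "-" 
separately and leave a stray ">" in the output.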

[1] Albeit not a Turing Complete one, at least not Python's version.

