[Python-ideas] list as parameter for the split function
Steven D'Aprano
steve at pearwood.info
Tue Sep 29 05:43:18 CEST 2015
On Tue, Sep 29, 2015 at 12:10:34AM +0200, Niilos wrote:
> Hello everyone,
>
> I was wondering how to split a string with multiple separators.
> For instance, if I edit some subtitle file and I want the string
> '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452',
> '00', '02', '37', '927'] I have to use split too much time and I didn't
> find a "clean" way to do it.
> I imagined the split function with an iterator as parameter. The string
> would be split each time its substring is in the iterator.
>
> Here is the syntax I considered for this :
>
> >>> '00:02:34,452 --> 00:02:37,927'.split([ ':', ' --> ', ',' ])
> ['00', '02', '34', '452', '00', '02', '37', '927']
>
> Is it a relevant idea ? What do you think about it ?
Quite a few string methods accept a tuple of alternatives, e.g.:
py> "spam".startswith(("a", "+", "sp"))
True
and I've often wished that split would be one of them. The substring
argument could accept a string (as it does now) or a tuple of strings.
There are other solutions, but they have issues:
(1) Writing your own custom string mini-parser and getting it right is
harder than it sounds. Certainly it's not simple enough to reinvent this
particular tool each time you want it.
(2) Using replace to change all the substrings to one:
py> text = "aaa,bbb ccc;ddd,eee fff"
py> text.replace(",", " ").replace(";", " ").split()
['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']
works well enough for simple cases, but if you have a lot of text,
having to call replace multiple times can be expensive.
(3) Using a regular expression is probably the "right" answer, at least
from a comp sci theoretical perspective. This is precisely the sort of
thing that regexes are designed for. Unfortunately, regex syntax is
itself a programming language[1], and a particularly cryptic and
unforgiving one, so even quite experienced coders can have trouble.
At first it seems easy:
py> re.split(r";|-|~", "aaa~bbb-ccc;ddd;eee")
['aaa', 'bbb', 'ccc', 'ddd', 'eee']
but then seemingly minor changes make it misbehave:
py> re.split(r";|-|^", "aaa^bbb-ccc^ddd;eee")
['aaa^bbb', 'ccc^ddd', 'eee']
py> re.split(r";|-|.", "aaa.bbb-ccc;ddd;eee")
['', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '']
The solution is to escape the metacharacters, but people who aren't
familiar with regexes won't necessarily know which they are.
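As a minimal sketch of that fix, the two misbehaving calls above can be rewritten with re.escape applied to the troublesome separator, so that "^" and "." are treated as literal characters. The variable names are only for illustration.

```python
import re

# re.escape turns a metacharacter into a literal: "^" -> "\^", "." -> "\."
caret = re.split(r";|-|" + re.escape("^"), "aaa^bbb-ccc^ddd;eee")
dot = re.split(r";|-|" + re.escape("."), "aaa.bbb-ccc;ddd;eee")

print(caret)  # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
print(dot)    # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```

Escaping every separator unconditionally is the safer habit, since it removes the need to remember which characters are special.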
So really, in my opinion, there is no good built-in solution to the
*general* problem of splitting a string on multiple arbitrary
substrings. Perhaps str.split can act as an interface to the re module,
automatically escaping the substrings:
# Pseudo-implementation
import re

def split(self, substrings, maxsplit=-1):
    if isinstance(substrings, str):
        # use the current implementation
        ...
    elif isinstance(substrings, tuple):
        if maxsplit == 0:
            return [self]
        regex = '|'.join(re.escape(s) for s in substrings)
        # re.split uses maxsplit=0 for "no limit"; str.split uses -1
        return re.split(regex, self, maxsplit=max(maxsplit, 0))
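For anyone who wants to try the idea today, here is a rough standalone version, written as a plain function rather than a str method. The name split_any is hypothetical, and sorting the separators longest-first is my own assumption rather than part of the proposal: regex alternation takes the first alternative that matches, so without the sort a short separator such as "-" would shadow a longer one such as "-->".

```python
import re

def split_any(text, separators, maxsplit=-1):
    """Split text on any of the literal separators (hypothetical helper)."""
    if isinstance(separators, str):
        separators = (separators,)
    if not separators or "" in separators:
        raise ValueError("empty separator")
    if maxsplit == 0:
        # Mirror str.split: zero splits means the text comes back whole.
        return [text]
    # Try longer separators first so "-->" is preferred over "-".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    # re.split uses 0 for "no limit"; str.split uses -1.
    return re.split(pattern, text, maxsplit=max(maxsplit, 0))

print(split_any('00:02:34,452 --> 00:02:37,927', (':', ' --> ', ',')))
# ['00', '02', '34', '452', '00', '02', '37', '927']
```

This handles the subtitle example from the original question, but it does not reproduce the whitespace-collapsing behaviour of str.split() called with no separator.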
[1] Albeit not a Turing Complete one, at least not Python's version.
--
Steve