list as parameter for the split function

Hello everyone,
I was wondering how to split a string with multiple separators. For instance, if I edit a subtitle file and I want the string '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452', '00', '02', '37', '927'], I have to call split too many times, and I didn't find a "clean" way to do it. I imagined the split function with an iterator as parameter. The string would be split each time its substring is in the iterator.
Here is the syntax I considered for this:
    '00:02:34,452 --> 00:02:37,927'.split([':', ' --> ', ','])
    ['00', '02', '34', '452', '00', '02', '37', '927']
Is it a relevant idea? What do you think about it?
Regards, Niilos.

    import re
    # no capturing group, so the separators are not returned with the parts
    parts = re.split(r':| --> |,', '00:02:34...')
On Mon, Sep 28, 2015 at 5:10 PM, Niilos <niilos@gmx.com> wrote:
Hello everyone,
I was wondering how to split a string with multiple separators. For instance, if I edit a subtitle file and I want the string '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452', '00', '02', '37', '927'], I have to call split too many times, and I didn't find a "clean" way to do it. I imagined the split function with an iterator as parameter. The string would be split each time its substring is in the iterator.
Here is the syntax I considered for this:
    '00:02:34,452 --> 00:02:37,927'.split([':', ' --> ', ','])
    ['00', '02', '34', '452', '00', '02', '37', '927']
Is it a relevant idea? What do you think about it?
Regards, Niilos.
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
--
Ryan
[ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong.
http://kirbyfan64.github.io/

Really, you could also just use re.match:
    import re
    pat = re.compile(r'(\d\d):(\d\d):(\d\d),(\d{3}) --> (\d\d):(\d\d):(\d\d),(\d{3})')
    def parse(string):
        m = pat.match(string)
        # groups() holds the eight captured fields; None signals invalid input
        return list(m.groups()) if m else None
    print(parse('00:02:34,452 --> 00:02:37,927'))
    # prints ['00', '02', '34', '452', '00', '02', '37', '927']
That way, if the input is invalid, `None` will be returned, so you have free error checking (sort of).
--
Ryan
[ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong.
http://kirbyfan64.github.io/

On 28 September 2015 at 23:10, Niilos <niilos@gmx.com> wrote:
I was wondering how to split a string with multiple separators. For instance, if I edit a subtitle file and I want the string '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452', '00', '02', '37', '927'], I have to call split too many times, and I didn't find a "clean" way to do it.
You can use re.split:
    >>> import re
    >>> re.split(r':|,| --> ', '00:02:34,452 --> 00:02:37,927')
    ['00', '02', '34', '452', '00', '02', '37', '927']
Paul

Two possibilities:
1) Replace all separators with the same one.
    '00:02:34,452 --> 00:02:37,927'.replace(",",":").replace(" --> ",":").split(":")
2) Use a regular expression.
    import re
    re.split(":|,| --> ",'00:02:34,452 --> 00:02:37,927')
    # or working the other way: find all the digit strings
    re.findall("[0-9]+",'00:02:34,452 --> 00:02:37,927')
You could also consider a more full parser; presumably splitting into strings is just the first step. I don't have anything handy in Python, but there would be ways of doing the whole thing in fewer steps.
ChrisA
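One way the "more full parser" idea could look, as a rough sketch only: the names `TIMING` and `parse_timing` are made up for illustration, and the pattern assumes the timing line always has the shape shown in the original post. It turns the line straight into two `datetime.timedelta` values instead of a list of strings.

```python
import re
from datetime import timedelta

# Hypothetical pattern for one timing line of the form
# 'HH:MM:SS,mmm --> HH:MM:SS,mmm' (an assumption based on the example string).
TIMING = re.compile(
    r'(\d+):(\d\d):(\d\d),(\d{3}) --> (\d+):(\d\d):(\d\d),(\d{3})'
)

def parse_timing(line):
    """Return (start, end) as timedelta objects, or None if the line does not match."""
    match = TIMING.fullmatch(line.strip())
    if match is None:
        return None
    # Convert all eight captured fields to integers in one pass.
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
    end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
    return start, end

start, end = parse_timing('00:02:34,452 --> 00:02:37,927')
duration = end - start  # how long the subtitle stays on screen
```

Working with `timedelta` values makes follow-up steps, such as shifting every subtitle by a fixed offset, plain arithmetic rather than more string handling.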

On Sep 28, 2015, at 15:10, Niilos <niilos@gmx.com> wrote:
Hello everyone,
I was wondering how to split a string with multiple separators. For instance, if I edit a subtitle file and I want the string '00:02:34,452 --> 00:02:37,927' to become ['00', '02', '34', '452', '00', '02', '37', '927'], I have to call split too many times, and I didn't find a "clean" way to do it. I imagined the split function with an iterator as parameter. The string would be split each time its substring is in the iterator.
As a side note, a list is not an Iterator. It's an iterable, but an Iterator is a special kind of iterable that only allows one pass, which is definitely not what you want here. In fact, what you probably want is a sequence (or maybe just a container, since the only thing you want to do is test "in").
Also, the way you've defined this ("each time its substring is in the iterator") is either ambiguous, or inherently expensive, depending on how you read it. And once you work out what you actually mean, it's hard to express it better than as a regular expression, which is why half a dozen people jumped to that answer.
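Both points can be seen in a few lines. This is an illustrative sketch, not something from the thread: the variable names are invented, and the 'xaby' string is a made-up input chosen so that one separator is a prefix of another.

```python
import re

separators = [':', ' --> ', ',']

# A list is an iterable that can be searched any number of times.
in_list_twice = (':' in separators) and (':' in separators)

# An iterator is consumed while it is searched, so the second test fails.
it = iter(separators)
first = ':' in it    # True: the search stops after consuming ':'
second = ':' in it   # False: ':' is gone and the rest is now exhausted

# Ambiguity: when one separator is a prefix of another, the answer depends
# on which alternative is tried first.
a_first = re.split('a|ab', 'xaby')    # ['x', 'by']
ab_first = re.split('ab|a', 'xaby')   # ['x', 'y']
```

A regular expression forces that choice to be spelled out through the order of the alternatives, which is part of why it expresses the intent so directly.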

Quite a few string methods accept multiple alternatives as a tuple, e.g.:
    py> "spam".startswith(("a", "+", "sp"))
    True
and I've often wished that split would be one of them. The substring argument could accept a string (as it does now) or a tuple of strings.
There are other solutions, but they have issues:
(1) Writing your own custom string mini-parser and getting it right is harder than it sounds. Certainly it's not simple enough to reinvent this particular tool each time you want it.
(2) Using replace to change all the substrings to one:
    py> text = "aaa,bbb ccc;ddd,eee fff"
    py> text.replace(",", " ").replace(";", " ").split()
    ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']
works well enough for simple cases, but if you have a lot of text, having to call replace multiple times can be expensive.
(3) Using a regular expression is probably the "right" answer, at least from a comp sci theoretical perspective. This is precisely the sort of thing that regexes are designed for. Unfortunately, regex syntax is itself a programming language[1], and a particularly cryptic and unforgiving one, so even quite experienced coders can have trouble. At first it seems easy:
    py> re.split(r";|-|~", "aaa~bbb-ccc;ddd;eee")
    ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
but then seemingly minor changes make it misbehave:
    py> re.split(r";|-|^", "aaa^bbb-ccc^ddd;eee")
    ['aaa^bbb', 'ccc^ddd', 'eee']
    py> re.split(r";|-|.", "aaa.bbb-ccc;ddd;eee")
    ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
The solution is to escape the metacharacters, but people who aren't familiar with regexes won't necessarily know which they are.
So really, in my opinion, there is no good built-in solution to the *general* problem of splitting a string on multiple arbitrary substrings. Perhaps str.split can act as an interface to the re module, automatically escaping the substrings:
    # Pseudo-implementation
    def split(self, substrings, maxsplit=None):
        if isinstance(substrings, str):
            # use the current implementation
            ...
        elif isinstance(substrings, tuple):
            regex = '|'.join(re.escape(s) for s in substrings)
            # re.split needs an integer maxsplit; 0 means "no limit"
            return re.split(regex, self, maxsplit or 0)
[1] Albeit not a Turing Complete one, at least not Python's version.
-- Steve
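The escaping idea in the pseudo-implementation can be tried today as a standalone function. The following is only a sketch under stated assumptions: `multi_split` is a hypothetical name, not part of `str` or `re`, and trying the separators longest-first is an added choice (not in the post above) so that a short separator cannot shadow a longer one that starts with it.

```python
import re

def multi_split(text, separators, maxsplit=0):
    """Split text on any of the given literal separators (hypothetical helper).

    maxsplit follows re.split: 0 means no limit.
    """
    if isinstance(separators, str):
        separators = (separators,)
    if not separators or any(sep == '' for sep in separators):
        raise ValueError('separators must be non-empty strings')
    # Assumption: try longer separators first so that ' --> ' would win over ' '.
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape neutralises metacharacters such as '.', '^' and '|'.
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(pattern, text, maxsplit=maxsplit)

timing = multi_split('00:02:34,452 --> 00:02:37,927', (':', ' --> ', ','))
# ['00', '02', '34', '452', '00', '02', '37', '927']
dotted = multi_split('aaa.bbb-ccc;ddd;eee', (';', '-', '.'))
# ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```

Because every separator is escaped, the two inputs that misbehaved with hand-written patterns ('^' and '.') split on the literal characters.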

On Mon, Sep 28, 2015 at 8:43 PM, Steven D'Aprano <steve@pearwood.info> wrote:
(3) Using a regular expression is probably the "right" answer, at least from a comp sci theoretical perspective. This is precisely the sort of thing that regexes are designed for. Unfortunately, regex syntax is itself a programming language[1], and a particularly cryptic and unforgiving one, so even quite experienced coders can have trouble.
indeed -- we all know the old maxim: "I had a problem, and thought 'I know, I'll use regular expressions' -- now I have two problems."
And the Python "obvious way to do it" for simple string manipulation has always been: see if what you need is in a string method before you bring out the big guns of REs. After all, if "use REs" was the answer to simple string manipulation problems, the string object would have a lot fewer methods.
So: I've frequently had this use-case, too -- it would be a nice enhancement that would have substantial utility for strings. Whether it used an re under the hood or not should be an implementation detail.
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov

+1 for the feature, given as a tuple. Agreed with the original, parent, and grandparent posts; I have encountered this numerous times over the years. Reaching for the re module and docs.python and/or stackoverflow to split a string (with two delimiters) feels like swatting a fly with a sledge-hammer. ;) Conversely, I've not encountered the need as often with .startswith, which does support it.
-Mike
participants (9)
- Andrew Barnert
- Chris Angelico
- Chris Barker
- Emile van Sebille
- Mike Miller
- Niilos
- Paul Moore
- Ryan Gonzalez
- Steven D'Aprano