str.substring_after and str.substring_before as in Kotlin
Dear pythonistas,

I would really like to have functions like String.substringAfter(sep) and String.substringBefore(sep), as in Kotlin:

    fun String.substringAfter(
        delimiter: String,
        missingDelimiterValue: String = this
    ): String

substringAfter takes a delimiter and keeps everything after its first occurrence; substringBefore keeps everything before it. This could be useful for simple HTML parsing, for example. Both also take a second argument to use as the value when the separator is not found (it defaults to the whole input String).

----

A lot of code in the wild currently uses str.split(sep) or manual indexing, as exemplified by the two top answers here:
https://stackoverflow.com/q/12572362/2111778

- str.split(sep) runs into issues if the separator occurs repeatedly:

    substringBefore = lambda s, sep: s.split(sep)[0]   # Don't use this.
    substringAfter = lambda s, sep: s.split(sep)[1]    # Don't use this.
    substringAfter = lambda s, sep: s.split(sep)[-1]   # "Fixes" IndexError, but still don't use this.

This will bite users who don't know about the second argument, which limits the number of splits:

    substringBefore = lambda s, sep: s.split(sep, 1)[0]
    substringAfter = lambda s, sep: s.split(sep, 1)[1]   # IndexError if nonexistent
    substringAfter = lambda s, sep: s.split(sep, 1)[-1]  # original string if nonexistent

I have regrettably even written code like this before:

    substringAfter = lambda s, sep: sep.join(s.split(sep)[1:])

- Another approach uses indexing:

    substringAfter = lambda s, sep: s[s.index(sep) + len(sep):]

This works okay as a separate function but cannot be inlined very well, because 'sep' and 's' are both used twice:

    after = s[s.index('some separator string') + len('some separator string'):]

Plus it's too much cognitive load for a simple "substring after" operation.

- Regexes can be used, but these are too powerful and can introduce subtle bugs. Java has this problem with its String#split method, which takes a regex String.
So while in Python '8.8..'.split('.') == ['8', '8', '', ''], in Java you have to doubly escape the dot using "8.8..".split("\\."), EXCEPT that this results in a String[] of length 2 rather than 4, so you ACTUALLY have to write "8.8..".split("\\.", -1). So regexes have huge potential for confusing users. Another example with a substringAfter/substringBefore use case:

    >>> import re
    >>> re.findall(r'\(.*\)', 'whitespace bad (haha) (not really)')
    ['(haha) (not really)']

It is not obvious how to fix this:

    >>> re.findall(r'\(.*?\)', 'whitespace bad (haha) (not really)')
    ['(haha)', '(not really)']

- The best alternative currently is str.partition(sep), but such code is not very readable, and most users do not know about it, as the StackOverflow link shows. Note that if the separator is not found, this yields an empty string for substringAfter and the original str for substringBefore:

    substringAfter = lambda s, sep: s.partition(sep)[2]
    substringBefore = lambda s, sep: s.partition(sep)[0]

----

If added to the <str> class, a typical use case would be something like this:

    bracketed = 'whitespace bad (haha) (not really)'.substringAfter('(').substringBefore(')')
    assert bracketed == 'haha'

I find this code highly readable. Currently the best alternatives would be:

    bracketed = 'whitespace bad (haha) (not really)'.partition('(')[2].partition(')')[0]
    bracketed = 'whitespace bad (haha) (not really)'.split('(', 1)[-1].split(')', 1)[0]
    bracketed = 'whitespace bad (haha) (not really)'.split('(', 1)[1].split(')', 1)[0]

None of these is very readable; the latter two even have four seemingly random integer constants. Even while writing this I got them wrong multiple times. Plus they differ in behavior: the first one returns an empty string if the separators are not found, the second one returns the original string instead, and the third one throws an IndexError (but only if the first separator is missing, not if the second one is, yikes).
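The behavioral differences between those three one-liners can be checked directly. This is only a minimal sketch; the function names via_partition, via_split_last and via_split_index are made up here for illustration and do not appear in the original message:

```python
def via_partition(s):
    # First alternative: partition on '(' then on ')'.
    return s.partition('(')[2].partition(')')[0]

def via_split_last(s):
    # Second alternative: split once, take the last piece.
    return s.split('(', 1)[-1].split(')', 1)[0]

def via_split_index(s):
    # Third alternative: split once, index piece 1 (may raise IndexError).
    return s.split('(', 1)[1].split(')', 1)[0]

text = 'whitespace bad (haha) (not really)'
# All three agree when both separators are present.
assert via_partition(text) == via_split_last(text) == via_split_index(text) == 'haha'

no_brackets = 'no brackets here'
print(repr(via_partition(no_brackets)))   # ''
print(repr(via_split_last(no_brackets)))  # 'no brackets here'
try:
    via_split_index(no_brackets)
except IndexError:
    print('IndexError')                   # opening separator missing

# Only the closing separator is missing: no exception is raised.
print(repr(via_split_index('open (only')))  # 'only'
```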
Monkey-patching the <str> class to add str.substringAfter(sep) on the user side is also not possible, as it is a built-in type implemented in C.

I think this would fit well in Python (apart from the camelCase ;)) because it would complement removeprefix/removesuffix, which are being added in 3.9 already. Plus I do use substringAfter/substringBefore in Kotlin all the time.

One might even think about a substringBetween, to be honest:

    bracketed = 'whitespace bad (haha) (not really)'.substringBetween('(', ')')
    assert bracketed == 'haha'

As an alternative, I think str.partition(sep) could be changed to return a NamedTuple rather than a simple tuple. This should be interoperable with pre-existing code. It could enable calls as follows:

    substringAfter = lambda s, sep: s.partition(sep).after
    substringBefore = lambda s, sep: s.partition(sep).before
    bracketed = 'whitespace bad (haha) (not really)'.partition('(').after.partition(')').before

What do you think?

Greetings
Jan
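For reference, the proposed behavior can be approximated today with plain functions built on str.partition. The following is only a sketch under assumptions: the names substring_after, substring_before and substring_between, the `missing` parameter, and the `_MISSING` sentinel are hypothetical, chosen here to mirror Kotlin's missingDelimiterValue default (the whole input string). The not-found behavior of substring_between is likewise an arbitrary choice.

```python
_MISSING = object()  # hypothetical sentinel meaning "no fallback value supplied"

def substring_after(s, sep, missing=_MISSING):
    """Return the part of s after the first occurrence of sep.

    If sep is absent, return `missing`, or s itself when no fallback
    is given (mirroring Kotlin's default). An empty sep raises
    ValueError, because str.partition does.
    """
    _, found, after = s.partition(sep)
    if found:
        return after
    return s if missing is _MISSING else missing

def substring_before(s, sep, missing=_MISSING):
    """Return the part of s before the first occurrence of sep."""
    before, found, _ = s.partition(sep)
    if found:
        return before
    return s if missing is _MISSING else missing

def substring_between(s, start, end):
    """Assumed semantics: text after the first `start`, up to the next `end`."""
    return substring_before(substring_after(s, start), end)

text = 'whitespace bad (haha) (not really)'
print(substring_between(text, '(', ')'))      # haha
print(substring_after('a=b=c', '='))          # b=c
print(repr(substring_after('abc', '=', '')))  # ''
```

Because these are free functions rather than methods, calls nest instead of chaining, which loses some of the readability the proposal is after.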
On Sun, Apr 4, 2021 at 1:10 AM <janfrederik.konopka@gmail.com> wrote:
> - The best alternative currently is str.partition(sep), but such code is not very readable, and most users do not know about it, as the StackOverflow link shows. Note that if the separator is not found, this yields an empty string for substringAfter and the original str for substringBefore:
>
>     substringAfter = lambda s, sep: s.partition(sep)[2]
>     substringBefore = lambda s, sep: s.partition(sep)[0]
If the biggest problem is "people don't know about it", then adding a new function isn't going to solve that :)
> As an alternative, I think str.partition(sep) could be changed to return a NamedTuple rather than a simple tuple.
Not sure what the performance implications would be, but that seems pretty reasonable. But what I'd recommend is using unpacking:

    before, sep, after = s.partition(sep)

Job done. Everything's clear, you can use "if sep:" to find out if the separator was present or not, and you can define the precise semantics on not-found very easily.

ChrisA
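Both suggestions in this reply can be sketched in a few lines. The names below (Partition, partition_named, parse_setting) are hypothetical and only illustrate the idea; nothing here is an existing str API beyond str.partition itself.

```python
from typing import NamedTuple

class Partition(NamedTuple):
    # Hypothetical named result, standing in for what str.partition might return.
    before: str
    sep: str
    after: str

def partition_named(s, sep):
    """Wrap str.partition so the three parts can be accessed by name."""
    return Partition(*s.partition(sep))

def parse_setting(line):
    """Unpacking variant: the caller decides what 'not found' means."""
    key, found, value = line.partition('=')
    if not found:  # the middle element is '' when the separator is absent
        raise ValueError(f'missing "=" in {line!r}')
    return key.strip(), value.strip()

text = 'whitespace bad (haha) (not really)'
print(partition_named(partition_named(text, '(').after, ')').before)  # haha
print(parse_setting('name = value=with=equals'))  # ('name', 'value=with=equals')

# A NamedTuple still compares equal to, and unpacks like, a plain tuple.
assert partition_named('a=b', '=') == ('a', '=', 'b')
```

The unpacking form needs no new type, while the named form reads better in chained expressions; which trade-off is preferable is left open in the thread.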
participants (2): Chris Angelico, janfrederik.konopka@gmail.com