Complicate str methods

Let s be a str. I propose to allow these existing str methods to take params in new forms.

s.replace(old, new):
  Allow passing in a collection of olds.
  Allow passing in a single argument, a mapping of olds to news.
  Allow the olds in the mapping to be tuples of strings.

s.split(sep), s.rsplit, s.partition:
  Allow sep to be a collection of separators.

s.startswith, s.endswith:
  Allow the argument to be a collection of strings.

s.find, s.index, s.count, x in s:
  Similar. These methods are also in `list`, which can't distinguish between items, subsequences, and subsets. However, `str` is already inconsistent with `list` here: list.M looks for an item, while str.M looks for a subsequence.

s.[r|l]strip:
  Sadly, these functions already interpret their str arguments as collections of characters.

These new forms can be optimized internally: a search for multiple candidate substrings can be more efficient than searching for them one at a time. See https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-strin...

The most significant change is to .replace. The others are simple enough to simulate with a loop. It is harder to make multiple simultaneous replacements using one .replace at a time, because previous replacements can form new things that look like replaceables (a minimal illustration follows). The easiest Python solution is to use regex or install some package, which (if you're lucky) uses regex or (if unlucky) doesn't actually simulate simultaneous replacements. (If possible, just use str.translate.)

I suppose .split on multiple separators is also annoying to simulate. Its two-argument form (with maxsplit) may be even more of a burden, though I don't know when a limited multiple-separator split is useful. The current best solution is, again, to use regex, or to install a package and hope for the best.
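A minimal illustration of that pitfall (example mine):

# Swapping 'a' and 'b' "simultaneously" should give 'ba', but chained
# .replace calls rescan the output of earlier replacements:
s = 'ab'
s.replace('a', 'b').replace('b', 'a')  # 'aa', not the intended 'ba'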

On 2/3/2018 5:04 PM, Franklin? Lee wrote:
> Let s be a str. I propose to allow these existing str methods to take params in new forms.
Thanks for the honest title. As you sort of indicate, these can all be done with the re module. However, you imply that loops are needed besides, which is mostly not true. Your complications mostly translate to existing re calls and hence are not needed.

Perhaps the 'Regular Expression HOWTO' could use more examples, or even a section on generalizing string methods. Perhaps the string method docs need a suggestion to use re for multiple string arguments, with references to the re HOWTO. Please consider taking a look at both.
import re
> s.replace(old, new): Allow passing in a collection of olds.
>>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
'user? user'

Remembering the name change ('replace' vs 'sub') is a nuisance.
> Allow passing in a single argument, a mapping of olds to news.
This needs to be a separate function, say 'dictsub', that joins the keys with '|' and calls re.sub with a function that does the lookup as the 2nd parameter. This would be a nice example for the HOWTO. As you noted, this is a generalization of str.translate, and it might be proposed as a new re module function.
> Allow the olds in the mapping to be tuples of strings.
A minor addition to dictsub; a sketch covering both forms follows.
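A sketch of what such a dictsub might look like (the escaping and longest-first ordering are my additions, per points raised later in the thread):

import re

def dictsub(mapping, string):
    # Flatten tuple keys so each old string maps to its new string.
    flat = {}
    for old, new in mapping.items():
        for o in (old if isinstance(old, tuple) else (old,)):
            flat[o] = new
    # Escape the keys and sort longest-first so 'foobar' outbids 'foo'.
    keys = sorted(flat, key=len, reverse=True)
    pattern = '|'.join(map(re.escape, keys))
    # The 2nd parameter is a function that does the lookup, as described above.
    return re.sub(pattern, lambda m: flat[m.group(0)], string)

dictsub({'cat': 'dog', 'dog': 'cat'}, 'cat chases dog')  # 'dog chases cat'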
> s.split(sep), s.rsplit, s.partition: Allow sep to be a collection of separators.
re.split is already more flexible than non-whitespace str.split and str.partition combined.
re.split, and hence str.rsplit(collection), are very sensible.
> s.startswith, s.endswith: Allow argument to be a collection of strings.
bool(re.match('|'.join(strings), s)) does exactly the proposed s.startswith, with the advantage that the actual match is available, and I think that one would nearly always want to know that match.
re.search with '^' at the beginning or '$' at the end covers both proposals, with the added flexibility of using MULTILINE mode to match at the beginning or end of lines within the string.
Comments above apply. re.search tells you which string matched as well as where. bool(re.search(...)) is 'x in s'. re.findall and re.finditer give much more info than merely a count (the count itself being sum(1 for _ in re.finditer(pattern, s))).
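For instance (sketch mine; escaping per the discussion below):

import re

s = 'spam and eggs and spam'
pat = '|'.join(map(re.escape, ['spam', 'eggs']))
bool(re.search(pat, s))   # any-of-them 'in s'
len(re.findall(pat, s))   # 3: a count across all needles
[(m.group(0), m.start()) for m in re.finditer(pat, s)]
# [('spam', 0), ('eggs', 9), ('spam', 18)]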
To avoid this, use re.sub with ^ or $ anchor and '' replacement.
>>> re.sub('(Frank|Lee)$', '', 'Franklin? Lee')
'Franklin? '
This is what re does with 's1|s2|...|sn' patterns.
No loops needed.
This problem exists for single string replacement also. The standard solution is to not backtrack and not do overlapping replacements.
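A quick illustration with single-string .replace (example mine):

# .replace scans left to right and never rescans text it just produced:
'aaa'.replace('aa', 'a')         # 'aa' -- matches don't overlap
'Franklin'.replace('an', 'ana')  # 'Franaklin' -- the new 'an' is left alone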
> The easiest Python solution is to use regex
My claim above is that this is sufficient for all but one case, which should be a new function anyway.
-- Terry Jan Reedy

On Sun, Feb 4, 2018 at 10:43 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Picking up this one as an example, but this applies to all of them: the transformation you're giving here is dangerously flawed. If there are any regex special characters in the strings, this will either bomb with an exception, or silently do the wrong thing. The correct way to do it is (at least, I think it is):

re.match("|".join(map(re.escape, strings)), testme)

With that gotcha lurking in the wings, I think this should not be cavalierly dismissed with "just 'import re' and be done with it". Maybe put some recipes in the docs showing how to do this safely?

ChrisA
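A small demonstration of the gotcha and the recipe (example mine):

import re

strings = ['C++', 'C#']  # contain regex metacharacters
text = 'C++ is not C#'

try:
    re.match('|'.join(strings), text)  # naive join
except re.error as e:
    print('naive join fails:', e)      # "multiple repeat" from the bare '+'

m = re.match('|'.join(map(re.escape, strings)), text)
print(m.group(0))  # 'C++' -- and we also learn which string matched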

On Sun, Feb 04, 2018 at 10:54:53AM +1100, Chris Angelico wrote:
Indeed. This is not Perl and "just use a regex" is not a close fit to the culture of Python. Regexes are a completely separate mini-language, and one which is the opposite of Pythonic. Instead of "executable pseudo-code", regexes are excessively terse and cryptic once you get past the simple examples. Doing anything complicated using regexes is painful.

Even Larry Wall has criticised regex syntax for choosing poor defaults and information density. (Rarely used symbols get a single character, while frequently needed symbols are coded as multiple characters, so Perlish syntax has the worst of both worlds: too terse for casual users, too verbose for experts, hard to maintain for everyone.)

Any serious programmer should have at least a passing familiarity with regexes. They are ubiquitous, and useful, especially as a common mini-language for user-specified searching. But I consider regexes to be the fall-back for when Python doesn't support the kind of string matching operation I need, not the primary solution. I would never write:

re.match('start', text)
re.search('spam', text)

when

text.startswith('start')
text.find('spam')

will do.

I think this proposal to add more power to the string methods is worth some serious consideration.

-- Steve

On Sat, Feb 3, 2018 at 6:43 PM, Terry Reedy <tjreedy@udel.edu> wrote:
My point wasn't that they _needed_ loops, but that they were simple enough to write (correctly) using loops, while simultaneous multiple .replace isn't. In any case, due to the backtracking algorithm used by re, a real specialized implementation can be much more efficient for the functions I listed, and will add value.
Nah, I'm fine. I can take the time to write inefficient simulations, or find them online. I proposed this not because I myself need it in the language, but because I think it's useful for many, yet easy to get wrong. It's the same with `sort`. In my design notes for my own toolkit, I also allow regular expressions as "olds". I didn't add that feature, since it'd cause a dependency.
Three issues:

1. As Chris Angelico pointed out, the input strings need to be escaped.
2. The input strings should also be reverse-sorted by length. Longer strings should be higher priority than shorter strings, or else 'foobar' will never outbid 'foo' (demonstrated below).
3. The substitution needs to be escaped (which is probably why it has a different name), using repl.replace('\\', r'\\'). As the docs note, using re.escape here is wrong (at least before 3.7).

Like I said, easy to get wrong.
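Issue 2 in action (example mine):

import re

# re alternation is leftmost-first, not longest-match, so order matters:
re.findall('foo|foobar', 'foobar')  # ['foo']    -- 'foobar' can never win
re.findall('foobar|foo', 'foobar')  # ['foobar'] -- reverse-sorted by length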
If this function is added to re, it should also allow patterns as keys, and we may also want to add re.compile(collection_of_strings_and_patterns).
Really? re generally uses backtracking, not a DFA. Timing indicates that it is NOT using an efficient algorithm.

import re

def findsub(haystack, needles, map=map, any=any, plusone=1 .__add__):
    return any(map(plusone, map(haystack.find, needles)))

def findsub_re(haystack, needles, bool=bool, map=map, find=re.search, esc=re.escape):
    pattern = '|'.join(map(esc, needles))
    return bool(find(pattern, haystack))

def findsub_re_cached(haystack, regex, bool=bool):
    return bool(regex.search(haystack))

n = 1000
haystack = 'a'*n + 'b'
needles = ['a'*i + 'bb' for i in range(1, n)]
needles = sorted(needles, key=len, reverse=True)
regex = re.compile('|'.join(map(re.escape, needles)))

%timeit findsub(haystack, needles)
#=> 1.81 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit findsub_re(haystack, needles)
#=> 738 ms ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit findsub_re_cached(haystack, regex)
#=> 680 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here, the naive approach using basic string operations is over 300x faster than the re approach. Even a simple pure Python tree approach is faster.

# Tree-based multiple string searches.
END = ''

def make_tree(needles):
    """Creates a tree such that each node maps characters to values.

    The tree is not optimized.
    """
    root = {}
    for needle in needles:
        cur = root
        for c in needle:
            try:
                cur = cur[c]
            except KeyError:
                cur[c] = cur = {}  #curious
        cur[END] = needle
    return root

def findsub_tree(haystack, needles):
    tree = make_tree(needles)
    nodes = [tree]
    for c in haystack:
        nodes = [n[c] for n in nodes if c in n]  #NOT OPTIMAL
        if any(END in n for n in nodes):
            return True
        nodes.append(tree)
    return False

%timeit findsub_tree(haystack, needles)
#=> 95.1 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Precomputing the tree saves about 80 ms of that time.
I don't know what you mean by the problem existing for single-string replacement, nor how your solution solves the problem. I'm talking about simulating multiple simultaneous replacements with the .replace method. .replace doesn't see previous replacements as candidates for replacement. However, using it to simulate multiple simultaneous replacements will fail if the new values form something that looks like the old values.

+1. I have the need for one or two of these in every project (of a certain size) and have to come up with solutions each time with the re module or a loop. Not a fan of regexes for trivial tasks, or for those that require a real parser.

04.02.18 00:04, Franklin? Lee wrote:
The name for these complicated str methods is regular expressions. To perform these operations efficiently, you need to convert the arguments into a special optimized form. This is what re.compile() does. If a compilation were done on every invocation of a str method, it would add too large an overhead and kill performance.

Even for a simple string search, a regular expression can be more efficient than a str method:

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop
$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop

On Feb 7, 2018 17:28, "Serhiy Storchaka" <storchaka@gmail.com> wrote:

> The name for these complicated str methods is regular expressions. To perform these operations efficiently, you need to convert the arguments into a special optimized form. This is what re.compile() does. If a compilation were done on every invocation of a str method, it would add too large an overhead and kill performance. Even for a simple string search, a regular expression can be more efficient than a str method.
>
> $ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
> 500000 loops, best of 5: 680 nsec per loop
> $ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
> 200000 loops, best of 5: 1.09 usec per loop

That's an odd result. Python regexes use backtracking, not a DFA. I gave a timing test earlier in the thread:
https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then against a pure-Python, unoptimized tree-based algorithm.

Could it be that re uses an optimization that can also be used in str? CPython uses a modified Boyer-Moore for str.find:
https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a table.

In any case, once you have branches in the regex, which is necessary to emulate these features, it will start to slow down, because it has to travel down both branches in the worst case.

Easily fixed by installing one of the alternate regex libraries. re performance and its edge cases have been discussed endlessly. Please look it up before restarting that discussion.

08.02.18 12:45, Franklin? Lee wrote:
Yes, there is a special optimization in re here. It isn't free: you need to spend some time preparing it, and you need a special object that keeps an optimized representation for faster searching. This makes it very unlikely to be used in str, because you would either have to spend the compilation time on every search, or use some kind of caching, which is not free either, adds complexity, and increases memory consumption. Note also that in the case of re, the compiler is implemented in Python; this reduces the complexity. Patches that add optimizations for other common cases are welcome.
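A sketch of the caching trade-off described here (helper names are hypothetical; the stdlib re module keeps a similar internal cache keyed on the pattern string):

import functools
import re

@functools.lru_cache(maxsize=512)
def _compiled(needles):
    # Pay the compile cost once per distinct (hashable) set of needles.
    ordered = sorted(needles, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, ordered)))

def find_any(haystack, needles):
    return _compiled(frozenset(needles)).search(haystack) is not None

The cache itself costs memory and a hash of the needle set on every call, which is exactly the trade-off Serhiy points out.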

On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
I ran Serhiy's tests (3.5.2) and got different results.

# Setup:
__builtins__.basestring = str  # hack for the re2 import in py3
import re, re2, regex
n = 10000
s = "spa"*n + "m"
p = re.compile("spam")
pgex = regex.compile("spam")
p2 = re2.compile("spam")

# Tests:
%timeit s.find("spam")
%timeit p.search(s)
%timeit pgex.search(s)
%timeit p2.search(s)

n = 100:
350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000:
2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000:
17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest. On 3.6.1 (Windows), it's about the same speed as re: 638 ns vs 662 ns; 41.3 µs vs 43.8 µs.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect it's due to the wrapper; it may be copying the strings to a format suitable for the backend.

P.S.: I also tested `"spam" in s`, which was linearly slower than `.find`. However, `in` is consistently faster than `.find` in my 3.6, so the discrepancy has likely been fixed. More curious is that, on `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are twice as slow as my 3.5.2 for Ubuntu for Windows, but the re performance is similar. It's probably a compiler thing.

I don't think these should be str methods. They should be a separate class, so the internal tree can be reused. There are some Aho-Corasick implementations on PyPI. As far as I know, AC does longest match. On the other hand, Go's replacer (it's trie-based too) is documented as:

> Replacements are performed in order, without overlapping matches.

https://golang.org/pkg/strings/#NewReplacer
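A sketch of the compile-once, reuse-many-times idea (the class and names are illustrative, not a real PyPI API; the same re.sub trick as the dictsub sketch above, but the compiled pattern plays the role of the reusable internal tree, and it picks longest match rather than Go's declaration order):

import re

class Replacer:
    def __init__(self, mapping):
        self._map = dict(mapping)
        keys = sorted(self._map, key=len, reverse=True)  # longest match wins
        self._pat = re.compile('|'.join(map(re.escape, keys)))

    def replace(self, s):
        return self._pat.sub(lambda m: self._map[m.group(0)], s)

r = Replacer({'a': 'b', 'b': 'a'})
r.replace('ab')  # 'ba' -- simultaneous, unlike chained str.replace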
-- INADA Naoki <songofacandy@gmail.com>

On 03/02/2018 at 23:04, Franklin? Lee wrote:
I second that proposal. I regularly need those, and feel frustrated when it doesn't work. I even wrote a wrapper that does exactly this:
https://github.com/Tygs/ww/blob/master/src/ww/wrappers/strings.py
But because it's pure Python, it's guaranteed to be slow. Plus you need to install it every time you need it.

participants (11):
- Chris Angelico
- Franklin? Lee
- INADA Naoki
- Michel Desmoulin
- Mike Miller
- Neil Girdhar
- Serhiy Storchaka
- Steve Dower
- Steven D'Aprano
- Sven R. Kunze
- Terry Reedy