Complicate str methods

Let s be a str. I propose to allow these existing str methods to take params in new forms.

s.replace(old, new):
  Allow passing in a collection of olds.
  Allow passing in a single argument, a mapping of olds to news.
  Allow the olds in the mapping to be tuples of strings.

s.split(sep), s.rsplit, s.partition:
  Allow sep to be a collection of separators.

s.startswith, s.endswith:
  Allow the argument to be a collection of strings.

s.find, s.index, s.count, x in s:
  Similar. These methods are also in `list`, which can't distinguish between items, subsequences, and subsets. However, `str` is already inconsistent with `list` here: list.M looks for an item, while str.M looks for a subsequence.

s.[r|l]strip:
  Sadly, these functions already interpret their str arguments as collections of characters.

These new forms can be optimized internally: a search for multiple candidate substrings can be more efficient than searching for them one at a time. See https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-strin...

The most significant change is to .replace. The others are simple enough to simulate with a loop. It is harder to make multiple simultaneous replacements using one .replace at a time, because previous replacements can form new things that look like replaceables (a minimal illustration follows). The easiest Python solution is to use regex or install some package, which (if you're lucky) uses regex or (if unlucky) doesn't actually simulate simultaneous replacements. (If possible, just use str.translate.)

I suppose .split on multiple separators is also annoying to simulate. Its two-argument form (with maxsplit) may be even more of a burden, though I don't know when a limited multiple-separator split is useful. The current best solution is, again, to use regex, or to install a package and hope for the best.
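A minimal illustration of that pitfall (example mine):

# Swapping 'a' and 'b' "simultaneously" should give 'ba', but chained
# .replace calls rescan the output of earlier replacements:
s = 'ab'
s.replace('a', 'b').replace('b', 'a')  # 'aa', not the intended 'ba'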

On 2/3/2018 5:04 PM, Franklin? Lee wrote:
> Let s be a str. I propose to allow these existing str methods to take params in new forms.
Thanks for the honest title. As you sort of indicate, these can all be done with the re module. However, you imply that loops are needed besides, which is mostly not true. Your complications mostly translate to existing re calls and hence are not needed.

Perhaps the 'Regular Expression HOWTO' could use more examples, or even a section on generalizing string methods. Perhaps the string method docs need a suggestion to use re for multiple string arguments, with references to the re HOWTO. Please consider taking a look at both.
import re
> s.replace(old, new): Allow passing in a collection of olds.
>>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
'user? user'

Remembering the name change ('replace' vs 'sub') is a nuisance.
> Allow passing in a single argument, a mapping of olds to news.
This needs to be a separate function, say 'dictsub', that joins the keys with '|' and calls re.sub with a function that does the lookup as the 2nd parameter. This would be a nice example for the HOWTO. As you noted, this is a generalization of str.translate, and it might be proposed as a new re module function.
> Allow the olds in the mapping to be tuples of strings.
A minor addition to dictsub; a sketch covering both forms follows.
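A sketch of what such a dictsub might look like (the escaping and longest-first ordering are my additions, per points raised later in the thread):

import re

def dictsub(mapping, string):
    # Flatten tuple keys so each old string maps to its new string.
    flat = {}
    for old, new in mapping.items():
        for o in (old if isinstance(old, tuple) else (old,)):
            flat[o] = new
    # Escape the keys and sort longest-first so 'foobar' outbids 'foo'.
    keys = sorted(flat, key=len, reverse=True)
    pattern = '|'.join(map(re.escape, keys))
    # The 2nd parameter is a function that does the lookup, as described above.
    return re.sub(pattern, lambda m: flat[m.group(0)], string)

dictsub({'cat': 'dog', 'dog': 'cat'}, 'cat chases dog')  # 'dog chases cat'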
> s.split(sep), s.rsplit, s.partition: Allow sep to be a collection of separators.
re.split is already more flexible than non-whitespace str.split and str.partition combined.
re.split, and hence str.rsplit(collection), are very sensible.
> s.startswith, s.endswith: Allow argument to be a collection of strings.
bool(re.match('|'.join(strings), s)) does exactly the proposed s.startswith, with the advantage that the actual match is available, and I think that one would nearly always want to know that match.
re.search with '^' at the beginning or '$' at the end covers both proposals, with the added flexibility of using MULTILINE mode to match at the beginning or end of lines within the string.
Comments above apply. re.search tells you which string matched as well as where. bool(re.search(...)) is 'x in s'. re.findall and re.finditer give much more info than merely a count (the count itself being sum(1 for _ in re.finditer(pattern, s))).
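For instance (sketch mine; escaping per the discussion below):

import re

s = 'spam and eggs and spam'
pat = '|'.join(map(re.escape, ['spam', 'eggs']))
bool(re.search(pat, s))   # any-of-them 'in s'
len(re.findall(pat, s))   # 3: a count across all needles
[(m.group(0), m.start()) for m in re.finditer(pat, s)]
# [('spam', 0), ('eggs', 9), ('spam', 18)]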
To avoid this, use re.sub with ^ or $ anchor and '' replacement.
>>> re.sub('(Frank|Lee)$', '', 'Franklin? Lee')
'Franklin? '
This is what re does with 's1|s2|...|sn' patterns.
No loops needed.
This problem exists for single string replacement also. The standard solution is to not backtrack and not do overlapping replacements.
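A quick illustration with single-string .replace (example mine):

# .replace scans left to right and never rescans text it just produced:
'aaa'.replace('aa', 'a')         # 'aa' -- matches don't overlap
'Franklin'.replace('an', 'ana')  # 'Franaklin' -- the new 'an' is left alone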
> The easiest Python solution is to use regex
My claim above is that this is sufficient for all but one case, which should be a new function anyway.
-- Terry Jan Reedy

On Sun, Feb 4, 2018 at 10:43 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Picking up this one as an example, but this applies to all of them: the transformation you're giving here is dangerously flawed. If there are any regex special characters in the strings, this will either bomb with an exception, or silently do the wrong thing. The correct way to do it is (at least, I think it is):

re.match("|".join(map(re.escape, strings)), testme)

With that gotcha lurking in the wings, I think this should not be cavalierly dismissed with "just 'import re' and be done with it". Maybe put some recipes in the docs showing how to do this safely?

ChrisA
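A small demonstration of the gotcha and the recipe (example mine):

import re

strings = ['C++', 'C#']  # contain regex metacharacters
text = 'C++ is not C#'

try:
    re.match('|'.join(strings), text)  # naive join
except re.error as e:
    print('naive join fails:', e)      # "multiple repeat" from the bare '+'

m = re.match('|'.join(map(re.escape, strings)), text)
print(m.group(0))  # 'C++' -- and we also learn which string matched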

On Sun, Feb 04, 2018 at 10:54:53AM +1100, Chris Angelico wrote:
Indeed. This is not Perl and "just use a regex" is not a close fit to the culture of Python. Regexes are a completely separate mini-language, and one which is the opposite of Pythonic. Instead of "executable pseudo-code", regexes are excessively terse and cryptic once you get past the simple examples. Doing anything complicated using regexes is painful.

Even Larry Wall has criticised regex syntax for choosing poor defaults and information density. (Rarely used symbols get a single character, while frequently needed symbols are coded as multiple characters, so Perlish syntax has the worst of both worlds: too terse for casual users, too verbose for experts, hard to maintain for everyone.)

Any serious programmer should have at least a passing familiarity with regexes. They are ubiquitous, and useful, especially as a common mini-language for user-specified searching. But I consider regexes to be the fall-back for when Python doesn't support the kind of string matching operation I need, not the primary solution. I would never write:

re.match('start', text)
re.search('spam', text)

when

text.startswith('start')
text.find('spam')

will do.

I think this proposal to add more power to the string methods is worth some serious consideration.

-- Steve

On Sat, Feb 3, 2018 at 6:43 PM, Terry Reedy <tjreedy@udel.edu> wrote:
My point wasn't that they _needed_ loops, but that they were simple enough to write (correctly) using loops, while simultaneous multiple .replace isn't. In any case, due to the backtracking algorithm used by re, a real specialized implementation can be much more efficient for the functions I listed, and will add value.
Nah, I'm fine. I can take the time to write inefficient simulations, or find them online. I proposed this not because I myself need it in the language, but because I think it's useful for many, yet easy to get wrong. It's the same with `sort`. In my design notes for my own toolkit, I also allow regular expressions as "olds". I didn't add that feature, since it'd cause a dependency.
Three issues:

1. As Chris Angelico pointed out, the input strings need to be escaped.
2. The input strings should also be reverse-sorted by length. Longer strings should be higher priority than shorter strings, or else 'foobar' will never outbid 'foo' (demonstrated below).
3. The substitution needs to be escaped (which is probably why it has a different name), using repl.replace('\\', r'\\'). As the docs note, using re.escape here is wrong (at least before 3.7).

Like I said, easy to get wrong.
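Issue 2 in action (example mine):

import re

# re alternation is leftmost-first, not longest-match, so order matters:
re.findall('foo|foobar', 'foobar')  # ['foo']    -- 'foobar' can never win
re.findall('foobar|foo', 'foobar')  # ['foobar'] -- reverse-sorted by length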
If this function is added to re, it should also allow patterns as keys, and we may also want to add re.compile(collection_of_strings_and_patterns).
Really? re generally uses backtracking, not a DFA. Timing indicates that it is NOT using an efficient algorithm.

import re

def findsub(haystack, needles, map=map, any=any, plusone=1 .__add__):
    return any(map(plusone, map(haystack.find, needles)))

def findsub_re(haystack, needles, bool=bool, map=map, find=re.search, esc=re.escape):
    pattern = '|'.join(map(esc, needles))
    return bool(find(pattern, haystack))

def findsub_re_cached(haystack, regex, bool=bool):
    return bool(regex.search(haystack))

n = 1000
haystack = 'a'*n + 'b'
needles = ['a'*i + 'bb' for i in range(1, n)]
needles = sorted(needles, key=len, reverse=True)
regex = re.compile('|'.join(map(re.escape, needles)))

%timeit findsub(haystack, needles)
#=> 1.81 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit findsub_re(haystack, needles)
#=> 738 ms ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit findsub_re_cached(haystack, regex)
#=> 680 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here, the naive approach using basic string operations is over 300x faster than the re approach. Even a simple pure Python tree approach is faster.

# Tree-based multiple string searches.
END = ''

def make_tree(needles):
    """Creates a tree such that each node maps characters to values.

    The tree is not optimized.
    """
    root = {}
    for needle in needles:
        cur = root
        for c in needle:
            try:
                cur = cur[c]
            except KeyError:
                cur[c] = cur = {}  #curious
        cur[END] = needle
    return root

def findsub_tree(haystack, needles):
    tree = make_tree(needles)
    nodes = [tree]
    for c in haystack:
        nodes = [n[c] for n in nodes if c in n]  #NOT OPTIMAL
        if any(END in n for n in nodes):
            return True
        nodes.append(tree)
    return False

%timeit findsub_tree(haystack, needles)
#=> 95.1 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Precomputing the tree saves about 80 ms of that time.
I don't know what you mean by the problem existing for single-string replacement, nor how your solution solves the problem. I'm talking about simulating multiple simultaneous replacements with the .replace method. .replace doesn't see previous replacements as candidates for replacement. However, using it to simulate multiple simultaneous replacements will fail if the new values form something that looks like the old values.

+1. I have the need for one or two of these in every project (of a certain size) and have to come up with solutions each time with the re module or a loop. Not a fan of regexes for trivial tasks, or for those that require a real parser.

04.02.18 00:04, Franklin? Lee wrote:
The name for these complicated str methods is regular expressions. To perform these operations efficiently, you need to convert the arguments into a special optimized form. This is what re.compile() does. If a compilation were done on every invocation of a str method, it would add too large an overhead and kill performance.

Even for a simple string search, a regular expression can be more efficient than a str method:

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop
$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop

On Feb 7, 2018 17:28, "Serhiy Storchaka" <storchaka@gmail.com> wrote:

> The name for these complicated str methods is regular expressions. To perform these operations efficiently, you need to convert the arguments into a special optimized form. This is what re.compile() does. If a compilation were done on every invocation of a str method, it would add too large an overhead and kill performance. Even for a simple string search, a regular expression can be more efficient than a str method.
>
> $ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
> 500000 loops, best of 5: 680 nsec per loop
> $ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
> 200000 loops, best of 5: 1.09 usec per loop

That's an odd result. Python regexes use backtracking, not a DFA. I gave a timing test earlier in the thread:
https://mail.python.org/pipermail/python-ideas/2018-February/048879.html
I compared using repeated .find()s against a precompiled regex, then against a pure-Python, unoptimized tree-based algorithm.

Could it be that re uses an optimization that can also be used in str? CPython uses a modified Boyer-Moore for str.find:
https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h
http://effbot.org/zone/stringlib.htm
Maybe there's a minimum length after which it's better to precompute a table.

In any case, once you have branches in the regex, which is necessary to emulate these features, it will start to slow down, because it has to travel down both branches in the worst case.

Easily fixed by installing one of the alternate regex libraries. re performance and its edge cases have been discussed endlessly. Please look it up before restarting that discussion.

08.02.18 12:45, Franklin? Lee wrote:
Yes, there is a special optimization in re here. It isn't free: you need to spend some time preparing it, and you need a special object that keeps an optimized representation for faster searching. This makes it very unlikely to be used in str, because you would either have to spend the compilation time on every search, or use some kind of caching, which is not free either, adds complexity, and increases memory consumption. Note also that in the case of re, the compiler is implemented in Python; this reduces the complexity. Patches that add optimizations for other common cases are welcome.
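A sketch of the caching trade-off described here (helper names are hypothetical; the stdlib re module keeps a similar internal cache keyed on the pattern string):

import functools
import re

@functools.lru_cache(maxsize=512)
def _compiled(needles):
    # Pay the compile cost once per distinct (hashable) set of needles.
    ordered = sorted(needles, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, ordered)))

def find_any(haystack, needles):
    return _compiled(frozenset(needles)).search(haystack) is not None

The cache itself costs memory and a hash of the needle set on every call, which is exactly the trade-off Serhiy points out.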

On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
I ran Serhiy's tests (3.5.2) and got different results.

# Setup:
__builtins__.basestring = str  # hack for the re2 import in py3
import re, re2, regex
n = 10000
s = "spa"*n + "m"
p = re.compile("spam")
pgex = regex.compile("spam")
p2 = re2.compile("spam")

# Tests:
%timeit s.find("spam")
%timeit p.search(s)
%timeit pgex.search(s)
%timeit p2.search(s)

n = 100:
350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000:
2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000:
17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest. On 3.6.1 (Windows), it's about the same speed as re: 638 ns vs 662 ns; 41.3 µs vs 43.8 µs.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect it's due to the wrapper; it may be copying the strings to a format suitable for the backend.

P.S.: I also tested `"spam" in s`, which was linearly slower than `.find`. However, `in` is consistently faster than `.find` in my 3.6, so the discrepancy has likely been fixed. More curious is that, on `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are twice as slow as my 3.5.2 for Ubuntu for Windows, but the re performance is similar. It's probably a compiler thing.

I don't think these should be str methods. They should be a separate class, so the internal tree can be reused. There are some Aho-Corasick implementations on PyPI. As far as I know, AC does longest match. On the other hand, Go's replacer (it's trie-based too) is documented as:

> Replacements are performed in order, without overlapping matches.

https://golang.org/pkg/strings/#NewReplacer
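A sketch of the compile-once, reuse-many-times idea (the class and names are illustrative, not a real PyPI API; the same re.sub trick as the dictsub sketch above, but the compiled pattern plays the role of the reusable internal tree, and it picks longest match rather than Go's declaration order):

import re

class Replacer:
    def __init__(self, mapping):
        self._map = dict(mapping)
        keys = sorted(self._map, key=len, reverse=True)  # longest match wins
        self._pat = re.compile('|'.join(map(re.escape, keys)))

    def replace(self, s):
        return self._pat.sub(lambda m: self._map[m.group(0)], s)

r = Replacer({'a': 'b', 'b': 'a'})
r.replace('ab')  # 'ba' -- simultaneous, unlike chained str.replace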
-- INADA Naoki <songofacandy@gmail.com>

On 03/02/2018 at 23:04, Franklin? Lee wrote:
I second that proposal. I regularly need those, and feel frustrated when it doesn't work. I even wrote a wrapper that does exactly this:
https://github.com/Tygs/ww/blob/master/src/ww/wrappers/strings.py
But because it's pure Python, it's guaranteed to be slow. Plus you need to install it every time you need it.

participants (11):
- Chris Angelico
- Franklin? Lee
- INADA Naoki
- Michel Desmoulin
- Mike Miller
- Neil Girdhar
- Serhiy Storchaka
- Steve Dower
- Steven D'Aprano
- Sven R. Kunze
- Terry Reedy