<div dir="auto"><div><div class="gmail_extra"><div class="gmail_quote">On Feb 7, 2018 17:28, "Serhiy Storchaka" <<a href="mailto:storchaka@gmail.com">storchaka@gmail.com</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">04.02.18 00:04, Franklin? Lee пише:<div class="quoted-text"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Let s be a str. I propose to allow these existing str methods to take params in new forms.<br>

<br>

s.replace(old, new):<br>

     Allow passing in a collection of olds.<br>

     Allow passing in a single argument, a mapping of olds to news.<br>

     Allow the olds in the mapping to be tuples of strings.<br>

<br>

s.split(sep), s.rsplit, s.partition:<br>

     Allow sep to be a collection of separators.<br>

<br>

s.startswith, s.endswith:<br>

     Allow argument to be a collection of strings.<br>

<br>

s.find, s.index, s.count, x in s:<br>

     Similar.<br>

     These methods are also in `list`, which can't distinguish between items, subsequences, and subsets. However, `str` is already inconsistent with `list` here: list.M looks for an item, while str.M looks for a subsequence.<br>

<br>

s.[r|l]strip:<br>

     Sadly, these functions already interpret their str arguments as collections of characters.<br>

</blockquote>

<br></div>

The name of complicated str methods is regular expressions. For doing these operations efficiently you need to convert arguments in special optimized form. This is what re.compile() does. If make a compilation on every invocation of a str method, this will add too large overhead and kill performance.<br>

<br>

Even for simple string search a regular expression can be more efficient than a str method.<br>

<br>

$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'<br>

500000 loops, best of 5: 680 nsec per loop<br>

<br>

$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'<br>

200000 loops, best of 5: 1.09 usec per loop</blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">That's an odd result. Python regexes use backtracking, not a DFA. I gave a timing test earlier in the thread:</div><div dir="auto"><a href="https://mail.python.org/pipermail/python-ideas/2018-February/048879.html">https://mail.python.org/pipermail/python-ideas/2018-February/048879.html</a><br></div><div dir="auto">I compared using repeated .find()s against a precompiled regex, then against a pure Python and unoptimized tree-based algorithm.</div><div dir="auto"><br></div><div dir="auto">Could it be that re uses an optimization that can also be used in str? CPython uses a modified Boyer-Moore for str.find:</div><div dir="auto"><a href="https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h">https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h</a></div><div dir="auto"><a href="http://effbot.org/zone/stringlib.htm">http://effbot.org/zone/stringlib.htm</a></div><div dir="auto">Maybe there's a minimum length after which it's better to precompute a table.</div><div dir="auto"><br></div><div dir="auto">In any case, once you have branches in the regex, which is necessary to emulate these features, it will start to slow down because it has to travel down both branches in the worst case.</div><div dir="auto"></div></div>