[Python-ideas] Complicate str methods

Franklin? Lee leewangzhong+python at gmail.com
Sat Feb 3 22:36:44 EST 2018

On Sat, Feb 3, 2018 at 6:43 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 2/3/2018 5:04 PM, Franklin? Lee wrote:
>> Let s be a str. I propose to allow these existing str methods to take
>> params in new forms.
> Thanks for the honest title.  As you sort of indicate, these can all be done
> with re module.  However, you imply loops are needed besides, which is
> mostly not true.  Your complications mostly translate to existing calls and
> hence are not needed.

My point wasn't that they _needed_ loops, but that they are simple
enough to write (correctly) using loops, while simultaneous multiple
.replace isn't.
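For example, a loop-based version of .find over a collection of
needles is short and hard to get wrong (find_any is a hypothetical
name for illustration, not a proposed spelling):

```python
def find_any(s, needles):
    """Return the lowest index in s where any needle occurs, else -1."""
    hits = [i for i in (s.find(n) for n in needles) if i != -1]
    return min(hits, default=-1)

print(find_any('hello world', ['world', 'lo']))  # -> 3
```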

In any case, due to the backtracking algorithm used by re, a real
specialized implementation can be much more efficient for the
functions I listed, and will add value.

> Perhaps 'Regular Expression HOWTO' could use more examples, or even a
> section on generalizing string methinds.  Perhaps the string method doc
> needs suggestion to use re for multiple string args and references to the re
> howto.  Please consider taking a look at both.

Nah, I'm fine. I can take the time to write inefficient simulations,
or find them online. I proposed this not because I myself need it in
the language, but because I think it's useful for many, yet easy to
get wrong. It's the same with `sort`.

In my design notes for my own toolkit, I also allow regular
expressions as "olds". I didn't add that feature here, since it'd
cause additional complications.

>> s.replace(old, new):
>>      Allow passing in a collection of olds.
> >>> import re
> >>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
> 'user? user'
> Remembering the name change is a nuisance.

Three issues:
1. As Chris Angelico pointed out, the input strings need to be escaped.
2. The input strings should also be reverse sorted by length. Longer
strings should be higher priority than shorter strings, or else
'foobar' will never outbid 'foo'.
3. The replacement string needs its backslashes escaped (which is
probably why the parameter has a different name), using
repl.replace('\\', r'\\'). As the docs note, using re.escape here is
wrong (at least before 3.7).

Like I said, easy to get wrong.
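For reference, here is one way to get all three right at once
(multi_replace is a hypothetical helper sketched for this thread, not
an existing API):

```python
import re

def multi_replace(s, replacements):
    """Replace all keys of `replacements` in `s` simultaneously.

    Keys are escaped, and longer keys are tried first so that
    'foobar' can outbid 'foo'. Using a function as the repl avoids
    having to escape the new values by hand.
    """
    keys = sorted(replacements, key=len, reverse=True)
    pattern = '|'.join(map(re.escape, keys))
    return re.sub(pattern, lambda m: replacements[m.group()], s)

print(multi_replace('spam and eggs', {'spam': 'eggs', 'eggs': 'spam'}))
# -> 'eggs and spam'
```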

>>      Allow passing in a single argument, a mapping of olds to news.
> This needs to be a separate function, say 'dictsub', that joins the keys
> with '|' and calls re.sub with a function that does the lookup as the 2nd
> parameter.  This would be a nice example for the howto.
> As you noted, this is generalization of str.translate, and might be proposed
> as a new re module function.

If this function is added to re, it should also allow patterns as
keys, and we may also want a corresponding str method.
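A sketch of what such a 'dictsub' might look like with patterns as
keys, wrapping each key in a group so we can recover which pattern
matched (the name and behavior are assumptions, not an existing API;
keys containing their own groups would shift the indices):

```python
import re

def dictsub(s, table):
    """Substitute each pattern key of `table` with its value, in one pass."""
    # Wrap each pattern key in a group so m.lastindex tells us which matched.
    pattern = '|'.join('(%s)' % k for k in table)
    repls = list(table.values())
    return re.sub(pattern, lambda m: repls[m.lastindex - 1], s)

print(dictsub('a1 b22', {r'\d+': 'N', r'[ab]': 'L'}))  # -> 'LN LN'
```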

>> These new forms can be optimized internally, as a search for multiple
>> candidate substrings can be more efficient than searching for one at a time.
> This is what re does with 's1|s2|...|sn' patterns.

Really? re generally uses backtracking, not a DFA.

Timing indicates that it is NOT using an efficient algorithm.

    import re

    def findsub(haystack, needles, map=map, any=any, plusone=1 .__add__):
        # .find returns -1 on failure; adding 1 makes that falsy.
        return any(map(plusone, map(haystack.find, needles)))

    def findsub_re(haystack, needles, bool=bool, map=map,
                   find=re.search, esc=re.escape):
        pattern = '|'.join(map(esc, needles))
        return bool(find(pattern, haystack))

    def findsub_re_cached(haystack, regex, bool=bool):
        return bool(regex.search(haystack))

    n = 1000
    haystack = 'a'*n + 'b'
    needles = ['a'*i + 'bb' for i in range(1, n)]
    needles = sorted(needles, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, needles)))

    %timeit findsub(haystack, needles)
    #=> 1.81 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit findsub_re(haystack, needles)
    #=> 738 ms ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit findsub_re_cached(haystack, regex)
    #=> 680 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here, the naive approach using basic string operations is over 300x
faster than the re approach.

Even a simple pure Python tree approach is faster.

    END = ''
    # Tree-based multiple string searches.
    def make_tree(needles):
        """Create a tree such that each node maps characters to
        child nodes. The tree is not optimized.
        """
        root = {}
        for needle in needles:
            cur = root
            for c in needle:
                try:
                    cur = cur[c]
                except KeyError:
                    # Create the child node, then step into it.
                    cur[c] = cur = {}
            cur[END] = needle
        return root

    def findsub_tree(haystack, needles):
        tree = make_tree(needles)
        nodes = [tree]
        for c in haystack:
            nodes = [n[c] for n in nodes if c in n]
            nodes.append(tree)  # allow a match to start at every position
            if any(END in n for n in nodes):
                return True
        return False

    %timeit findsub_tree(haystack, needles)
    #=> 95.1 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Precomputing the tree saves about 80 ms of that time.

>> It is harder to make multiple simultaneous replacements using one .replace
>> at a time, because previous replacements can form new things that look like
>> replaceables.
> This problem exists for single string replacement also.  The standard
> solution is to not backtrack and not do overlapping replacements.

I don't know what you mean by the problem existing for single-string
replacement, nor how your solution solves the problem. I'm talking
about simulating multiple simultaneous replacements with the .replace
method, one at a time.
.replace doesn't see previous replacements as candidates for
replacement. However, using it to simulate multiple simultaneous
replacements will fail if the new values form something that looks
like the old values.
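A two-character example of that failure mode (the intended result is
a simultaneous swap):

```python
s = 'ab'
# Intended simultaneous swap: 'a' -> 'b' and 'b' -> 'a', i.e. 'ba'.
result = s.replace('a', 'b').replace('b', 'a')
print(result)  # -> 'aa': the first replacement's output got re-replaced
```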
