Re: [Python-ideas] Complicate str methods

8 Feb 2018

On Thu, Feb 8, 2018 at 5:45 AM, Franklin? Lee wrote:
...
On Feb 7, 2018 17:28, "Serhiy Storchaka"  wrote:
...
Complicated str methods already have a name: regular expressions. To do
these operations efficiently you need to convert the arguments into a
special optimized form. This is what re.compile() does. Compiling on every
invocation of a str method would add too much overhead and kill
performance.
Even for a simple string search, a regular expression can be more efficient
than a str method.
$ ./python -m timeit -s 'import re; p = re.compile("spam"); s = "spa"*100+"m"' -- 'p.search(s)'
500000 loops, best of 5: 680 nsec per loop
$ ./python -m timeit -s 's = "spa"*100+"m"' -- 's.find("spam")'
200000 loops, best of 5: 1.09 usec per loop

I ran Serhiy's tests (on 3.5.2) and got different results.

    # Setup (IPython session):
    import builtins
    builtins.basestring = str  # hack so re2 imports on Python 3
    import re, re2, regex

    n = 10000  # also run with n = 100 and n = 1000
    s = "spa"*n+"m"
    p = re.compile("spam")
    pgex = regex.compile("spam")
    p2 = re2.compile("spam")

    # Tests:
    %timeit s.find("spam")
    %timeit p.search(s)
    %timeit pgex.search(s)
    %timeit p2.search(s)

Results, one line per test in the order above (str.find, re, regex, re2):

n = 100
350 ns ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
554 ns ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
633 ns ± 8.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.62 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

n = 1000
2.17 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.57 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.46 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.8 µs ± 72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

n = 10000
17.3 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
33.5 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31.7 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.5 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusions:
- `.find` is fastest. On 3.6.1 (Windows), it's about the same speed as
re: 638 ns vs 662 ns; 41.3 µs vs 43.8 µs.
- re and regex have similar performance, probably due to a similar backend.
- re2 is slowest. I suspect it's due to the wrapper. It may be copying
the strings to a format suitable for the backend.
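
For anyone without IPython, here is a rough sketch of the str.find vs re
comparison using only the stdlib timeit module. The names make_subject and
bench are mine, not from the tests above, and the lambdas add some call
overhead to both measurements, so expect the absolute numbers to differ from
%timeit. regex and re2 are left out because they are third-party.

```python
import re
import timeit

PATTERN = re.compile("spam")


def make_subject(n):
    # Same input shape as above: n copies of "spa" followed by "m".
    return "spa" * n + "m"


def bench(n, number=1000, repeat=5):
    """Return the best per-call time in seconds for each search method."""
    s = make_subject(n)
    timings = {}
    for label, func in (
        ("str.find", lambda: s.find("spam")),
        ("re.search", lambda: PATTERN.search(s)),
    ):
        # Take the minimum over the repeats, as the timeit docs recommend.
        best = min(timeit.repeat(func, number=number, repeat=repeat))
        timings[label] = best / number
    return timings


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        for label, seconds in sorted(bench(n).items()):
            print("n={:>6} {:<10} {:8.2f} µs per call".format(
                n, label, seconds * 1e6))
```

Both methods should report the same match position, 3*(n-1), so the timings
compare like with like.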

P.S.: I also tested `"spam" in s`, which was linearly slower than
`.find`. However, `in` is consistently faster than `.find` in my 3.6,
so the discrepancy has likely been fixed.
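
If you want to check the `in` vs `.find` difference on your own build, a
minimal sketch follows. compare_in_and_find is a hypothetical helper name, and
it passes statement strings to timeit so neither side pays for a Python-level
function call.

```python
import timeit


def compare_in_and_find(n, number=1000, repeat=5):
    """Return the best per-call time in seconds for `in` and `.find`."""
    namespace = {"s": "spa" * n + "m"}
    results = {}
    for label, stmt in (("in", '"spam" in s'), ("find", 's.find("spam")')):
        # globals= (available since Python 3.5) gives the statement access to s.
        best = min(timeit.repeat(stmt, globals=namespace,
                                 number=number, repeat=repeat))
        results[label] = best / number
    return results


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        r = compare_in_and_find(n)
        print("n={:>6} in: {:.2f} µs  find: {:.2f} µs".format(
            n, r["in"] * 1e6, r["find"] * 1e6))
```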

More curiously, for `.find`, my MSVC-compiled 3.6.1 and 3.5.2 are
twice as slow as my 3.5.2 on Ubuntu for Windows, while the re
performance is similar. It's probably a compiler difference.