[New-bugs-announce] [issue29995] re.escape() escapes too much

Serhiy Storchaka report at bugs.python.org
Wed Apr 5 10:17:51 EDT 2017


New submission from Serhiy Storchaka:

re.escape() escapes all characters except ASCII letters, digits and '_'. This is excessive: it makes escaping and compiling slower and makes the pattern less human-readable. The characters "!\"%&\',/:;<=>@_`~", as well as non-ASCII characters, are always literal in a regular expression and don't need escaping.
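
As a quick check of the claim above, the following sketch compiles each of the listed characters unescaped and confirms that it matches itself and nothing else. The extra non-ASCII letters are sample values chosen for illustration only.

```python
import re

# Punctuation listed in the report, plus a few sample non-ASCII letters.
always_literal = "!\"%&',/:;<=>@_`~" + "єґї"

for ch in always_literal:
    # Used unescaped as a pattern, the character matches itself ...
    assert re.fullmatch(ch, ch) is not None
    # ... and does not match an unrelated character.
    assert re.fullmatch(ch, "x") is None

print("all", len(always_literal), "characters are literal when unescaped")
```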

The proposed patch makes re.escape() escape only the minimal set of characters that can have special meaning in regular expressions. This set consists of the special characters ".\\[]{}()*+?^$|", "-" (a range in a character set), "#" (starts a comment in verbose mode) and ASCII whitespace (ignored in verbose mode).
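
The minimal-escaping idea can be sketched roughly as below. This is not the actual patch: minimal_escape, _SPECIAL and _ESCAPE_MAP are hypothetical names, the helper handles str only (not bytes), and the character set is simply the one listed in the paragraph above.

```python
import re

# Hypothetical helper, NOT the actual patch: escape only the characters
# named in the report (specials, "-", "#" and ASCII whitespace).
_SPECIAL = ".\\[]{}()*+?^$|" + "-" + "#" + " \t\n\r\v\f"
_ESCAPE_MAP = {ord(ch): "\\" + ch for ch in _SPECIAL}


def minimal_escape(text: str) -> str:
    """Return text with only the regex-special characters backslash-escaped."""
    return text.translate(_ESCAPE_MAP)


sample = "user@example.com: 50% (a-b) #1\n"
pattern = minimal_escape(sample)
print(pattern)

# The escaped pattern matches the original text in normal and verbose mode.
assert re.fullmatch(pattern, sample)
assert re.fullmatch(pattern, sample, re.VERBOSE)

# Escaping "-" keeps it literal when the result is placed in a character set.
assert re.fullmatch("[" + minimal_escape("a-c") + "]", "-")
assert re.fullmatch("[" + minimal_escape("a-c") + "]", "b") is None
```

A single str.translate() call with a precomputed table is used here because it avoids a per-character Python-level loop; whether the patch takes the same approach is an assumption.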

The null character no longer needs special escaping.

The patch also makes re.escape() faster, even when it produces the same result as before.

$ ./python -m perf timeit -s 'from re import escape; s = "()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 42.2 us +- 0.8 us
Patched:    Median +- std dev: 11.4 us +- 0.1 us

$ ./python -m perf timeit -s 'from re import escape; s = b"()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 38.7 us +- 0.7 us
Patched:    Median +- std dev: 18.4 us +- 0.2 us

$ ./python -m perf timeit -s 'from re import escape; s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 40.3 us +- 0.5 us
Patched:    Median +- std dev: 33.1 us +- 0.6 us

$ ./python -m perf timeit -s 'from re import escape; s = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 54.4 us +- 0.7 us
Patched:    Median +- std dev: 40.6 us +- 0.5 us

$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 156 us +- 3 us
Patched:    Median +- std dev: 43.5 us +- 0.5 us

$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode()' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 200 us +- 4 us
Patched:    Median +- std dev: 77.0 us +- 0.6 us
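
For a rough comparison without the perf module, a sketch using only the standard library timeit is shown below. It does not reproduce perf's methodology (no warmup, no calibration, no median or standard deviation), and time_escape and samples are names made up for this example, so the absolute numbers are not comparable with those above.

```python
import timeit
from re import escape


def time_escape(s, number=2000, repeat=5):
    """Return the best observed time per escape(s) call, in microseconds."""
    timer = timeit.Timer(lambda: escape(s))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


# Inputs modelled on the benchmarks above (hypothetical labels).
samples = {
    "special": "()[]{}?*+-|^$\\.# \t\n\r\v\f",
    "ascii alnum": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "special (bytes)": b"()[]{}?*+-|^$\\.# \t\n\r\v\f",
}

for name, s in samples.items():
    print(f"{name}: {time_escape(s):.2f} us per call")
```

Run the same script under the unpatched and the patched interpreter to compare.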

Compiling the escaped string is also faster:

$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"; p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched:  Median +- std dev: 1.96 ms +- 0.02 ms
Patched:    Median +- std dev: 1.16 ms +- 0.02 ms

$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode(); p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched:  Median +- std dev: 3.69 ms +- 0.04 ms
Patched:    Median +- std dev: 2.13 ms +- 0.03 ms

----------
components: Library (Lib), Regular Expressions
messages: 291177
nosy: ezio.melotti, mrabarnett, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: re.escape() escapes too much
type: enhancement
versions: Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue29995>
_______________________________________

