[New-bugs-announce] [issue30285] Optimize case-insensitive regular expressions

Serhiy Storchaka report at bugs.python.org
Fri May 5 13:02:59 EDT 2017


New submission from Serhiy Storchaka:

Matching and searching case-insensitive regular expressions is much slower than matching and searching case-sensitive regular expressions. Case-insensitivity requires converting every character in input string to lower case and disables some optimizations. But there are only 2669 cased characters (52 in ASCII mode). For all other characters in the pattern we can use case-sensitive matching.

Results of microbenchmarks:

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  5000 loops, best of 5: 47.1 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i) +x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  2000 loops, best of 5: 109 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 537 usec per loop
Patched:    5000 loops, best of 5: 84.2 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[ \t]+x'); s = ' '*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 608 usec per loop
Patched:    5000 loops, best of 5: 84.1 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 226 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)/x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 284 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 483 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[/%]x'); s = ' '*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 549 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

The patch also optimizes searching some complex case-sensitive patterns. This is a side effect, I'll provide more comprehensive optimization in other issue.

$ ./python -m timeit -s "import re; p = re.compile(r'([xy])'); s = ' '*10000+'x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 633 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'((x|y)z)'); s = ' '*10000+'xz'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 716 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

----------
assignee: serhiy.storchaka
components: Library (Lib), Regular Expressions
messages: 293127
nosy: ezio.melotti, mrabarnett, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Optimize case-insensitive regular expressions
type: performance
versions: Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue30285>
_______________________________________


More information about the New-bugs-announce mailing list