[issue35932] Interpreter gets stuck while applying a regex pattern

Tim Peters report at bugs.python.org
Thu Feb 7 21:17:30 EST 2019


Tim Peters <tim at python.org> added the comment:

Without re.IGNORECASE, the leading

    ^[_a-z0-9-]+

can't even match the first character (`val` starts with uppercase Z), so it fails instantly.

With re.IGNORECASE, it's not "stuck", but is taking a verrrrrry long time to try an enormous number of (ultimately doomed) possibilities due to the way the regexp is written.

This is due to using nested quantifiers for no apparent reason.  For any character class C,

    (C+)*

matches the same set of strings as

    C*

but the former way can _try_ to match in an exponential (in the length of the string) number of ways.  So replace

    ([\.'_a-z0-9-]+)*
and
    ([\.a-z0-9-]+)*

with

    [\.'_a-z0-9-]*
and
    [\.a-z0-9-]*

and it fails to match `val` quickly (even with re.IGNORECASE).

For more on this (which applies to many regexp implementations, not just Python's), here's a start:

https://www.mathworks.com/matlabcentral/answers/95953-why-can-nested-quantifiers-in-regexp-can-cause-inefficient-failures-in-matlab-6-5-r13

The "Mastering Regular Expressions" book referenced in that answer is an excellent book-length treatment of this (and related) topic(s).

----------
nosy: +tim.peters
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35932>
_______________________________________


More information about the Python-bugs-list mailing list