[Python-Dev] (no subject)

Franklin? Lee leewangzhong+python at gmail.com
Wed Dec 27 01:55:58 EST 2017


On Tue, Dec 26, 2017 at 2:01 AM, Yogev Hendel <yogev at intsights.com> wrote:
>
> I don't know if this is the right place to put this,
> but I've found the following lines of code results in an incredibly long
> processing time.
> Perhaps it might be of use to someone.
>
> import re
> pat = re.compile('^/?(?:\\w+)/(?:[%\\w-]+/?)+/?$')
> pat.match('/t/a-aa-aa-aaaaa-aa-aa-aa-aa-aa-aa./')

(I think the correct place is python-list. python-dev is primarily for
the developers of Python itself. python-ideas is for proposing new
features and changes to the language. python-list is for general
discussion. Bug reports and feature requests belong in
https://bugs.python.org/ (where your post could also have gone).)

The textbook regular expression algorithm (which I believe grep uses)
runs in linear time with respect to the text length. The algorithm
used by Perl, Java, Python, JavaScript, Ruby, and many other languages
instead use a backtracking algorithm, which can run up to exponential
time with respect to text length. This worst-case is in fact necessary
(assuming P != NP): Perl allows (introduced?) backreferences, which
are NP-hard[1]. Perl also added some other features which complicate
things, but backreferences are enough.

The user-level solution is to understand how regexes are executed, and
to work around it.

Here are library-level solutions for your example:
1. Perl now has a regex optimizer, which will eliminate some
redundancies. Something similar can be added to Python, at first as a
third-party library.
2. In theory, we can use the textbook algorithm when possible, and the
backtracking algorithm when necessary. However, the textbook version
won't necessarily be faster, and may take more time to create, so
there's a tradeoff here.
3. To go even further, I believe it's possible to use the textbook
algorithm for subexpressions, while the overall expression uses
backtracking, internally iterating through the matches of the textbook
algorithm.

There's a series of articles by Russ Cox that try to get us back to
the textbook (see [2]). He and others implemented the ideas in the C++
library RE2[3], which has Python bindings[4]. RE2 was made for and
used on Google Code Search[5] (described in his articles), a (now
discontinued) search engine for open-source repos which allowed
regular expressions in the queries.

You can get a whiff of the limitations of the textbook algorithm by
checking out RE2's syntax[6] and seeing what features aren't
supported, though some features may be unsupported for different
reasons (such as being redundant syntax).
- Backreferences and lookaround assertions don't have a known solution.[7]
- Bounded repetition is only supported up to a limit (1000), because
each possible repetition needs its own set of states.
- Possessive quantifiers aren't supported. Greedy and reluctant quantifiers are.
- Groups and named groups _are_ supported. See the second and third
Russ Cox articles, with the term "submatch".[2]

(Apologies: I am making up reference syntax on-the-fly.)
[1] "Perl Regular Expression Matching is NP-Hard"
    https://perl.plover.com/NPC/
[2] "Regular Expression Matching Can Be Simple And Fast"
        https://swtch.com/~rsc/regexp/regexp1.html
    "Regular Expression Matching: the Virtual Machine Approach"
        https://swtch.com/~rsc/regexp/regexp2.html
    "Regular Expression Matching in the Wild"
        https://swtch.com/~rsc/regexp/regexp3.html
    "Regular Expression Matching with a Trigram Index"
        https://swtch.com/~rsc/regexp/regexp4.html
[3] RE2: https://github.com/google/re2
[4] pyre2: https://github.com/facebook/pyre2/
    Also see re2 and re3 on PyPI, which intend to be a drop-in
replacement. re3 is a Py3-compatible fork of re2, which last updated
in 2015.
[5] https://en.wikipedia.org/wiki/Google_Code_Search
[6] https://github.com/google/re2/wiki/Syntax
[7] Quote: "As a matter of principle, RE2 does not support constructs
for which only backtracking solutions are known to exist. Thus,
backreferences and look-around assertions are not supported."
    https://github.com/google/re2/wiki/WhyRE2


More information about the Python-Dev mailing list