[Python-ideas] hybrid regex engine: backtracking + Thompson NFA

Sun Nov 25 12:59:23 EST 2018

On 2018-11-25 16:52, Celelibi wrote:
> Hello,
> 
> I found this topic discussed on the python-dev ML back in 2010 [1].
> I'm bringing it up 8 years later with a variation.
> 
> In short: The article [2] highlight that backtracking-based regex
> engine (like SRE in python) have pathological cases that run in an
> exponential time of the input, while they could run in a linear time.
> Not mentioned by the article is that even quadratic time can be a
> problem with large inputs. Which happen pretty often when you're
> looking for delimited stuff like this :
> re.match(r'.*?,(.*),,', "a,"*10000)
> 
> Of course, there's a catch. Backreferences most-likely cannot be
> implemented in a guaranteed linear time, and some cases of look-behind
> assertions might prove difficult as well. But other features like
> alternatives (foo|bar) and repetitions (.*) are no problem.
> 
> The general idea of the Thompson's algorithm is that the simulation of
> the NFA is basically in all the reachable states at the same time
> while parsing the input string character by character. Of course, for
> regex engines that use a VM (like Python's SRE) you can also make the
> execution of the equivalent byte-code be in several states at once.
> 
> The 2010 discussion seems to be about having two separate engines and
> selecting the best one for a given regex. What I'm proposing here is
> to resort to backtracking only for the regex features that need it.
> Meaning that within a regex like r'(.*),.* .*,\1' the evaluation of
> the middle part would use the Thompson's algorithm, while the \1 could
> trigger the backtracking mechanism if the string doesn't match.
> 
> What do you think about it?
> Has this already been discussed and rejected?
> Is it just a matter of showing the code? (_sre.c seems... non-trivial)
> Has this already been juged not worth the maintainance effort?
> 
> AFAIK, there's not many hybrid regex engines out there. But one
> notable implementation is the third version of the Henry Spencer's
> regex engine [3]. Which he doesn't seem to have documented publicly,
> but postgres has done a pretty good job at reverse-engineering a high
> level view of it [4]. I'm still unsure how the backtracking of some
> parts of the regex interact with the multi-state evaluation of the
> other parts. But at least it exists and works, so it is feasible.
> 
> 
> Best regards,
> Celelibi
> 
> [1] https://mail.python.org/pipermail/python-dev/2010-March/098354.html
> [2] https://swtch.com/~rsc/regexp/regexp1.html
> [3] https://github.com/garyhouston/hsrex
> [4] https://github.com/postgres/postgres/tree/master/src/backend/regex
> 
This is open source. Nothing gets done unless someone decides to do it. 
So, yes, it's (just?) a matter of showing the code, and, if you want it 
in the re module, to persuade the core devs that it's worth doing and 
that you're willing to maintain it and fix the bugs.