[Python-ideas] regex module - proper implementation of alternation?

Tue Jul 16 19:39:54 CEST 2013

On 16/07/2013 17:00, Eli Bendersky wrote:
>
> On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python at mrabarnett.plus.com
> <mailto:python at mrabarnett.plus.com>> wrote:
>
>     On 16/07/2013 16:41, Eli Bendersky wrote:
>
>         On Tue, Jul 16, 2013 at 8:00 AM, MRAB
>         <python at mrabarnett.plus.com <mailto:python at mrabarnett.plus.com>
>         <mailto:python at mrabarnett.__plus.com
>         <mailto:python at mrabarnett.plus.com>>> wrote:
>
>              On 16/07/2013 04:10, Eli Bendersky wrote:
>
>                  Since the 'regex' module is a candidate for inclusion
>         into the
>                  stdlib, I
>                  figured this would be a good place to ask.
>
>                  While discussing something related in pss
>                  (https://github.com/eliben/__pss__
>         <https://github.com/eliben/pss__>), Ben Hoyt brought to my
>
>                  attention that
>                  the implementation of alternation (foo|bar) in Python's
>         default
>                  regex
>                  module (the SRE implementation) is very inefficient.
>         And indeed,
>                  looking
>                  there it seems that | gets translated to an opcode that
>         simply means
>                  going over all the alternatives in a loop trying to
>         match each.
>                  This is
>                  not how a proper regex engine should implement alternation!
>
>                  A common advice given to Python programmers is to
>         combine their
>                  regexes
>                  into a single one with |. This makes code faster, but
>         it turns
>                  out that
>                  it's far from its full potential because the
>         combination doesn't
>                  go full
>                  way to the DFA as it's supposed to.
>
>                  A question about 'regex' - is it implemented properly
>         there?
>
>              There are 2 ways of implementing regex: DFA and NFA.
>
>              DFA is faster, but those using NFA do so because the
>         implementation
>              offers additional features that make DFA tricky or
>         impossible, such as
>              backreferences.
>
>              Of course, you could use DFA when it's possible, NFA when
>         it isn't, at
>              the cost of yet more code.
>
>              The regex module uses NFA, just like re.
>
>              If you want to improve regex, making it use DFA when
>         possible, well,
>              the source code is open, and your contributions are
>         welcome. Good luck!
>              :-)
>
>
>         You can use NFA without backtracking, though, by keeping track
>         of the
>         set of possible states. I believe (but am not 100% sure) this is
>         the way
>         re2 works, for example.
>
>         In the particular case of alternations, such approach is vastly
>         superior
>         because the "possible states" set never grows large (assuming the
>         alternatives are not mostly the same). Whereas with backtracking you
>         always have to iterate over all of them.
>
>         That said, I think you have answered my question - regex also uses a
>         backtracking implementation of NFA and iterates in case of
>         alternations.
>         OK :-)
>
>     Have you tried timing them (re, re2, regex, and possibly others) to
>     see whether
>     it's a problem in practice?
>
>
> I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html -
> explains the problems with backtracking, but focuses on a different
> aspect (where backtracking leads to exponential behavior).
>
> The problem did, however, come up in practice in pss (see
> https://github.com/eliben/pss/issues/4); there was an attempt to make
> heavier use of regex alternations and the result ended up being much
> slower than one would expect. I had this experience in a different place
> as well (writing regex-based lexers).
>
> There is a performance improvement gained from moving from explicit
> Python-level looping to alternations, but it's a small gain - something
> that makes sense for just moving a loop from Python into C, but not for
> doing something actually smart with the alternation (like looking at
> each incoming character only once).
>
I'd welcome any realistic speed tests that would help me improve the
performance of regex.