[Python-ideas] regex module - proper implementation of alternation?

Tue Jul 16 18:00:34 CEST 2013

On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python at mrabarnett.plus.com> wrote:

> On 16/07/2013 16:41, Eli Bendersky wrote:
>
>>
>>
>>
>> On Tue, Jul 16, 2013 at 8:00 AM, MRAB <python at mrabarnett.plus.com
>> <mailto:python at mrabarnett.**plus.com <python at mrabarnett.plus.com>>>
>> wrote:
>>
>>     On 16/07/2013 04:10, Eli Bendersky wrote:
>>
>>         Since the 'regex' module is a candidate for inclusion into the
>>         stdlib, I
>>         figured this would be a good place to ask.
>>
>>         While discussing something related in pss
>>         (https://github.com/eliben/**pss__<https://github.com/eliben/pss__>),
>> Ben Hoyt brought to my
>>
>>         attention that
>>         the implementation of alternation (foo|bar) in Python's default
>>         regex
>>         module (the SRE implementation) is very inefficient. And indeed,
>>         looking
>>         there it seems that | gets translated to an opcode that simply
>> means
>>         going over all the alternatives in a loop trying to match each.
>>         This is
>>         not how a proper regex engine should implement alternation!
>>
>>         A common advice given to Python programmers is to combine their
>>         regexes
>>         into a single one with |. This makes code faster, but it turns
>>         out that
>>         it's far from its full potential because the combination doesn't
>>         go full
>>         way to the DFA as it's supposed to.
>>
>>         A question about 'regex' - is it implemented properly there?
>>
>>     There are 2 ways of implementing regex: DFA and NFA.
>>
>>     DFA is faster, but those using NFA do so because the implementation
>>     offers additional features that make DFA tricky or impossible, such as
>>     backreferences.
>>
>>     Of course, you could use DFA when it's possible, NFA when it isn't, at
>>     the cost of yet more code.
>>
>>     The regex module uses NFA, just like re.
>>
>>     If you want to improve regex, making it use DFA when possible, well,
>>     the source code is open, and your contributions are welcome. Good
>> luck!
>>     :-)
>>
>>
>> You can use NFA without backtracking, though, by keeping track of the
>> set of possible states. I believe (but am not 100% sure) this is the way
>> re2 works, for example.
>>
>> In the particular case of alternations, such approach is vastly superior
>> because the "possible states" set never grows large (assuming the
>> alternatives are not mostly the same). Whereas with backtracking you
>> always have to iterate over all of them.
>>
>> That said, I think you have answered my question - regex also uses a
>> backtracking implementation of NFA and iterates in case of alternations.
>> OK :-)
>>
>>  Have you tried timing them (re, re2, regex, and possibly others) to see
> whether
> it's a problem in practice?
>

I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html -
explains the problems with backtracking, but focuses on a different aspect
(where backtracking leads to exponential behavior).

The problem did, however, come up in practice in pss (see
https://github.com/eliben/pss/issues/4); there was an attempt to make
heavier use of regex alternations and the result ended up being much slower
than one would expect. I had this experience in a different place as well
(writing regex-based lexers).

There is a performance improvement gained from moving from explicit
Python-level looping to alternations, but it's a small gain - something
that makes sense for just moving a loop from Python into C, but not for
doing something actually smart with the alternation (like looking at each
incoming character only once).

[P.S. please use reply-all in this thread]

Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20130716/8900aacc/attachment-0001.html>