[Python-ideas] regex module - proper implementation of alternation?
Eli Bendersky
eliben at gmail.com
Tue Jul 16 18:00:34 CEST 2013
On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python at mrabarnett.plus.com> wrote:
> On 16/07/2013 16:41, Eli Bendersky wrote:
>
>>
>>
>>
>> On Tue, Jul 16, 2013 at 8:00 AM, MRAB <python at mrabarnett.plus.com
>> <mailto:python at mrabarnett.**plus.com <python at mrabarnett.plus.com>>>
>> wrote:
>>
>> On 16/07/2013 04:10, Eli Bendersky wrote:
>>
>> Since the 'regex' module is a candidate for inclusion into the
>> stdlib, I
>> figured this would be a good place to ask.
>>
>> While discussing something related in pss
>> (https://github.com/eliben/**pss__<https://github.com/eliben/pss__>),
>> Ben Hoyt brought to my
>>
>> attention that
>> the implementation of alternation (foo|bar) in Python's default
>> regex
>> module (the SRE implementation) is very inefficient. And indeed,
>> looking
>> there it seems that | gets translated to an opcode that simply
>> means
>> going over all the alternatives in a loop trying to match each.
>> This is
>> not how a proper regex engine should implement alternation!
>>
>> A common advice given to Python programmers is to combine their
>> regexes
>> into a single one with |. This makes code faster, but it turns
>> out that
>> it's far from its full potential because the combination doesn't
>> go full
>> way to the DFA as it's supposed to.
>>
>> A question about 'regex' - is it implemented properly there?
>>
>> There are 2 ways of implementing regex: DFA and NFA.
>>
>> DFA is faster, but those using NFA do so because the implementation
>> offers additional features that make DFA tricky or impossible, such as
>> backreferences.
>>
>> Of course, you could use DFA when it's possible, NFA when it isn't, at
>> the cost of yet more code.
>>
>> The regex module uses NFA, just like re.
>>
>> If you want to improve regex, making it use DFA when possible, well,
>> the source code is open, and your contributions are welcome. Good
>> luck!
>> :-)
>>
>>
>> You can use NFA without backtracking, though, by keeping track of the
>> set of possible states. I believe (but am not 100% sure) this is the way
>> re2 works, for example.
>>
>> In the particular case of alternations, such approach is vastly superior
>> because the "possible states" set never grows large (assuming the
>> alternatives are not mostly the same). Whereas with backtracking you
>> always have to iterate over all of them.
>>
>> That said, I think you have answered my question - regex also uses a
>> backtracking implementation of NFA and iterates in case of alternations.
>> OK :-)
>>
>> Have you tried timing them (re, re2, regex, and possibly others) to see
> whether
> it's a problem in practice?
>
I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html -
explains the problems with backtracking, but focuses on a different aspect
(where backtracking leads to exponential behavior).
The problem did, however, come up in practice in pss (see
https://github.com/eliben/pss/issues/4); there was an attempt to make
heavier use of regex alternations and the result ended up being much slower
than one would expect. I had this experience in a different place as well
(writing regex-based lexers).
There is a performance improvement gained from moving from explicit
Python-level looping to alternations, but it's a small gain - something
that makes sense for just moving a loop from Python into C, but not for
doing something actually smart with the alternation (like looking at each
incoming character only once).
[P.S. please use reply-all in this thread]
Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20130716/8900aacc/attachment-0001.html>
More information about the Python-ideas
mailing list