Re: [Python-ideas] regex module - proper implementation of alternation?
On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python@mrabarnett.plus.com> wrote:
On 16/07/2013 16:41, Eli Bendersky wrote:
On Tue, Jul 16, 2013 at 8:00 AM, MRAB <python@mrabarnett.plus.com <mailto:python@mrabarnett.**plus.com <python@mrabarnett.plus.com>>> wrote:
On 16/07/2013 04:10, Eli Bendersky wrote:
Since the 'regex' module is a candidate for inclusion into the stdlib, I figured this would be a good place to ask.
While discussing something related in pss (https://github.com/eliben/**pss__<https://github.com/eliben/pss__>), Ben Hoyt brought to my
attention that the implementation of alternation (foo|bar) in Python's default regex module (the SRE implementation) is very inefficient. And indeed, looking there it seems that | gets translated to an opcode that simply means going over all the alternatives in a loop trying to match each. This is not how a proper regex engine should implement alternation!
A common advice given to Python programmers is to combine their regexes into a single one with |. This makes code faster, but it turns out that it's far from its full potential because the combination doesn't go full way to the DFA as it's supposed to.
A question about 'regex' - is it implemented properly there?
There are 2 ways of implementing regex: DFA and NFA.
DFA is faster, but those using NFA do so because the implementation offers additional features that make DFA tricky or impossible, such as backreferences.
Of course, you could use DFA when it's possible, NFA when it isn't, at the cost of yet more code.
The regex module uses NFA, just like re.
If you want to improve regex, making it use DFA when possible, well, the source code is open, and your contributions are welcome. Good luck! :-)
You can use NFA without backtracking, though, by keeping track of the set of possible states. I believe (but am not 100% sure) this is the way re2 works, for example.
In the particular case of alternations, such approach is vastly superior because the "possible states" set never grows large (assuming the alternatives are not mostly the same). Whereas with backtracking you always have to iterate over all of them.
That said, I think you have answered my question - regex also uses a backtracking implementation of NFA and iterates in case of alternations. OK :-)
Have you tried timing them (re, re2, regex, and possibly others) to see
whether it's a problem in practice?
I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html - explains the problems with backtracking, but focuses on a different aspect (where backtracking leads to exponential behavior). The problem did, however, come up in practice in pss (see https://github.com/eliben/pss/issues/4); there was an attempt to make heavier use of regex alternations and the result ended up being much slower than one would expect. I had this experience in a different place as well (writing regex-based lexers). There is a performance improvement gained from moving from explicit Python-level looping to alternations, but it's a small gain - something that makes sense for just moving a loop from Python into C, but not for doing something actually smart with the alternation (like looking at each incoming character only once). [P.S. please use reply-all in this thread] Eli
On 16/07/2013 17:00, Eli Bendersky wrote:
On Tue, Jul 16, 2013 at 8:53 AM, MRAB <python@mrabarnett.plus.com <mailto:python@mrabarnett.plus.com>> wrote:
On 16/07/2013 16:41, Eli Bendersky wrote:
On Tue, Jul 16, 2013 at 8:00 AM, MRAB <python@mrabarnett.plus.com <mailto:python@mrabarnett.plus.com> <mailto:python@mrabarnett.__plus.com <mailto:python@mrabarnett.plus.com>>> wrote:
On 16/07/2013 04:10, Eli Bendersky wrote:
Since the 'regex' module is a candidate for inclusion into the stdlib, I figured this would be a good place to ask.
While discussing something related in pss (https://github.com/eliben/__pss__ <https://github.com/eliben/pss__>), Ben Hoyt brought to my
attention that the implementation of alternation (foo|bar) in Python's default regex module (the SRE implementation) is very inefficient. And indeed, looking there it seems that | gets translated to an opcode that simply means going over all the alternatives in a loop trying to match each. This is not how a proper regex engine should implement alternation!
A common advice given to Python programmers is to combine their regexes into a single one with |. This makes code faster, but it turns out that it's far from its full potential because the combination doesn't go full way to the DFA as it's supposed to.
A question about 'regex' - is it implemented properly there?
There are 2 ways of implementing regex: DFA and NFA.
DFA is faster, but those using NFA do so because the implementation offers additional features that make DFA tricky or impossible, such as backreferences.
Of course, you could use DFA when it's possible, NFA when it isn't, at the cost of yet more code.
The regex module uses NFA, just like re.
If you want to improve regex, making it use DFA when possible, well, the source code is open, and your contributions are welcome. Good luck! :-)
You can use NFA without backtracking, though, by keeping track of the set of possible states. I believe (but am not 100% sure) this is the way re2 works, for example.
In the particular case of alternations, such approach is vastly superior because the "possible states" set never grows large (assuming the alternatives are not mostly the same). Whereas with backtracking you always have to iterate over all of them.
That said, I think you have answered my question - regex also uses a backtracking implementation of NFA and iterates in case of alternations. OK :-)
Have you tried timing them (re, re2, regex, and possibly others) to see whether it's a problem in practice?
I have not; this page - http://swtch.com/~rsc/regexp/regexp1.html - explains the problems with backtracking, but focuses on a different aspect (where backtracking leads to exponential behavior).
The problem did, however, come up in practice in pss (see https://github.com/eliben/pss/issues/4); there was an attempt to make heavier use of regex alternations and the result ended up being much slower than one would expect. I had this experience in a different place as well (writing regex-based lexers).
There is a performance improvement gained from moving from explicit Python-level looping to alternations, but it's a small gain - something that makes sense for just moving a loop from Python into C, but not for doing something actually smart with the alternation (like looking at each incoming character only once).
I'd welcome any realistic speed tests that would help me improve the performance of regex.
participants (2)
-
Eli Bendersky -
MRAB