There are 2 ways of implementing regex: DFA and NFA.On 16/07/2013 04:10, Eli Bendersky wrote:
Since the 'regex' module is a candidate for inclusion into the stdlib, I
figured this would be a good place to ask.
While discussing something related in pss
(https://github.com/eliben/pss), Ben Hoyt brought to my attention that
the implementation of alternation (foo|bar) in Python's default regex
module (the SRE implementation) is very inefficient. And indeed, looking
there it seems that | gets translated to an opcode that simply means
going over all the alternatives in a loop trying to match each. This is
not how a proper regex engine should implement alternation!
A common advice given to Python programmers is to combine their regexes
into a single one with |. This makes code faster, but it turns out that
it's far from its full potential because the combination doesn't go full
way to the DFA as it's supposed to.
A question about 'regex' - is it implemented properly there?
DFA is faster, but those using NFA do so because the implementation
offers additional features that make DFA tricky or impossible, such as
backreferences.
Of course, you could use DFA when it's possible, NFA when it isn't, at
the cost of yet more code.
The regex module uses NFA, just like re.
If you want to improve regex, making it use DFA when possible, well,
the source code is open, and your contributions are welcome. Good luck!
:-)