[Python-Dev] More work on SRE

Sat, 21 Jun 2003 22:52:16 -0300

There's a new version of the patch

    http://www.python.org/sf/757624

I understood there was some problems with the first implementation,
and changed some aspects of how the new implementation works to fix them.

This new implementation is more complex than the other implementations
(mine, and the original), given the many details it covers, but is
probably faster than these implementations in all situations (tests
very welcome). Additionally, it introduces new features like infinite
loop protection (*matching* the REs, not preventing their use).

Here is a list of valid REs not accepted by the original implementation,
and gracefully working on the new implementation (these used to blow up
SRE with an infinite loop, not because the implementation was recursive).

    re.match("(a?)+ab", "ab")
    re.match("^(a|b?)+a$", "a")
    re.match("(a*)+?ac", "ab")
    re.match("((a)*)+?ac", "ab")
    re.match("(?=a)*", "a")

Here is a list of unusual REs I've built while I was understanding how
complex the operations could be. All of them work.

    re.search('(.*?)(?<!-):', 'bc-:')
    re.match("(.+)+B", "AB")
    re.match("^(a+){2}b$", "aaaaab").groups()
    re.match("^(a+)+a$", "aaaaaa").groups()
    re.match("^((a|b?)a)+a$", "aa").groups()

The following RE is a good example of how *not to* build a regular
expression. The current implementation hangs for a *long* time
while executing it (more than many minutes.. I wasn't able to wait).
The new implementation takes as long as 2 seconds. Notice that
even the new implementation hangs (exponentially) with larger
strings.

    re.match("(a+?)+?ac", "a"*1000+"b")

The following SF issues were fixed, without being specifically
addressed (IOW, they just work :-):

    * 695688
    * 753711
    * 725149

As a matter of giving credits, I've included part of the lastmark
patch by Greg Chapman (712900).

Now, I understand that this is complex stuff, thus any kind of
help is very welcome. Here is how you can help:

    * provide benchmarks, good or bad, or provide
      "ready-to-benchmark" code;
    * provide information about tests you've done, good or bad,
      or provide "ready-to-test" code;

Also, I don't know about any bugs in the SRE engine which are not
fixed in the new implementation. If you have any non-working code
or some SF bug I'm not aware of, please let me know.

Ahhh.. now I'll relax a little bit. :-)

-- 
Gustavo Niemeyer
http://niemeyer.net