interesting article on regex performance

http://code.google.com/p/re2/ On 3/11/10 8:52 PM, "Neal Becker" <ndbecker2@gmail.com> wrote:
http://swtch.com/~rsc/regexp/regexp1.html
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/peter.a.portante%40gmail.c...

Peter Portante wrote:
On 3/11/10 8:52 PM, "Neal Becker" <ndbecker2@gmail.com> wrote:
Both interesting links. I'll add another one to the list: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-reg... To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support. There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support). I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a "See Also" in the re module documentation to point people towards the more limited (but more robust) regex engine implementation. The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution) Only once such a hybrid library had been proven in practice could the default re module be considered for possible update. Basically, I see some interesting ideas there, but I don't see anything that will impact the Python core or standard library in the short term. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nick Coghlan wrote:
Peter Portante wrote:
On 3/11/10 8:52 PM, "Neal Becker" <ndbecker2@gmail.com> wrote:
Both interesting links. I'll add another one to the list: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-reg...
To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support.
There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support).
I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI. Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a "See Also" in the re module documentation to point people towards the more limited (but more robust) regex engine implementation.
The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution)
Only once such a hybrid library had been proven in practice could the default re module be considered for possible update.
Basically, I see some interesting ideas there, but I don't see anything that will impact the Python core or standard library in the short term.
Agreed on the whole. However, it might be interesting to explore the "hybrid mode" using some of the older, C-based libraries referenced by the original (Russ Cox) article, e.g. Rob Pike's Unicode-aware 1992 implmentation, as ported for Unix: http://swtch.com/plan9port/unix/ Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkuac7AACgkQ+gerLs4ltQ6t3wCgxQPsnpbyE1N9BEGg7nHqcBVP q04AnA6RMJw83+uc5u6Oyud89SXEoP5C =+0aL -----END PGP SIGNATURE-----

I have some regex-intensive Python that might benefit from this, but I don't think I have enough skill to do a set of Python bindings. An ideal first cut would be to enable this: import re2 as re I live in hope... Brent L

On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan <ncoghlan@gmail.com> wrote: [snip]
To bring this on-topic for python-dev by considering how it could apply to Python's default re engine, I think the key issue is that any updates to the default engine would need to remain backwards compatible with all of the tricks that re2 doesn't support.
There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support).
I don't see why C++ would be a deal-breaker in this case, since it would be restricted to an extension module.
I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI.
FWIW, re2 is heavily, heavily used in production at Google. Stabilizing any proposed Python bindings would be a good idea, but I'm not sure how much more "baking" re2's core functionality needs.
Once that happens (and assuming the bindings earn a good reputation), the first step towards integration would be to include a "See Also" in the re module documentation to point people towards the more limited (but more robust) regex engine implementation.
The next step would probably be a hybrid third party library that exploits the NFA approach when it can, but resorts to backtracking when it has to in order to handle full regex functionality. (Although developers would need to be able to disable the backtracking support in order to restore re2's guarantees of linear time execution)
We considered such a hybrid approach for Unladen Swallow, but rejected it on the advice of the V8 team [1]: you end up maintaining two totally separate, incompatible regex engines; the hybrid system comes with interesting, possibly unintuitive performance/correctness issues when bailing from one implementation to another; performance is unstable as small changes are made to the regexes; and it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.) re2 is very useful when you want to run user-provided regexes and want to protect your backends against pathological/malicious regex input, but I'm not sure how applicable it is to Python. I think there are more promising strategies to improve regex performance, such as reusing the new JIT infrastructure to JIT-compile regular expressions to machine code (along the lines of V8's irregexp). Some work has already been done in this direction, and I'd be thrilled to mentor any GSoC students interested in working on such a project this summer. Lastly, anyone interested in working on Python regex performance should take a look at the regex benchmarks in the standard benchmark suite [2]. Thanks, Collin Winter [1] - http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html#c48... [2] - http://hg.python.org/benchmarks/file/5b8fe389710b/performance

>> There are major practical problems associated with making such a leap >> directly (Google's re2 engine is in C++ rather than C and we'd have >> to keep the existing implementation around regardless to handle the >> features that re2 doesn't support). Collin> I don't see why C++ would be a deal-breaker in this case, since Collin> it would be restricted to an extension module. Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered "optional", I doubt anybody thinks of it as optional nowadays. It's used in the regression test suite anyway. It would be tough to run unit tests on such minority platforms without it. You'd have to maintain both the current sre implementation and the new re2 implementation for a long while into the future. As I was reading the code I thought, "Great! This stuff is so simple. It's even all written in C." Then I looked at the re2 page. :-( Skip

On Fri, Mar 12, 2010 at 11:29 AM, <skip@pobox.com> wrote:
>> There are major practical problems associated with making such a leap >> directly (Google's re2 engine is in C++ rather than C and we'd have >> to keep the existing implementation around regardless to handle the >> features that re2 doesn't support).
Collin> I don't see why C++ would be a deal-breaker in this case, since Collin> it would be restricted to an extension module.
Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered "optional", I doubt anybody thinks of it as optional nowadays. It's used in the regression test suite anyway. It would be tough to run unit tests on such minority platforms without it. You'd have to maintain both the current sre implementation and the new re2 implementation for a long while into the future.
re2 is not a full replacement for Python's current regex semantics: it would only serve as an accelerator for a subset of the current regex language. Given that, it makes perfect sense that it would be optional on such minority platforms (much like the incoming JIT). Collin

Collin> re2 is not a full replacement for Python's current regex Collin> semantics: it would only serve as an accelerator for a subset of Collin> the current regex language. Given that, it makes perfect sense Collin> that it would be optional on such minority platforms (much like Collin> the incoming JIT). Sure, but over the years Python has supported at least four different regular expression modules that I'm aware of (regex, regexp, and the current re module with different extension modules underneath it, perhaps there were others). During some of that time more than one module was distributed with Python proper. I think the desire today would be that only one regular expression module be distributed with Python (that would be my vote anyway). Getting people to move off the older libraries was difficult. If re2 can't replace sre under the covers than I think it belongs in PyPI, not the Python distribution. That said, that suggests to me that a different NFA or DFA implementation written in C would replace sre, one not written in C++. Hopefully that provides some context for my earlier response. Skip

On 12 Mar 2010, at 15:22, skip@pobox.com wrote:
Collin> re2 is not a full replacement for Python's current regex Collin> semantics: it would only serve as an accelerator for a subset of Collin> the current regex language. Given that, it makes perfect sense Collin> that it would be optional on such minority platforms (much like Collin> the incoming JIT).
Sure, but over the years Python has supported at least four different regular expression modules that I'm aware of (regex, regexp, and the current re module with different extension modules underneath it, perhaps there were others). During some of that time more than one module was distributed with Python proper. I think the desire today would be that only one regular expression module be distributed with Python (that would be my vote anyway). Getting people to move off the older libraries was difficult. If re2 can't replace sre under the covers than I think it belongs in PyPI, not the Python distribution. That said, that suggests to me that a different NFA or DFA implementation written in C would replace sre, one not written in C++.
re2 would be a supplement to re -- it is not a replacement, and Python would run fine if it's not present on some platforms. It's like a floating-point processor: you can do all math you need with just an integer processor. But if you have an FPU present, then it makes sense to use it for the FP operations. Jared

Le Fri, 12 Mar 2010 13:29:09 -0600, skip@pobox.com a écrit :
Traditionally Python has run on some (minority) platforms where C++ was unavailable.
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old... Regards Antoine.

On Mar 12, 2010, at 05:03 PM, Antoine Pitrou wrote:
Le Fri, 12 Mar 2010 13:29:09 -0600, skip@pobox.com a écrit :
Traditionally Python has run on some (minority) platforms where C++ was unavailable.
Is this concern still valid?
Certainly not if Unladen Swallow blazes the trail. -Barry

Antoine Pitrou:
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old...
There are still viable platforms which only support subsets of C++. IIRC, Android does not support exceptions in C++. Neil

On Fri, Mar 12, 2010 at 2:18 PM, Neil Hodgson <nyamatongwe@gmail.com> wrote:
Antoine Pitrou:
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old...
There are still viable platforms which only support subsets of C++. IIRC, Android does not support exceptions in C++.
Looks like they'll be getting exceptions "soon": http://groups.google.com/group/android-ndk/browse_thread/thread/89db67ed1fbf... But yeah, thanks for the concrete example, and I'd agree that Python should compile with -fno-exceptions, for a couple more years at least.

Neil Hodgson wrote:
Antoine Pitrou:
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old...
There are still viable platforms which only support subsets of C++. IIRC, Android does not support exceptions in C++.
Neil
Does re2 require exceptions?

On Sat, Mar 13, 2010 at 6:00 AM, Neal Becker <ndbecker2@gmail.com> wrote:
Neil Hodgson wrote:
Antoine Pitrou:
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old...
There are still viable platforms which only support subsets of C++. IIRC, Android does not support exceptions in C++.
Neil
Does re2 require exceptions?
Nope. No exceptions. It was written following this style guide: http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Exceptions

On Fri, Mar 12, 2010 at 4:03 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Fri, 12 Mar 2010 13:29:09 -0600, skip@pobox.com a écrit :
Traditionally Python has run on some (minority) platforms where C++ was unavailable.
Is this concern still valid? We are in the 2010s now. I'm not saying I want us to put some C++ in the core interpreter, but the portability argument sounds a little old...
One area where this _may_ be a problem is with embedded systems. I believe there are some instances where folks have built Python into an embedded system (with an RTOS say VxWorks, Symbian, QNX Neutrino, Nucleus, etc...) where C++ is not always the easiest to develop in. Admittedly, though, these types of systems are by far a minority with respect to Python. Just thought I would mention it anyhow. -- Meador

Antoine> skip@pobox.com a écrit : >> >> Traditionally Python has run on some (minority) platforms where C++ >> was unavailable. Antoine> Is this concern still valid? We are in the 2010s now. Like I said, *minority* platforms. Here are some which come to mind as quite possibly not supporting C++ well, if at all, yet all have some dialect of Python available: Windows CE: http://sourceforge.net/projects/pythonce/ Palm: http://pippy.sourceforge.net/ iPod: http://ciberjacobo.com/python-ipod OS/2: http://www.andymac.org/ QNX: http://sourceforge.net/projects/pyqnx/ If all you care about are mainstream Windows, Unix/Linux and Mac OSX environments then C++ is no problem. Skip

On Fri, Mar 12, 2010 at 7:54 PM, <skip@pobox.com> wrote:
Antoine> skip@pobox.com a écrit : >> >> Traditionally Python has run on some (minority) platforms where C++ >> was unavailable.
Antoine> Is this concern still valid? We are in the 2010s now.
Like I said, *minority* platforms. Here are some which come to mind as quite possibly not supporting C++ well, if at all, yet all have some dialect of Python available:
Windows CE: http://sourceforge.net/projects/pythonce/
http://www.wirelessdevnet.com/channels/pda/training/vcce.html
http://prc-tools.sourceforge.net/
http://ipodlinux.org/wiki/Toolchain
OS/2: http://www.andymac.org/
I can't find a modern c++ compiler for OS/2. Then again, your link only provides python 2.4.
http://www.qnx.com/products/tools/
If all you care about are mainstream Windows, Unix/Linux and Mac OSX environments then C++ is no problem.
If you care about any of these minority platforms except OS/2, C++ is also available, as a google search for "<platform name> C++" quickly finds. OS/2 might be available too; I just didn't look for very long. If you know of platforms that don't support particular features of C++, please link to documentation of that like Neil Hodgson did. If not, please stop spreading FUD.

Jeffrey> If you know of platforms that don't support particular features Jeffrey> of C++, please link to documentation of that like Neil Hodgson Jeffrey> did. If not, please stop spreading FUD. No need to get your knickers in a twist. Every couple years someone asks, "why isn't Python written in C++?" The lack of good C++ support for many (minority) platforms is often the primary explanation. If that's changing, I have no problem with it. I spent a fair amount of time with Google trying to find representative threads here or in comp.lang.python but failed to find any needles in the haystack. Perhaps you will have better luck. Skip

Jeffrey Yasskin wrote:
On Fri, Mar 12, 2010 at 7:54 PM, <skip@pobox.com> wrote:
OS/2: http://www.andymac.org/
I can't find a modern c++ compiler for OS/2. Then again, your link only provides python 2.4.
While the gcc used as part of the EMX toolchain is 2.8.1, there are ports of gcc 3.3.5 and 4.4.x using a somewhat compatible runtime (known as klibc). I have an unreleased 2.6 binary package which I'm holding up pending 2.6.5. OpenWatcom is also supported on OS/2, though I don't know whether the C++ support is advanced enough... while I like this compiler, the runtime support (or lack thereof) for Posixish features makes platform support a lot of work. Andrew. -- ------------------------------------------------------------------------- Andrew I MacIntyre "These thoughts are mine alone..." E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370 andymac@pcug.org.au (alt) | Belconnen ACT 2616 Web: http://www.andymac.org/ | Australia

Am 12.03.2010 20:29, schrieb skip@pobox.com:
>> There are major practical problems associated with making such a leap >> directly (Google's re2 engine is in C++ rather than C and we'd have >> to keep the existing implementation around regardless to handle the >> features that re2 doesn't support).
Collin> I don't see why C++ would be a deal-breaker in this case, since Collin> it would be restricted to an extension module.
Traditionally Python has run on some (minority) platforms where C++ was unavailable. While the re module is a dynamically linked extension module and thus could be considered "optional", I doubt anybody thinks of it as optional nowadays.
Actually, _sre is a builtin module by default since at least Python 2.5. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.

Collin Winter wrote:
On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
There are major practical problems associated with making such a leap directly (Google's re2 engine is in C++ rather than C and we'd have to keep the existing implementation around regardless to handle the features that re2 doesn't support).
I don't see why C++ would be a deal-breaker in this case, since it would be restricted to an extension module.
It wouldn't be a dealbreaker for an optional regex acceleration module, I agree. It's still a practical problem in the sense that C++ code is very much in the minority in the CPython code base.
I would say it is better to let re2 bake for a while and see if anyone is motivated to come up with Python bindings for it and release them on PyPI.
FWIW, re2 is heavily, heavily used in production at Google. Stabilizing any proposed Python bindings would be a good idea, but I'm not sure how much more "baking" re2's core functionality needs.
It was more the bindings side of things I was referring to - the algorithm's performance is no doubt well established, but how easy it is for non-Google users to pick up and integrate with other software is still to be determined.
We considered such a hybrid approach for Unladen Swallow, but rejected it on the advice of the V8 team [1]: you end up maintaining two totally separate, incompatible regex engines; the hybrid system comes with interesting, possibly unintuitive performance/correctness issues when bailing from one implementation to another; performance is unstable as small changes are made to the regexes; and it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.)
re2 is very useful when you want to run user-provided regexes and want to protect your backends against pathological/malicious regex input, but I'm not sure how applicable it is to Python. I think there are more promising strategies to improve regex performance, such as reusing the new JIT infrastructure to JIT-compile regular expressions to machine code (along the lines of V8's irregexp). Some work has already been done in this direction, and I'd be thrilled to mentor any GSoC students interested in working on such a project this summer.
Lastly, anyone interested in working on Python regex performance should take a look at the regex benchmarks in the standard benchmark suite [2].
This is the kind of baking that I was really talking about - there's an awful lot of investigation to be done before moving Python in the direction of an NFA based regex engine could be seriously considered. Good to hear the perspective of someone that has already looked into this in some detail though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
participants (16)
-
Andrew MacIntyre
-
Antoine Pitrou
-
Barry Warsaw
-
Brent Longborough
-
Collin Winter
-
Georg Brandl
-
Gregory P. Smith
-
Jared Grubb
-
Jeffrey Yasskin
-
Meador Inge
-
Neal Becker
-
Neil Hodgson
-
Nick Coghlan
-
Peter Portante
-
skip@pobox.com
-
Tres Seaver