
Hi folks,

I recently refreshed my knowledge of the theoretical basics of regular expressions *indulging in reminiscences*. So, I read https://swtch.com/~rsc/regexp/regexp1.html

However, reaching the chart in the lower third of the article, I saw Python 2.4 measured against a naive Thompson matching implementation, and I was surprised by how badly it performed compared to an unoptimized version of an older-than-dirt algorithm.

So, I decided to give it a try with Python 2.7 and Python 3.5. Eh, voilà, 100% CPU and no results so far. Nothing has changed at all since 2007.
>>> import re
>>> re.match(r'a?'*30 + r'a'*30, 'a'*30)
.... (still waiting)
Quoting from the article: "Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy."

Victor, as the head of the Python performance department, and Matthew, as the maintainer of the new regex module, what is your stance on this particular issue?

From my perspective, I can say that regular expressions might be worth optimizing, especially for web applications (URL matching usually uses regexes), but also for other applications where I've seen many tight loops using regexes as well. So, I am probing interest in this topic here.

Best,
Sven
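The article's point is that a Thompson-style simulation tracks every possible position in the pattern at once instead of backtracking. As a rough illustration only (not from the thread, and not how re or regex are implemented), here is a minimal sketch restricted to patterns built from literal characters and '?', which is enough for the test case above:

def parse(pattern):
    # Turn e.g. 'a?a' into [('a', True), ('a', False)]; True means the char is optional.
    tokens = []
    i = 0
    while i < len(pattern):
        optional = i + 1 < len(pattern) and pattern[i + 1] == '?'
        tokens.append((pattern[i], optional))
        i += 2 if optional else 1
    return tokens

def closure(tokens, states):
    # Add every position reachable by skipping optional tokens (the epsilon moves).
    result, stack = set(), list(states)
    while stack:
        i = stack.pop()
        if i in result:
            continue
        result.add(i)
        if i < len(tokens) and tokens[i][1]:
            stack.append(i + 1)
    return result

def match(pattern, text):
    # Simulate all NFA states in parallel: one pass over the text, no backtracking.
    tokens = parse(pattern)
    states = closure(tokens, {0})
    for char in text:
        states = closure(tokens, {i + 1 for i in states
                                  if i < len(tokens) and tokens[i][0] == char})
        if not states:
            return False
    return len(tokens) in states

print(match('a?' * 30 + 'a' * 30, 'a' * 30))   # True, and it returns immediately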

2017-01-26 22:13 GMT+01:00 Sven R. Kunze <srkunze@mail.de>:
Hi folks,
I recently refreshed regular expressions theoretical basics *indulging in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
However, reaching the chart in the lower third of the article, I saw Python 2.4 measured against a naive Thompson matching implementation. And I was surprised about how bad it performed compared to an unoptimized version of an older than dirt algorithm.
So, I decided to give it a try with Python2.7 and Python3.5. Eh, voilà, 100% cpu and no results so far. Nothing has changed at all since 2007.
>>> import re
>>> re.match(r'a?'*30 + r'a'*30, 'a'*30)
.... (still waiting)
Quoting from the article: " Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy."
Victor, as the head of Python performance department, and Matthew, as the maintainer of the new regex module, what is your stance on this particular issue?
From my perspective, I can say, that regular expressions might worth optimizing especially for web applications (url matching usually uses regexes) but also for other applications where I've seen many tight loops using regexes as well. So, I am probing interest on this topic here.
Best, Sven
Hi,
I can't speak about the details of mrab's implementation, but using regex, I get the resulting match instantly:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import regex
>>> regex.match(r'a?'*30 + r'a'*30, 'a'*30)
<regex.Match object; span=(0, 30), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'>
(I personally prefer to use regex for other advantages than this artificial case, but it certainly doesn't hurt to have better performance here too :)

regards,
vbr
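For anyone who wants to reproduce the difference locally, a rough timing sketch along these lines works (it assumes the third-party regex package is installed; keep n modest, because re's backtracking time grows roughly exponentially with n):

import re
import time

import regex

def time_match(module, n):
    # Compile and match the pathological pattern once, returning the elapsed time.
    pattern = module.compile(r'a?' * n + r'a' * n)
    start = time.perf_counter()
    pattern.match('a' * n)
    return time.perf_counter() - start

# Keep n modest: re's running time roughly doubles with each extra 'a?'.
for n in (5, 10, 15, 20, 22):
    print('n=%2d  re: %8.4fs  regex: %8.4fs'
          % (n, time_match(re, n), time_match(regex, n)))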

On 26.01.2017 22:33, Vlastimil Brom wrote:
Hi, I can't speak about the details of mrab's implementation, but using regex, I get the resulting match instantly: [...]
Nice! I focused on the stdlib re module as this is mainly used by other frameworks (like Django).
(I personally prefer to use regex for other advantages, than this artificial case, but it certainly doesn't hurt to have better performance here too:)
Me, too.

So, it seems as if regex already uses a better algorithm, although I couldn't find any reference to any regex-theoretical framework like DFA, NFA, Thompson multiple-state simulation or something.

Cheers,
Sven

On 2017-01-26 21:46, Sven R. Kunze wrote:
On 26.01.2017 22:33, Vlastimil Brom wrote:
Hi, I can't speak about the details of mrab's implementation, but using regex, I get the resulting match instantly: [...]
Nice! I focused on the stdlib re module as this is mainly used by other frameworks (like Django).
(I personally prefer to use regex for other advantages, than this artificial case, but it certainly doesn't hurt to have better performance here too:)
Me, too.
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.

On Jan 26, 2017, at 5:16 PM, MRAB <python@mrabarnett.plus.com> wrote:
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.
What’s the status of regex inclusion in the stdlib?

- Ł

On 27/01/2017 17:03, Łukasz Langa wrote:
On Jan 26, 2017, at 5:16 PM, MRAB <python@mrabarnett.plus.com> wrote:
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.
What’s the status of regex inclusion in the stdlib?
- Ł
I've asked about this in the past, but have now come to the conclusion that it is way better to leave regex, and many other third-party modules, on PyPI rather than have them tied into the Python release cycle. If YMMV, so be it.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence

On 2017-01-27 17:03, Łukasz Langa wrote:
On Jan 26, 2017, at 5:16 PM, MRAB <python@mrabarnett.plus.com> wrote:
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.
What’s the status of regex inclusion in the stdlib?
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.

On 27/01/2017 22:24, MRAB wrote:
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.
I tried packaging it (pip build) and ran into a minor syntax error (on AIX using xlc). How do I go about reporting what I found?

On 2017-01-27 22:15, Michael Felt wrote:
On 27/01/2017 22:24, MRAB wrote:
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.
I tried packaging it (pip build) and ran into a minor syntax error (on AIX using xlc). How do I go about reporting what I found?
Report it on the project's issue tracker.

On 27 January 2017 at 22:24, MRAB <python@mrabarnett.plus.com> wrote:
On 2017-01-27 17:03, Łukasz Langa wrote:
On Jan 26, 2017, at 5:16 PM, MRAB <python@mrabarnett.plus.com> wrote:
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.
What’s the status of regex inclusion in the stdlib?
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1/28/2017 9:43 AM, Nick Coghlan wrote:
On 27 January 2017 at 22:24, MRAB <python@mrabarnett.plus.com> wrote:
On 2017-01-27 17:03, Łukasz Langa wrote:
What’s the status of regex inclusion in the stdlib?
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).
So like pip, we would bundle the current version, install it in site-packages, and users who use it could upgrade as desired, even after x.y bugfix support ends.

What would be nice would be to have a short entry in the docs for bundled modules that links to the real doc, wherever it is.

--
Terry Jan Reedy

On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).
How would that work exactly? I.e. is there a PEP?

While I think it could be a good idea to bundle (more?) third-party packages for a variety of reasons, I want to make sure it's done in a way that's still friendly to downstream repackagers, as I'm sure you do too. :)

For Debian/Ubuntu, we already have some additional machinations to make ensurepip and such work properly in a distro packaging environment. We'd want to be able to do the same for any bundled packages as well.

Cheers,
-Barry

On Sat, 28 Jan 2017 12:07:05 -0500 Barry Warsaw <barry@python.org> wrote:
On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).
How would that work exactly. I.e. is there a PEP?
While I think it could be a good idea to bundle (more?) third party packages for a variety of reasons, I want to make sure it's done in a way that's still friendly to downstream repackagers, as I'm sure you do to. :)
That sounds like a lot of effort and maintenance... Don't we bundle pip *exactly* so that we don't have to bundle other third-party packages and instead tell users to "just use `pip install <some package>`"?

To sum it up, how about we simply add an official suggestion to use regex in the docs?

Regards
Antoine.

Why not declare re deprecated and remove it in Python 4? I am pretty sure everyone wants to keep re in all 3.x releases, but that support need not extend beyond. So Py4 would have no battery for re, but it would (should!) be common knowledge that regex was the go-to module for general-purpose pattern matching.

If re has advantages in certain situations, someone might upgrade the 3.x implementation and provide it as a 3rd-party module, though the effort involved would be significant, so someone would have to be motivated to keep it.

regards
Steve

Steve Holden

On Sun, Jan 29, 2017 at 4:13 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).
How would that work exactly. I.e. is there a PEP?
On Sat, 28 Jan 2017 12:07:05 -0500 Barry Warsaw <barry@python.org> wrote:
While I think it could be a good idea to bundle (more?) third party packages for a variety of reasons, I want to make sure it's done in a way that's still friendly to downstream repackagers, as I'm sure you do to. :)
That sounds like a lot of effort and maintenance... Don't we bundle pip *exactly* so that we don't have to bundle other third-party packages and instead tell users to "just use `pip install <some package>`"?
To sum it up, how about we simply add an official suggestion to use regex in the docs?
Regards
Antoine.

On 29 January 2017 at 20:30, Steve Holden <steve@holdenweb.com> wrote:
Why not declare re deprecated and remove it in Python 4? I am pretty sure everyone wants to keep re in all 3.x releases, but that support need not extend beyond. So Py4 would have no battery for re, but it would (should!) be common knowledge that regex was the go-to module for general-purpose pattern matching. If re has advantages in certain situations someone might upgrade the 3.x implementation and provide it as a 3rd-party module, though the effort involved would be significant, so someone would have to be motivated to keep it.
Not having regex capability distributed with Python is a pretty big regression. There are still a lot of users who don't have access to 3rd party modules. Paul

On 29.01.17 22:30, Steve Holden wrote:
Why not declare re deprecated and remove it in Python 4? I am pretty sure everyone wants to keep re in all 3.x releases, but that support need not extend beyond. So Py4 would have no battery for re, but it would (should!) be common knowledge that regex was the go-to module for general-purpose pattern matching. If re has advantages in certain situations someone might upgrade the 3.x implementation and provide it as a 3rd-party module, though the effort involved would be significant, so someone would have to be motivated to keep it.
Regular expressions are used in a number of standard modules and scripts. Excluding them from the stdlib would require excluding or rewriting a large part of the stdlib.
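A rough way to check that claim against a given installation is a quick scan of the installed stdlib directory (results vary by installation, and a scan like this can miss dynamic uses):

import pathlib
import re
import sysconfig

# Count stdlib files whose source contains an "import re" / "from re import" line.
stdlib = pathlib.Path(sysconfig.get_paths()['stdlib'])
uses_re = re.compile(r'^\s*(import|from)\s+re\b', re.MULTILINE)
users = [path for path in stdlib.rglob('*.py')
         if uses_re.search(path.read_text(errors='ignore'))]
print(len(users), 'stdlib files import re')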

On Sun, 29 Jan 2017 20:30:38 +0000 Steve Holden <steve@holdenweb.com> wrote:
Why not declare re deprecated and remove it in Python 4?
Why deprecate and remove a library that's perfectly usable and satisfactory for the vast majority of regular expression usage? It's not like regex presents a radically different API...

Do we really have to face another wave of stdlib bashing on this mailing-list?

Regards
Antoine.

On Mon, Jan 30, 2017 at 2:56 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 29 Jan 2017 20:30:38 +0000 Steve Holden <steve@holdenweb.com> wrote:
Why not declare re deprecated and remove it in Python 4?
Why deprecate and remove a library that's perfectly usable and satisfactory for the vast majority of regular expression usage? It's not like regex presents a radically different API...
Do we really have to face another wave of stdlib bashing on this mailing-list?
I agree with Antoine. And note that re has an active maintainer who has fixed a bunch of bugs and added some new features in the last few years.

--Berker

On 28 January 2017 at 18:07, Barry Warsaw <barry@python.org> wrote:
On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
I still think it could be a good candidate for a first "bundled" module, where we don't migrate it fully into the CPython development process, but *do* officially bless it and provide it by default in the form of a bundled wheel file (similar to what we do with pip).
How would that work exactly. I.e. is there a PEP?
There's no PEP, and I don't think it would make sense to have one without a specific concrete example to drive the discussion. I think there are 3 main candidates that could fit that bill:

- requests
- setuptools
- regex

Using requests to drive it would bring in a lot of incidental complexity around SSL/TLS certificate management and integrating with platform trust stores, which I think would overshadow the core policy discussion of switching from a pure co-development model for the standard library to one where some components are independently versioned, with recent snapshots being included in CPython feature and maintenance branches.

For setuptools, one of our main goals in PyPA is to make it just one build system option amongst many, as well as ensuring that it's easy to do service deployments that don't have any build system present at all, so bundling it by default with the stdlib would run counter to those goals.

regex, by contrast, is a relatively simple standalone module that offers a more actively developed alternative to the standard library's re module, but would still require a PEP to address the following technical and policy questions that are normally handled by copying code wholesale into the CPython tree under a contributor license agreement:

- what licensing and other IP assurances would the PSF need to be comfortable redistributing the bundled component?
- what maintenance sustainability questions would need to be addressed? (e.g. succession planning for the ability to publish new releases)
- how would bundled modules be handled in the CPython test suite?
- how would bundled modules be handled during local development, during installation on supported OSes, and when creating virtual environments?
- how does platform support in bundled modules relate to the platform support guidelines given in PEP 11?
- how are additional build requirements for bundled modules (e.g. C++ compilers) handled?
- would bundled modules be permitted to gain features in CPython maintenance releases, or would we expect any bundled modules to offer conservative maintenance branches for the life of a given CPython feature release?
- how would maintenance collaboration and issue escalation work? Would bundled module maintainers need to follow both their own issue tracker and the CPython one?

In many respects, PEP 453 and the bundling of pip already offer precedent for how these are handled (e.g. pip is collectively maintained by PyPA, and that group is as closely associated with the PSF from a legal and support perspective as python-dev is), so any PEP along these lines would be about deciding which parts of what we did for pip were a technical and policy precedent for 3rd-party module bundling in general, and which are unique to pip.
While I think it could be a good idea to bundle (more?) third party packages for a variety of reasons, I want to make sure it's done in a way that's still friendly to downstream repackagers, as I'm sure you do to. :)
Indeed :)
For Debian/Ubuntu, we already have some additional machinations to make ensurepip and such work properly in a distro packaging environment. We'd want to be able to do the same for any bundled packages as well.
Yeah, I suspect any development along these lines would make it even more important to have something like PEP 534 in place so that the bundled modules could be omitted by default in some cases (such as in virtual environments), and require people to depend on them explicitly if they needed them.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
I think there are 3 main candidates that could fit that bill:
- requests
- setuptools
- regex
Actually, I think pkg_resources would make an excellent candidate. The setuptools crew is working on a branch that would allow for setuptools and pkg_resources to be split, which would be great for other reasons. Splitting them may mean that pkg_resources could eventually be added to the stdlib, but as an intermediate step, it could also test out this idea. It probably has a lot less of the baggage that you outline.

Cheers,
-Barry

On Mon, 30 Jan 2017 at 06:27 Barry Warsaw <barry@python.org> wrote:
On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
I think there are 3 main candidates that could fit that bill:
- requests
- setuptools
- regex
Actually, I think pkg_resources would make an excellent candidate. The setuptools crew is working on a branch that would allow for setuptools and pkg_resources to be split, which would be great for other reasons. Splitting them may mean that pkg_resources could eventually be added to the stdlib, but as an intermediate step, it could also test out this idea. It probably has a lot less of the baggage that you outline.
What functionality are you after here from pkg_resources? If it's the reading of data files, then you will get that in the stdlib through importlib as soon as I'm no longer working on getting our workflow to work through GitHub's web UI (which obviously includes the migration itself).
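For reference, the loader-based data-file API alluded to here eventually shipped as importlib.resources in Python 3.7, well after this thread; a minimal sketch, where "mypackage" and "defaults.cfg" are hypothetical placeholder names:

from importlib import resources

# Read a data file shipped inside an installed package (hypothetical names).
text = resources.read_text('mypackage', 'defaults.cfg')        # str
data = resources.read_binary('mypackage', 'defaults.cfg')      # bytes
with resources.path('mypackage', 'defaults.cfg') as cfg_path:  # concrete filesystem path
    print(cfg_path)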

On Jan 30, 2017, at 06:14 PM, Brett Cannon wrote:
What functionality are you after here from pkg_resources? If it's the reading of data files then you will get that in the stdlib through importlib as soon as I'm not working on getting our workflow to work through GitHub's web UI (which obviously includes the migration itself).
http://setuptools.readthedocs.io/en/latest/pkg_resources.html#basic-resource...

Mostly I use:

* resource_filename()
* resource_string() (really, resource_bytes!)
* resource_stream() (although I'd really like a more open()-like API)

This might fall under a simpler PEP 365. Also, while I would love to have these available, say, in importlib, I'd also like to consider a backward-compatible API where the stdlib provides the `pkg_resources` module name. That's not totally required though, because you can play ImportError games until Python 3.7 (presumably) is in widespread -and only- use.

Cheers,
-Barry
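For context, those three calls are used roughly like this ("mypackage" and "data/defaults.cfg" are hypothetical placeholders standing in for a real package and data file):

import pkg_resources

# Read a packaged data file as bytes (the name says "string", but it returns bytes).
raw = pkg_resources.resource_string('mypackage', 'data/defaults.cfg')

# Get a real filesystem path to the resource (extracted to a cache for zipped packages).
path = pkg_resources.resource_filename('mypackage', 'data/defaults.cfg')

# Open the resource as a binary file-like object.
with pkg_resources.resource_stream('mypackage', 'data/defaults.cfg') as fp:
    first_line = fp.readline()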

Proper support in the loader API is possible and was on its way to being coded up until GitHub took precedence (realize this project has put all other major Python projects of mine on hold for nearly 2 years, hence the delay). And pkg_resources, I'm sure, could be updated to use any new API when available, so I don't think it's appropriate to bring the name into the stdlib.

On Mon, Jan 30, 2017, 10:28 Barry Warsaw <barry@python.org> wrote:
On Jan 30, 2017, at 06:14 PM, Brett Cannon wrote:
What functionality are you after here from pkg_resources? If it's the reading of data files then you will get that in the stdlib through importlib as soon as I'm not working on getting our workflow to work through GitHub's web UI (which obviously includes the migration itself).
http://setuptools.readthedocs.io/en/latest/pkg_resources.html#basic-resource...
Mostly I use:
* resource_filename()
* resource_string() (really, resource_bytes!)
* resource_stream() (although I'd really like a more open()-like API)
This might fall under a simpler PEP 365. Also, while I would love to have these available say in importlib, I also like to consider a backward compatible API where the stdlib provides the `pkg_resources` module name. That's not totally required though because you can play ImportError games until Python 3.7 (presumably) is in widespread -and only- use.
Cheers,
-Barry

On Jan 30, 2017, at 6:26 AM, Barry Warsaw <barry@python.org> wrote:
Actually, I think pkg_resources would make an excellent candidate. The setuptools crew is working on a branch that would allow for setuptools and pkg_resources to be split, which would be great for other reasons. Splitting them may mean that pkg_resources could eventually be added to the stdlib, but as an intermediate step, it could also test out this idea. It probably has a lot less of the baggage that you outline.
FWIW this is what we do at Facebook. We ship pkg_resources with everything but not setuptools. So there's definitely value in that. It's a little surprising we haven't been doing that yet, given that zipped applications need it.

- Ł

On 30 January 2017 at 15:26, Barry Warsaw <barry@python.org> wrote:
On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
I think there are 3 main candidates that could fit that bill:
- requests
- setuptools
- regex
Actually, I think pkg_resources would make an excellent candidate. The setuptools crew is working on a branch that would allow for setuptools and pkg_resources to be split, which would be great for other reasons. Splitting them may mean that pkg_resources could eventually be added to the stdlib, but as an intermediate step, it could also test out this idea. It probably has a lot less of the baggage that you outline.
Yep, if/when pkg_resources is successfully split out from the rest of setuptools, I agree it would also be a good candidate for stdlib bundling - version-independent runtime access to the database of installed packages is a key capability for many use cases, and not currently something we support especially well. It's also far more analogous to the existing pip bundling, since setuptools/pkg_resources are also maintained under the PyPA structure.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Regarding the performance difference between "re" and "regex" and the packaging-related options, we did a performance comparison using Python 3.6.0 to run some micro-benchmarks in the Python Benchmark Suite (https://github.com/python/performance):

Results are in ms, and lower is better (running on Ubuntu 15.10). The "regex" column was obtained via "pip install regex" and replacing "import re" with "import regex as re":

                        re      regex
bm_regex_compile.py     229     298
bm_regex_dna.py         171     267
bm_regex_effbot.py      2.77    3.04
bm_regex_v8.py          24.8    14.1

This data shows "re" is better than "regex" in terms of performance in 3 out of 4 of the above micro-benchmarks.

Anyone searching for "regular expression python" will get a first hit at the Python documentation on "re". Naturally, any new developer could start with "re" since day 1 and not bother to look elsewhere for alternatives later on. We did a query for "import re" against the big cloud computing software application, OpenStack (with 3.7 million lines of source code, the majority of them written in Python), and got ~1000 hits.

With that being said, IMHO, it would be nice to capture ("borrow") the performance benefit from "regex" and merge it into "re", without knowing or worrying about packaging/installing stuff.

Cheers,
Peter

-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+peter.xihong.wang=intel.com@python.org] On Behalf Of Nick Coghlan
Sent: Tuesday, January 31, 2017 1:54 AM
To: Barry Warsaw <barry@python.org>
Cc: python-dev@python.org
Subject: Re: [Python-Dev] re performance

On 30 January 2017 at 15:26, Barry Warsaw <barry@python.org> wrote:
On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
I think there are 3 main candidates that could fit that bill:
- requests
- setuptools
- regex
Actually, I think pkg_resources would make an excellent candidate. The setuptools crew is working on a branch that would allow for setuptools and pkg_resources to be split, which would be great for other reasons. Splitting them may mean that pkg_resources could eventually be added to the stdlib, but as an intermediate step, it could also test out this idea. It probably has a lot less of the baggage that you outline.
Yep, if/when pkg_resources is successfully split out from the rest of setuptools, I agree it would also be a good candidate for stdlib bundling - version independent runtime access to the database of installed packages is a key capability for many use cases, and not currently something we support especially well. It's also far more analogous to the existing pip bundling, since setuptools/pkg_resources are also maintained under the PyPA structure.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jan 31, 2017, at 11:40 AM, Wang, Peter Xihong <peter.xihong.wang@intel.com> wrote:
Regarding to the performance difference between "re" and "regex" and packaging related options, we did a performance comparison using Python 3.6.0 to run some micro-benchmarks in the Python Benchmark Suite (https://github.com/python/performance <https://github.com/python/performance>):
Results in ms, and the lower the better (running on Ubuntu 15.10); regex via "pip install regex" and a replacement of "import re" with "import regex as re":

                        re      regex
bm_regex_compile.py     229     298
bm_regex_dna.py         171     267
bm_regex_effbot.py      2.77    3.04
bm_regex_v8.py          24.8    14.1

This data shows "re" is better than "regex" in term of performance in 3 out of 4 above micro-benchmarks.
This is very informative, thank you! This clearly shows we should rather pursue the PyPI route (with a documentation endorsement and possible bundling for 3.7) than a full-blown replacement.

However, this benchmark is incomplete in the sense that it only checks the compatibility mode of `regex`, whereas it's the new mode that lends the biggest performance gains. So, providing checks for the other engine would show us the full picture. We'd need to add checks that prove the regular expressions in said benchmarks end up with equivalent matches, to be sure we're testing the same thing.

- Ł
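For the record, the two regex behaviour modes are selected with the V0/V1 flags (V0 is the re-compatible mode and the default, V1 the newer behaviour), either as a flag argument or in-pattern, so a benchmark comparison would want to cover both:

import regex

pattern = r'a?' * 30 + r'a' * 30
text = 'a' * 30

print(regex.match(pattern, text, flags=regex.V0))   # V0: re-compatible mode (the default)
print(regex.match(pattern, text, flags=regex.V1))   # V1: the newer behaviour mode
print(regex.match('(?V1)' + pattern, text))         # the mode can also be set in-pattern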

2017-02-01 20:42 GMT+01:00 Lukasz Langa <lukasz@langa.pl>:
However, this benchmark is incomplete in the sense that it only checks the compatibility mode of `regex`, whereas it's the new mode that lends the biggest performance gains. So, providing checks for the other engine would show us the full picture.
Would you mind writing a pull request for performance to add a command line option to test "re" (stdlib) or "regex" (PyPI, in the new mode)? Or maybe even regex and regex_compatibility_mode :-)

Source: https://github.com/python/performance/blob/master/performance/benchmarks/bm_...

Example of a benchmark with cmdline options: https://github.com/python/performance/blob/master/performance/benchmarks/bm_...

Victor
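A rough sketch of what such an option could look like, written against the pyperf runner API that the benchmark suite is built on (the package was still named perf at the time, and the exact wiring below is an assumption for illustration, not the suite's actual code):

import importlib
import time

import pyperf

def add_cmdline_args(cmd, args):
    # Forward the custom option to pyperf's worker processes.
    cmd.extend(('--engine', args.engine))

runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
runner.argparser.add_argument('--engine', choices=['re', 'regex'], default='re',
                              help='regular expression module to benchmark')
args = runner.parse_args()
engine = importlib.import_module(args.engine)

PATTERN = engine.compile(r'[a-z]+\d+')
TEXT = 'abc123 def456 ' * 200

def bench_findall(loops):
    # Time `loops` iterations of a simple findall with the selected engine.
    t0 = time.perf_counter()
    for _ in range(loops):
        PATTERN.findall(TEXT)
    return time.perf_counter() - t0

runner.bench_time_func('regex_findall_' + args.engine, bench_findall)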

+1. We'd like to get more details on how to try this "new mode", and do a full/comprehensive comparison between "re" and "regex".

Peter

-----Original Message-----
From: Victor Stinner [mailto:victor.stinner@gmail.com]
Sent: Wednesday, February 01, 2017 12:58 PM
To: Lukasz Langa <lukasz@langa.pl>
Cc: Wang, Peter Xihong <peter.xihong.wang@intel.com>; python-dev@python.org
Subject: Re: [Python-Dev] re performance

2017-02-01 20:42 GMT+01:00 Lukasz Langa <lukasz@langa.pl>:
However, this benchmark is incomplete in the sense that it only checks the compatibility mode of `regex`, whereas it's the new mode that lends the biggest performance gains. So, providing checks for the other engine would show us the full picture.
Would you mind to write a pull request for performance to add a command line option to test "re" (stdlib) or "regex" (PyPI, in the new mode)? Or maybe even regex and regex_compatibility_mode :-)

Source: https://github.com/python/performance/blob/master/performance/benchmarks/bm_...

Example of benchmark with cmdline options: https://github.com/python/performance/blob/master/performance/benchmarks/bm_...

Victor

On 31.01.17 21:40, Wang, Peter Xihong wrote:
Regarding to the performance difference between "re" and "regex" and packaging related options, we did a performance comparison using Python 3.6.0 to run some micro-benchmarks in the Python Benchmark Suite (https://github.com/python/performance):
Results in ms, and the lower the better (running on Ubuntu 15.10); regex via "pip install regex" and a replacement of "import re" with "import regex as re":

                        re      regex
bm_regex_compile.py     229     298
bm_regex_dna.py         171     267
bm_regex_effbot.py      2.77    3.04
bm_regex_v8.py          24.8    14.1

This data shows "re" is better than "regex" in term of performance in 3 out of 4 above micro-benchmarks.
bm_regex_v8 is the one that is intended to reflect real-world use of regular expressions. See also a different comparison at https://mail.python.org/pipermail/speed/2016-March/000311.html. In some tests regex surpasses re, in other tests re surpasses regex. re2 is much faster than other engines in all tests except the one in which it is much slower (and this engine is the least featured).

On Fri, 27 Jan 2017 at 13:26 MRAB <python@mrabarnett.plus.com> wrote:
On 2017-01-27 17:03, Łukasz Langa wrote:
On Jan 26, 2017, at 5:16 PM, MRAB <python@mrabarnett.plus.com> wrote:
So, it seems as if regex already uses a better algorithm although I couldn't find any reference to any regex theoretical framework like dfa, nfa, thompson multiple-state simulation or something.
It still uses backtracking, like in the re module.
What’s the status of regex inclusion in the stdlib?
I'm not bothered about it. It's quite a bit bigger than the re module, and, anyway, keeping it as a third-party module gives me more freedom to make updates, which are available for a range of Python versions.
Maybe regex should get a mention in the docs like requests does under urllib.request?

On 28.01.17 20:04, Brett Cannon wrote:
Maybe regex should get a mention in the docs like requests does under urllib.request?

On 2017-01-26 21:13, Sven R. Kunze wrote:
Hi folks,
I recently refreshed regular expressions theoretical basics *indulging in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
However, reaching the chart in the lower third of the article, I saw Python 2.4 measured against a naive Thompson matching implementation. And I was surprised about how bad it performed compared to an unoptimized version of an older than dirt algorithm.
So, I decided to give it a try with Python2.7 and Python3.5. Eh, voilà, 100% cpu and no results so far. Nothing has changed at all since 2007.
>>> import re
>>> re.match(r'a?'*30 + r'a'*30, 'a'*30)
.... (still waiting)
Quoting from the article: " Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy."
Victor, as the head of Python performance department, and Matthew, as the maintainer of the new regex module, what is your stance on this particular issue?
From my perspective, I can say, that regular expressions might worth optimizing especially for web applications (url matching usually uses regexes) but also for other applications where I've seen many tight loops using regexes as well. So, I am probing interest on this topic here.

It's possible to add some checks to reduce the problem, as I've done in the regex module, but I'm not sure how easy it would be to retrofit it into the existing re code. (It wasn't pleasant in the regex module, so...)

Hi Sven,

On 26 January 2017 at 22:13, Sven R. Kunze <srkunze@mail.de> wrote:
I recently refreshed regular expressions theoretical basics *indulging in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
Theoretical regular expressions and what Python/Perl/etc. call regular expressions are a bit different. You can read more about it at https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running... Discussions about why they are different often focus on backreferences, which is a rare feature. Let me add two other points.

The theoretical kind of regexp is about giving a "yes/no" answer, whereas the concrete "re" or "regexp" modules give a match object, which lets you ask for the subgroups' location, for example. Strange as it may seem, I am not aware of a way to do that using the linear-time approach of the theory---if it answers "yes", then you have no way of knowing *where* the subgroups matched.

Another issue is that the theoretical engine has no notion of greedy/non-greedy matching. Basically, you walk over the source characters and it answers "yes" or "no" after each of them. This is different from a typical backtracking implementation. In Python:
>>> re.match(r'a*', 'aaa')
>>> re.match(r'a*?', 'aaa')
This matches either three or zero characters in Python. The two versions are, however, indistinguishable for the theoretical engine.

A bientôt,
Armin.

* Armin Rigo <armin.rigoatgmail.com>, 2017-01-28, 12:44:
The theoretical kind of regexp is about giving a "yes/no" answer, whereas the concrete "re" or "regexp" modules gives a match object, which lets you ask for the subgroups' location, for example. Strange at it may seem, I am not aware of a way to do that using the linear-time approach of the theory---if it answers "yes", then you have no way of knowing *where* the subgroups matched.
Another issue is that the theoretical engine has no notion of greedy/non-greedy matching.
RE2 has linear execution time, and it supports both capture groups and greedy/non-greedy matching. The implementation is explained in this article: https://swtch.com/~rsc/regexp/regexp3.html

--
Jakub Wilk

On 29.01.17 12:18, Jakub Wilk wrote:
* Armin Rigo <armin.rigoatgmail.com>, 2017-01-28, 12:44:
The theoretical kind of regexp is about giving a "yes/no" answer, whereas the concrete "re" or "regexp" modules gives a match object, which lets you ask for the subgroups' location, for example. Strange at it may seem, I am not aware of a way to do that using the linear-time approach of the theory---if it answers "yes", then you have no way of knowing *where* the subgroups matched.
Another issue is that the theoretical engine has no notion of greedy/non-greedy matching.
RE2 has linear execution time, and it supports both capture groups and greedy/non-greedy matching.
The implementation is explained in this article: https://swtch.com/~rsc/regexp/regexp3.html
Not all features of Python regular expressions can be implemented with linear complexity. It would be possible to compile the subset of regular expressions that can be handled in linear time to such an implementation. Patches are welcome.

Armin Rigo wrote:
The theoretical kind of regexp is about giving a "yes/no" answer, whereas the concrete "re" or "regexp" modules gives a match object, which lets you ask for the subgroups' location, for example.
Another issue is that the theoretical engine has no notion of greedy/non-greedy matching.
These things aren't part of the classical theory of REs that is usually taught, but it should be possible to do them in linear time. They can be done for context-free languages using e.g. an LALR parser, and regular languages are a subset of context-free languages.

--
Greg

On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Hi folks,
I recently refreshed regular expressions theoretical basics *indulging in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
However, reaching the chart in the lower third of the article, I saw Python 2.4 measured against a naive Thompson matching implementation. And I was surprised about how bad it performed compared to an unoptimized version of an older than dirt algorithm.
From my perspective, I can say, that regular expressions might worth optimizing especially for web applications (url matching usually uses regexes) but also for other applications where I've seen many tight loops using regexes as well. So, I am probing interest on this topic here.
What I (think I) know:

- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows capture groups (but only up to a limit) and bounded repetitions (up to a limit).
- Perl has started to optimize some regex patterns.

What I think:

- The example is uncharacteristic of what people write, like URL matching.
- But enabling naive code to perform well is usually a good thing. The fewer details newbies need to know to write code, the better.
- It's possible for Python to optimize a large category of patterns, even if not all.
- It's also possible to optimize large parts of patterns with backrefs. E.g. if there's a backref, the group that the backref refers to can still be made into an NFA.
- To do the above, you'd need a way to generate all possible matches.
- Optimization can be costly. The full NFA construction could be generated only upon request, or maybe the code automatically tries to optimize after 100 uses (like a JIT). This should only be considered if re2's construction really is costly.

If people want NFAs, I think the "easiest" way is to use re2. Jakub Wilk mentioned it before, but here it is again.
https://github.com/google/re2

re2 features: https://github.com/google/re2/wiki/Syntax
Named references aren't supported, but I think that can be worked around with some renumbering. It's just a matter of translating the pattern. It also doesn't like lookarounds, which AFAIK are perfectly doable with NFAs. Looks like lookaheads and lookbehinds are hard to do without a lot of extra space (https://github.com/google/re2/issues/5).

Facebook has a Python wrapper for re2.
https://github.com/facebook/pyre2/

In a post linked to from this thread, Serhiy mentioned another Python wrapper for re2. This wrapper is designed to be like re, and should probably be the basis of any efforts. It's not been updated for 11 months, though.
https://pypi.python.org/pypi/re2/
https://github.com/axiak/pyre2/
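For experimentation, the usual drop-in pattern with the pyre2 wrapper looks roughly like this (a hedged sketch: it assumes the re2 wrapper is installed and that the pattern stays within the syntax it supports):

try:
    import re2 as re_impl          # the pyre2 wrapper (pip install re2)
except ImportError:
    import re as re_impl           # fall back to the stdlib engine

# The pathological pattern from the start of the thread is handled fine by re2.
match = re_impl.match(r'a?' * 30 + r'a' * 30, 'a' * 30)
print(match.span() if match else None)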

On 2017-02-02 04:37, Franklin? Lee wrote:
On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Hi folks,
I recently refreshed regular expressions theoretical basics *indulging in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
However, reaching the chart in the lower third of the article, I saw Python 2.4 measured against a naive Thompson matching implementation. And I was surprised about how bad it performed compared to an unoptimized version of an older than dirt algorithm.
From my perspective, I can say, that regular expressions might worth optimizing especially for web applications (url matching usually uses regexes) but also for other applications where I've seen many tight loops using regexes as well. So, I am probing interest on this topic here.
What I (think I) know:

- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows capture groups (but only up to a limit) and bounded repetitions (up to a limit).
- Perl has started to optimize some regex patterns.
[snip]

re and regex use different C backends. Both are based on NFA. re2 is based on DFA, with a fallback to re.
participants (22)

- Antoine Pitrou
- Armin Rigo
- Barry Warsaw
- Berker Peksağ
- Brett Cannon
- Franklin? Lee
- Greg Ewing
- Jakub Wilk
- Lukasz Langa
- Mark Lawrence
- Michael Felt
- MRAB
- Nick Coghlan
- Paul Moore
- Serhiy Storchaka
- Steve Holden
- Sven R. Kunze
- Terry Reedy
- Victor Stinner
- Vlastimil Brom
- Wang, Peter Xihong
- Łukasz Langa