Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
On Fri, Jul 9, 2010 at 5:52 AM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
The list of fixed bugs/new features is certainly impressive. How does Python's test suite go if you drop it in place of the current "re" module? (Ditto for test suites of major applications and frameworks like Django, etc). Off the top of my head, I would say that this won't have enough time to bake properly for inclusion in 3.2, but if the potential benefits and intended backwards compatibility are borne out by real world usage and the code fares well under review then it may be a contender for 3.3. If the backwards compatibility isn't quite there (and can't be improved), then adding it under a name other than "re" wouldn't be impossible, but it would be a harder sell. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
On Fri, Jul 9, 2010 at 5:52 AM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
The list of fixed bugs/new features is certainly impressive. How does Python's test suite go if you drop it in place of the current "re" module? (Ditto for test suites of major applications and frameworks like Django, etc).
Off the top of my head, I would say that this won't have enough time to bake properly for inclusion in 3.2, but if the potential benefits and intended backwards compatibility are borne out by real world usage and the code fares well under review then it may be a contender for 3.3. If the backwards compatibility isn't quite there (and can't be improved), then adding it under a name other than "re" wouldn't be impossible, but it would be a harder sell.
You should be able to replace:
import re
with:
import regex as re
and still have everything work the same, i.e. it's backwards compatible with re.
On Fri, Jul 9, 2010 at 7:54 AM, MRAB <python@mrabarnett.plus.com> wrote:
You should be able to replace:
import re
with:
import regex as re
and still have everything work the same, i.e. it's backwards compatible with re.
That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as "re.py". Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
On Fri, Jul 9, 2010 at 7:54 AM, MRAB <python@mrabarnett.plus.com> wrote:
You should be able to replace:
import re
with:
import regex as re
and still have everything work the same, i.e. it's backwards compatible with re.
That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as "re.py".
Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible.
I had to recompile the .pyd to change its internal name from "regex" to "re", but apart from that it passed Python's own test suite except for where I expected it to fail:
1. Some of the inline flags are scoped; for example, putting "(?i)" at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag.
2. The .sub method will treat unmatched groups in an expansion as empty strings. The re module raises an exception in such cases, which means that users currently need a workaround.
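To make those two differences concrete, here is a minimal sketch (assuming the 2010-era re behaviour and the scoped-flag semantics regex had at the time, with regex installed from PyPI):

    import re
    import regex  # MRAB's module from PyPI

    # 1. Inline flag position: re applies "(?i)" globally, wherever it appears;
    #    regex scopes it, so a trailing "(?i)" affects nothing.
    print(re.match("python(?i)", "PYTHON"))     # matches under 2010-era re
    print(regex.match("python(?i)", "PYTHON"))  # None: the flag comes too late

    # 2. Unmatched groups in a substitution template: regex expands them to
    #    empty strings, while re raises an "unmatched group" error.
    print(regex.sub(r"(a)|(b)", r"<\1><\2>", "a"))  # "<a><>"
    # re.sub(r"(a)|(b)", r"<\1><\2>", "a")          # raises an error under re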
On 09.07.2010 02:35, MRAB wrote:
That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as "re.py".
Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible.
I had to recompile the .pyd to change its internal name from "regex" to "re", but apart from that it passed Python's own test suite except for where I expected it to fail:
1. Some of the inline flags are scoped; for example, putting "(?i)" at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag.
That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups.
2. The .sub method will treat unmatched groups in an expansion as empty strings. The re module raises an exception in such cases, which means that users currently need a workaround.
That is a change for the better, I would say. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
2010/7/9 Georg Brandl <g.brandl@gmx.net>:
On 09.07.2010 02:35, MRAB wrote:
1. Some of the inline flags are scoped; for example, putting "(?i)" at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag.
That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups.
I just noticed the formulation regarding this kind of flag on the reference page regular-expressions.info: "(?i) Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)" and likewise for other flags.
http://www.regular-expressions.info/refadv.html
I am not sure how "authoritative" this page by Jan Goyvaerts is for various implementations, but it looks like a very comprehensive reference. I think that with a new regex implementation, not all of these "historical" semantics must be copied, unless there are major real use cases that would be affected by this.
Just a thought;
Vlastimil Brom
On 16.07.2010 17:08, Vlastimil Brom wrote:
2010/7/9 Georg Brandl <g.brandl@gmx.net>:
On 09.07.2010 02:35, MRAB wrote:
1. Some of the inline flags are scoped; for example, putting "(?i)" at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag.
That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups.
I just noticed the formulation regarding this kind of flag on the reference page regular-expressions.info: "(?i) Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)" and likewise for other flags.
http://www.regular-expressions.info/refadv.html
I am not sure how "authoritative" this page by Jan Goyvaerts is for various implementations, but it looks like a very comprehensive reference.
Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex.
I think that with a new regex implementation, not all of these "historical" semantics must be copied, unless there are major real use cases that would be affected by this.
As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) incompatible in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl <g.brandl@gmx.net> wrote:
Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex.
I think that with a new regex implementation, not all of these "historical" semantics must be copied, unless there are major real use cases that would be affected by this.
As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) incompatible in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable.
I wonder if the solution could be a new flag that you have to specify if you want the flags to apply locally only. Let's say add 'L' to the flags. Then for existing code (not using 'L') the flags should apply globally, while someone wanting flags to apply to only part of a regex can add 'L' to the flags syntax. Is that acceptable? -- --Guido van Rossum (python.org/~guido)
On 22.07.2010 12:53, Guido van Rossum wrote:
On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl <g.brandl@gmx.net> wrote:
Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex.
I think that with a new regex implementation, not all of these "historical" semantics must be copied, unless there are major real use cases that would be affected by this.
As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) incompatible in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable.
I wonder if the solution could be a new flag that you have to specify if you want the flags to apply locally only. Let's say add 'L' to the flags. Then for existing code (not using 'L') the flags should apply globally, while someone wanting flags to apply to only part of a regex can add 'L' to the flags syntax.
Is that acceptable?
I guess it would be; however it remains to be proven that this scoping is actually needed in addition to the (backwards compatible) grouped scoping. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
2010/7/8 MRAB <python@mrabarnett.plus.com>:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
It would really be nice if you explained incrementally everything you changed, so it could better be evaluated. -- Regards, Benjamin
On Thu, 08 Jul 2010 20:52:44 +0100 MRAB <python@mrabarnett.plus.com> wrote:
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
Can you publish these benchmarks somewhere? (or send them here)
How much interest would there be in putting it in Python 3.2?
I think there's certainly interest (especially given that the original re module doesn't really have an expert and active maintainer). Since it's a very big change (and a rather annoying one to undo), though, it must really not add any maintenance problems, and ideally you should promise to maintain it at least for a couple of years. Bonus points if the internals are sufficiently documented, too. Thanks Antoine.
On Thu, Jul 8, 2010 at 10:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ -- anatoly t.
On Fri, Jul 9, 2010 at 7:06 AM, anatoly techtonik <techtonik@gmail.com> wrote:
On Thu, Jul 8, 2010 at 10:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/
While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard.
On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin <jyasskin@gmail.com> wrote:
While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard.
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib. -- anatoly t.
On 7/11/2010 5:19 AM, anatoly techtonik wrote:
On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin<jyasskin@gmail.com> wrote:
While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard.
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib. Eric.
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. -- Steven D'Aprano
On Sun, Jul 11, 2010 at 7:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated.
While I realize I'm not the target audience for this, there are a lot of things I'd like to see more in the stdlib than a second re engine. Geremy Condra
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated.
re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 12/07/2010 15:07, Nick Coghlan wrote:
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano<steve@pearwood.info> wrote:
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated.
re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves.
If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer... Michael
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
Cheers, Nick.
-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
On Mon, 12 Jul 2010 16:18:38 +0100 Michael Foord <fuzzyman@voidspace.org.uk> wrote:
On 12/07/2010 15:07, Nick Coghlan wrote:
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano<steve@pearwood.info> wrote:
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated.
re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves.
If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer...
You're forgetting the maintenance cost. Regards Antoine.
On Mon, 2010-07-12 at 16:18 +0100, Michael Foord wrote:
On 12/07/2010 15:07, Nick Coghlan wrote:
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano<steve@pearwood.info> wrote:
re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves.
If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer...
I'm not sure how common those cases are - I played around with RE2 a few months ago and found that the majority of regular expressions in my own code were noticeably slower running under RE2 than python's re module - RE2 just puts much nicer theoretical bounds on cases that were (in my code at least) unusual. The really good use case for RE2 is for applications where users can write regular expressions as input (exactly what RE2 was designed for) - but I'm not sure how common those applications are. Tim Wintle
On Mon, Jul 12, 2010 at 8:18 AM, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
On 12/07/2010 15:07, Nick Coghlan wrote:
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano<steve@pearwood.info> wrote:
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
Is "it" re2 or regex? I don't see having 2 regular expression engines in the stdlib.
There's precedent though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated.
re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves.
If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer...
FWIW, this has all been discussed before: http://aspn.activestate.com/ASPN/Mail/Message/python-dev/3829265. In particular, I still believe that, "it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.)" Collin
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
Cheers, Nick.
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects. Reid
On 13/07/2010 15:17, Reid Kleckner wrote:
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects.
If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me. Michael
Reid
On Tue, 13 Jul 2010 15:20:23 +0100 Michael Foord <fuzzyman@voidspace.org.uk> wrote:
On 13/07/2010 15:17, Reid Kleckner wrote:
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects.
If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me.
To me as well. The code needs a full review before integrating, though. Regards Antoine.
On 13.07.2010 15:35, Antoine Pitrou wrote:
On Tue, 13 Jul 2010 15:20:23 +0100 Michael Foord <fuzzyman@voidspace.org.uk> wrote:
On 13/07/2010 15:17, Reid Kleckner wrote:
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan<ncoghlan@gmail.com> wrote:
MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look.
I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects.
If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me.
To me as well. The code needs a full review before integrating, though.
FWIW, I've now run the Pygments test suite (Pygments has about 2500 regular expressions that are exercised there) and only had two problems:
* Scoped flags: A few lexers use (?s) and similar flags at the end of the expression, which has no effect in regex currently.
* POSIX character classes: One regex used a class '[][:xyz]', so the [: was seen as the start of a character class. I'm not sure how common this is, as most people seem to escape brackets in character classes. Also, it gives a clear error on regex.compile(), not "mysterious" failures.
Timings (seconds to run the test suite):
re     26.689  26.015  26.008
regex  26.066  25.797  25.865
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
Georg
-- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
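The caching effect Georg describes is easy to reproduce with a small synthetic sketch (the pattern names and loop counts here are invented; only the private _MAXCACHE knob comes from the message above):

    import re
    import time

    re._MAXCACHE = 100  # the 2010-era default; try 1024 to avoid overflow

    patterns = ["word%d" % i for i in range(500)]  # more patterns than cache slots
    start = time.time()
    for _ in range(50):
        for p in patterns:
            re.match(p, "word123")  # module-level calls go through the pattern cache
    print("%.3f seconds" % (time.time() - start))

With 500 distinct patterns and a 100-entry cache that is cleared wholesale on overflow, nearly every call recompiles; raising _MAXCACHE above 500 turns the second and later passes into pure cache hits.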
On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl <g.brandl@gmx.net> wrote:
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
That still fits with the compile/match performance trade-off changes between re and regex though. It does make it clear this isn't going to be a win across the board though - things like test suites are going to have more one-off regex operations than a long-running web server or a filesystem or database scanning operation. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 22.07.2010 14:12, Nick Coghlan wrote:
On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl <g.brandl@gmx.net> wrote:
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
That still fits with the compile/match performance trade-off changes between re and regex though. It does make it clear this isn't going to be a win across the board though - things like test suites are going to have more one-off regex operations than a long-running web server or a filesystem or database scanning operation.
Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On Thu, Jul 22, 2010 at 7:42 AM, Georg Brandl <g.brandl@gmx.net> wrote:
On 22.07.2010 14:12, Nick Coghlan wrote:
On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl <g.brandl@gmx.net> wrote:
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
It might be fun to do a pygments based macro benchmark. Just have it syntax highlight itself and time it.
Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element.
+50 for the last idea. Collin encountered a problem two summers ago in Mondrian where we were relying on the regex cache and were surprised to find that it cleared itself after filling up, rather than using LRU or random eviction. Reid
On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl <g.brandl@gmx.net> wrote:
Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element.
Yikes, I didn't know it did that. That certainly sounds like it should be an RFE in its own right - some basic form of Least Recently Used accounting should be beneficial (although the extra bookkeeping might hurt scripts that aren't hitting the cache limit). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Jul 22, 2010 at 3:26 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl <g.brandl@gmx.net> wrote:
Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element.
Yikes, I didn't know it did that. That certainly sounds like it should be an RFE in its own right - some basic form of Least Recently Used accounting should be beneficial (although the extra bookkeeping might hurt scripts that aren't hitting the cache limit).
A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow. With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%. Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. http://svn.python.org/view?view=rev&revision=83173 This change should be incorporated into MRAB's regex module in order to keep comparisons fair. -gps
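The committed change isn't reproduced here, but the policy described above amounts to something like the following sketch (the function name and the 20%/divisor detail are taken from the description; the rest is assumption):

    import random

    def shrink_cache(cache, max_length, divisor=5):
        # On overflow, evict len(cache)//divisor (about 20%) randomly chosen
        # entries instead of clearing the whole cache.
        if len(cache) < max_length:
            return
        to_drop = max(1, len(cache) // divisor)
        for key in random.sample(list(cache), to_drop):
            del cache[key]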
Gregory P. Smith, 27.07.2010 07:40:
A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow.
With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%.
Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully.
The same algorithm should be helpful in ElementTree's ElementPath module. Stefan
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel <stefan_ml@behnel.de> wrote:
Gregory P. Smith, 27.07.2010 07:40:
A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow.
With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%.
Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully.
The same algorithm should be helpful in ElementTree's ElementPath module.
We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well. That's three...time to figure out how to share the code? -- R. David Murray www.bitdance.com
R. David Murray, 28.07.2010 03:43:
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel wrote:
Gregory P. Smith, 27.07.2010 07:40:
A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow.
With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%.
Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully.
The same algorithm should be helpful in ElementTree's ElementPath module.
We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well.
That's three...time to figure out how to share the code?
What about actually putting it visibly into the stdlib? Except for files, I didn't see much about caching there, which seems like a missing battery to me. Why not do it as with the collections module and add stuff as it comes in? Stefan
On Tue, Jul 27, 2010 at 10:29 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
What about actually putting it visibly into the stdlib? Except for files, I didn't see much about caching there, which seems like a missing battery to me. Why not do it as with the collections module and add stuff as it comes in?
Caching is a rather large topic to focus on in a single stdlib module or package. It means radically different things to different people. The pattern we're talking about there is probably something more like memoization, so 'memo' or 'memoization' could be a reasonable module name. Maybe it could provide a decorator that you configure with things like how many entries max, the expiration policy (or different policies could have different decorators), and how to compute the memo key from the arguments to the function (the default could be the repr() of the argument tuple). -- --Guido van Rossum (python.org/~guido)
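As a sketch of the decorator Guido describes (the memoize name, the LRU eviction, and the max_entries knob are all illustrative, not an actual stdlib API):

    import collections
    import functools

    def memoize(max_entries=128):
        def decorator(func):
            cache = collections.OrderedDict()
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Default memo key: the repr() of the argument tuple.
                key = repr((args, sorted(kwargs.items())))
                if key in cache:
                    cache[key] = cache.pop(key)  # re-insert: mark as recently used
                    return cache[key]
                result = func(*args, **kwargs)
                cache[key] = result
                if len(cache) > max_entries:
                    cache.popitem(last=False)  # evict the least recently used entry
                return result
            return wrapper
        return decorator

Different expiration policies would then just be variants of the eviction step at the end.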
On Jul 28, 2010, at 7:31 AM, Guido van Rossum wrote:
On Tue, Jul 27, 2010 at 10:29 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
What about actually putting it visibly into the stdlib? Except for files, I didn't see much about caching there, which seems like a missing battery to me. Why not do it as with the collections module and add stuff as it comes in?
Caching is a rather large topic to focus on in a single stdlib module or package. It means radically different things to different people.
The pattern we're talking about there is probably something more like memoization, so 'memo' or 'memoization' could be a reasonable module name. Maybe it could provide a decorator that you configure with things like how many entries max, the expiration policy (or different policies could have different decorators), and how to compute the memo key from the arguments to the function (the default could be the repr() of the argument tuple).
FWIW, this has been a popular caching/memoization recipe: http://code.activestate.com/recipes/498245-lru-cache-decorator Raymond
I think this is better suited for python-ideas, so moving it there. Guido van Rossum, 28.07.2010 16:31:
On Tue, Jul 27, 2010 at 10:29 PM, Stefan Behnel wrote:
What about actually putting it visibly into the stdlib? Except for files, I didn't see much about caching there, which seems like a missing battery to me. Why not do it as with the collections module and add stuff as it comes in?
Caching is a rather large topic to focus on in a single stdlib module or package. It means radically different things to different people.
Sure. All I would like to see in the stdlib is a dict based cache with a bounded size and a set of different policies that keep it at that size. Maybe a BoundedDict would match that need.
The pattern we're talking about there is probably more something like memoization, so 'memo' or 'memoization' could b a reasonable module name.
That would be a second interface to it, I'd say. The normal interface would just be the regular mapping interface, maybe even just a .get() method instead of a KeyError raising lookup.
Maybe it could provide a decorator that you configure with things like how many entries max, the expiration policy (or different policies could have different decorators), and how to compute the memo key from the arguments to the function (the default could be the repr() of the argument tuple).
Absolutely. Stefan
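A minimal sketch of the BoundedDict idea (the class name comes from the message above; random eviction stands in for whichever policy would be chosen):

    import random

    class BoundedDict(dict):
        # A dict that never grows beyond max_size entries.
        def __init__(self, max_size):
            dict.__init__(self)
            self.max_size = max_size

        def __setitem__(self, key, value):
            if key not in self and len(self) >= self.max_size:
                del self[random.choice(list(self))]  # evict one entry at random
            dict.__setitem__(self, key, value)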
On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray <rdmurray@bitdance.com> wrote:
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel <stefan_ml@behnel.de> wrote:
Gregory P. Smith, 27.07.2010 07:40:
A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow.
With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%.
Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully.
The same algorithm should be helpful in ElementTree's ElementPath module.
We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well.
That's three...time to figure out how to share the code?
No doubt. It's already a standalone _shrink_cache function with unit tests that doesn't care what the dictionaries it is shrinking are composed of. Easy enough to move somewhere more useful. Any proposed stdlib locations? I'll be offline on vacation soon so I may not get to it for a couple weeks but feel free to move it without me if anyone is interested. -gps
On Wed, Jul 28, 2010 at 4:50 PM, Gregory P. Smith <greg@krypto.org> wrote:
On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray <rdmurray@bitdance.com> wrote:
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel <stefan_ml@behnel.de> wrote:
Gregory P. Smith, 27.07.2010 07:40:
Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully.
The same algorithm should be helpful in ElementTree's ElementPath module.
We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well.
That's three...time to figure out how to share the code?
No doubt.
Anyone remember off the top of their head what linecache does? ... ah, it's just unbounded unless you call clearcache(). Probably OK for that particular application.
It's already a standalone _shrink_cache function with unit tests that doesn't care what the dictionaries it is shrinking are composed of. Easy enough to move somewhere more useful. Any proposed stdlib locations? I'll be offline on vacation soon so I may not get to it for a couple weeks but feel free to move it without me if anyone is interested.
I created a tracker issue for the idea so we don't forget about it: http://bugs.python.org/issue9396 collections seems like a fairly obvious possibility, but the flavour doesn't quite seem right for that. Nowhere else springs to mind though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 07/22/2010 01:34 PM, Georg Brandl wrote:
Timings (seconds to run the test suite):
re     26.689  26.015  26.008
regex  26.066  25.797  25.865
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
This seems to point to re being significantly *faster* than regex, even in matching, and as such may be something the author would want to look into. Nick writes:
That still fits with the compile/match performance trade-off changes between re and regex though.
The performance trade-off should make regex slower with a sufficiently small compiled-regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache?
On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic <hrvoje.niksic@avl.com> wrote:
The performance trade-off should make regex slower with a sufficiently small compiled-regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache?
That's not quite accurate. The figure that matters is "matches per compilation", and that is ultimately governed by the nature of the application once the cache is sufficiently large (e.g. an application that compiles 10 different regular expressions, but also only matches each one against a single string is going to be slowed down significantly by a switch to regex no matter how big the compilation cache may be).
The total runtime for a particular re matching exercise is "(average compilation time)*(number of times compiled) + (average match time)*(number of matches performed)".
We know that the new regex module speeds up matching at the cost of slowing down compilation. Whether that is a net increase for the application as a whole depends on 4 things:
X: the average speed change in compilation for the application
Y: the average speed change in matching for the application
A: the proportion of time the application currently spends compiling regular expressions
B: the proportion of time the application currently spends matching regular expressions
The overall proportional speed change for re operations in a given application is then just X*A + Y*B. For regex to be an improvement for that application, the resulting figure needs to be less than 1.
For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on average, but compilation also takes twice as long), then the overall speed change would be 2*A + 0.5*B. For that to be a wash, the original application needs to spend 1/3 of its expression matching time compiling the expressions and 2/3 actually matching them (the application with the new engine would then spend 2/3 of its time compiling and only 1/3 matching).
That's going to be a tough assessment to make in practice, but I think Georg's pygments test suite example shows that there is real world code out there that already spends a lot of time on compilation.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
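Nick's break-even arithmetic, expressed as a couple of lines of Python for trying out scenarios (the X=2, Y=0.5 figures are his example; the other values are invented):

    def overall_change(X, Y, A, B):
        # Proportional change in total regex time; below 1.0 is a net win.
        return X * A + Y * B

    print(overall_change(2, 0.5, 1 / 3.0, 2 / 3.0))  # 1.0  -- the break-even point
    print(overall_change(2, 0.5, 0.1, 0.9))          # 0.65 -- a matching-heavy app wins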
On 23.07.2010 12:20, Nick Coghlan wrote:
On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic <hrvoje.niksic@avl.com> wrote:
The performance trade-off should make regex slower with a sufficiently small compiled-regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache?
That's not quite accurate. The figure that matters is "matches per compilation", and that is ultimately governed by the nature of the application once the cache is sufficiently large (e.g. an application that compiles 10 different regular expressions, but also only matches each one against a single string is going to be slowed down significantly by a switch to regex no matter how big the compilation cache may be).
The total runtime for a particular re matching exercise is "(average compilation time)*(number of times compiled) + (average match time)*(number of matches performed)"
We know that the new regex module speeds up matching at the cost of slowing down compilation. Whether that is a net increase for the application as a whole depends on 4 things:
X: the average speed change in compilation for the application
Y: the average speed change in matching for the application
A: the proportion of time the application currently spends compiling regular expressions
B: the proportion of time the application currently spends matching regular expressions
The overall proportional speed change for re operations in a given application is then just X*A + Y*B. For regex to be an improvement for that application, the resulting figure needs to be less than 1.
For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on average, but compilation also takes twice as long), then the overall speed change would be 2*A + 0.5*B. For that to be a wash, the original application needs to spend 1/3 of its expression matching time compiling the expressions and 2/3 actually matching them (the application with the new engine would then spend 2/3 of its time compiling and only 1/3 matching).
That's going to be a tough assessment to make in practice, but I think Georg's pygments test suite example shows that there is real world code out there that already spends a lot of time on compilation.
And for Pygments, it's not only the case for the test suite, since many applications consist of highlighting one file or one piece of code by invoking it as a program. When using Pygments as a library, regexes are only compiled once per lexer class, so e.g. in a web application that highlights files (like a repository browser), the additional cost is negligible. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On 23.07.2010 11:16, Hrvoje Niksic wrote:
On 07/22/2010 01:34 PM, Georg Brandl wrote:
Timings (seconds to run the test suite):
re     26.689  26.015  26.008
regex  26.066  25.797  25.865
So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re.
This seems to point to re being significantly *faster* than regexp, even in matching, and as such may be something the author would want to look into.
Nick writes:
That still fits with the compile/match performance trade-off changes between re and regex though.
The performance trade-off should make regex slower with a sufficiently small compiled-regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache?
I did measure that, and there are no significant differences in timing. I also did the check the other way around: restricting regex._MAXCACHE to 100, the time went from 26 seconds to 42 seconds. (Nick, is that enough data to calculate A and B now? ;) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On Sun, Jul 11, 2010 at 7:19 PM, anatoly techtonik <techtonik@gmail.com> wrote:
On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin <jyasskin@gmail.com> wrote:
While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard.
The re2 comparison is interesting from the point of view of whether it should be included in the stdlib.
No it isn't. re2 is a wrapper for a third party engine that surrenders some functionality in the pursuit of better bounds on memory and CPU usage. It is not a drop-in replacement for re and cannot be by design: "The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently." There is no reason to have two distinct regex engines in the standard library - if someone knows enough to realise they need the performance assurances of re2, they're also likely to be able to find the Python wrappers for it. regex is potentially interesting for the standard library as it *is* intended to be a drop-in replacement for re that trades longer compilation times (typically once per application) for faster match times (potentially many times per application). The performance of re2 has nothing to do with the comparison between the current re module and MRAB's regex module. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
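For illustration, one of the constructs RE2 drops but re (and regex) keep, per the passage quoted above:

    import re

    # A backreference: match a doubled word. This is outside RE2's language by design.
    m = re.search(r"\b(\w+)\s+\1\b", "the the quick brown fox")
    print(m.group(1))  # "the"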
anatoly techtonik wrote:
On Thu, Jul 8, 2010 at 10:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/
The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages:
Python26
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.135secs  0.083secs  1.63
bm_regex_v8      0.153secs  0.085secs  1.80
Python31
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.138secs  0.083secs  1.66
bm_regex_v8      0.170secs  0.091secs  1.87
On Fri, Jul 9, 2010 at 10:28 AM, MRAB <python@mrabarnett.plus.com> wrote:
anatoly techtonik wrote:
On Thu, Jul 8, 2010 at 10:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/
The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages:
Python26
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.135secs  0.083secs  1.63
bm_regex_v8      0.153secs  0.085secs  1.80

Python31
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.138secs  0.083secs  1.66
bm_regex_v8      0.170secs  0.091secs  1.87
Out of curiosity, what are the results for the bm_regex_compile benchmark?

Collin
Collin Winter wrote:
On Fri, Jul 9, 2010 at 10:28 AM, MRAB <python@mrabarnett.plus.com> wrote:
anatoly techtonik wrote:
On Thu, Jul 8, 2010 at 10:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.

And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/
The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages:
Python26
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.135secs  0.083secs  1.63
bm_regex_v8      0.153secs  0.085secs  1.80

Python31
BENCHMARK        re         regex      ratio
bm_regex_effbot  0.138secs  0.083secs  1.66
bm_regex_v8      0.170secs  0.091secs  1.87
Out of curiosity, what are the results for the bm_regex_compile benchmark?
I concentrated my efforts on the matching speed because regexes tend to be compiled only once, and are cached anyway, so I don't think it's as important. The results are:

Python26
BENCHMARK         re         regex      ratio
bm_regex_compile  0.897secs  2.792secs  0.32

Python31
BENCHMARK         re         regex      ratio
bm_regex_compile  0.902secs  2.731secs  0.33

If anyone can demonstrate that it'll have a significant impact in practice then I will, of course, look into it further.
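The compile-time gap is easy to observe directly by purging the pattern cache between compilations; a rough sketch, assuming both modules expose the same purge() helper that re does:

    import timeit

    stmt = "mod.purge(); mod.compile(r'(?:\\d{1,3}\\.){3}\\d{1,3}')"

    for name in ("re", "regex"):
        t = timeit.timeit(stmt, setup="import %s as mod" % name, number=10000)
        print("%-6s %.3fs" % (name, t))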
On Fri, Jul 9, 2010 at 3:35 PM, MRAB <python@mrabarnett.plus.com> wrote:
I concentrated my efforts on the matching speed because regexes tend to be compiled only once, and are cached anyway, so I don't think it's as important.
I think most here will agree with that, but it might be good to keep in mind that the sre implementation has already made that trade-off once. I don't remember what the compilation slowdown was at the time, but I'd be surprised if Google can't find it, given sufficient fu.

-Fred

--
Fred L. Drake, Jr. <fdrake at gmail.com>
"A storm broke loose in my mind." --Albert Einstein
On Thu, Jul 8, 2010 at 12:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
A big +1 from me. I'm all for putting it in Python 3.2 alphas and beta1 as a *replacement* for the existing 're' module in order to get it visibility and a shakeout, and to make this available to the world sooner rather than later. This module was designed for compatibility for exactly that reason. Let's give it a high-publicity fair shake to iron out any unknown compatibility issues early on so that we can decide how to proceed.

Also, +1 to Antoine's point on volunteering to be the maintainer of it for the next couple of major releases.

(I do expect someone to balk at this idea with "but python 3.x doesn't have enough users yet for a beta cycle to shake out hidden regular expression engine assumption bugs", to which I'll respond in advance with "that's a good thing; we can add a regular expression sniffer to 2to3 that flags potential problems in code being ported over".)

-gps
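One low-effort way to get the shakeout Gregory describes, without touching the installed standard library, is to alias the new module before anything else imports re. A sketch; dropping it into sitecustomize.py is one option, but any code that runs before application imports would do:

    # sitecustomize.py -- runs at interpreter startup, before application code
    import sys
    import regex

    # Every subsequent 'import re' now resolves to the regex module.
    sys.modules["re"] = regex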
2010/7/8 MRAB <python@mrabarnett.plus.com>:
Hi all,
I re-implemented the re module, adding new features and speed improvements. It's available at:
http://pypi.python.org/pypi/regex
under the name "regex" so that it can be tried alongside "re".
I'd be interested in any comments or feedback. How does it compare with "re" in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable.
How much interest would there be in putting it in Python 3.2?
Hi, please let me apologize for posting here without being a Python developer; I'd like to support the inclusion of the new regex library in the standard lib.

I have used it since the early development versions in my internal app for searching, modifying, ordering and extracting data from text - mainly using manually created regex patterns. I realise that this is only one specific use case, and the app isn't time or space critical (input texts up to a few MB, mostly smaller; the processing time is often rather negligible compared to the gui presentation, styling etc.). However, I see it as a great enhancement both in terms of regex features (listed on http://pypi.python.org/pypi/regex ) and in its behaviour in some corner cases which aren't effectively usable in the current re (e.g. nested subexpressions with quantifiers - though many of these are more effectively solved with the added possessive quantifiers).

I think this is a far more feature-complete engine, which doesn't introduce any significant drawbacks (IMO) compared to the current re, and is backwards compatible. (The mentioned exception in the scoped flags might be fixable by allowing only explicit scoping (?flags)...(?-flags), or using the current paren where possible (?flag:...), and treating the bare flag-setting parens as global; however, I would consider it quite misleading in the current re if these flags were set at some place other than the beginning of the pattern. I don't see the readability enhanced in any way by flags set at the end, given that they apply from the beginning of the pattern.)

Having seen MRAB's commitment during the development phase to both bugfixes and feature additions - including some rather complex ones (in my opinion) like unicode properties - I am also confident that he could be a competent maintainer of this package in the standard lib as well.

Just my biased opinion...

Regards,
Vlastimil Brom
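An illustration of the corner case mentioned above: nested quantifiers such as (a+)+ can make re backtrack exponentially on a near-miss input, while a possessive quantifier (one of the features listed on the PyPI page) never gives back characters and fails in a single pass. A small sketch:

    import regex

    # r'(a+)+b' in re backtracks exponentially against 'aaa...ac'.
    # The possessive 'a++' refuses to give characters back, so the
    # missing 'b' is detected immediately.
    print(regex.match(r"a++b", "a" * 30 + "c"))  # -> None, returned quickly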
participants (22)
- anatoly techtonik
- Antoine Pitrou
- Benjamin Peterson
- Collin Winter
- Eric Smith
- Fred Drake
- Georg Brandl
- geremy condra
- Gregory P. Smith
- Guido van Rossum
- Hrvoje Niksic
- Jeffrey Yasskin
- Michael Foord
- MRAB
- Nick Coghlan
- R. David Murray
- Raymond Hettinger
- Reid Kleckner
- Stefan Behnel
- Steven D'Aprano
- Tim Wintle
- Vlastimil Brom