Guido asked me to update my old SRE benchmarks, and post them to python-dev. Summary: -- SRE is usually faster than the old RE module (PRE). -- SRE is faster than REGEX on anything but very trivial patterns and short target strings. And in some cases, it's even faster than a corresponding string.find... -- on real-life benchmarks like XML parsing and Python tokenizing, SRE is 2-3 times faster than PRE. -- using Unicode strings instead of 8-bit strings doesn't hurt performance (for some tests, the Unicode version is 30-40% faster on my machine. Go figure...) -- PRE is still faster for some patterns, especially when using long target strings. I know why, and I plan to fix that before 2.0 final. enjoy /F -------------------------------------------------------------------- These tests were made on a P3/233 MHz running Windows 95, using a local build of the 0.9.8 release (this will go into 1.6b1, I suppose). -------------------------------------------------------------------- parsing xml: running xmllib.py on hamlet.xml (280k): sre8 7.14 seconds sre16 7.82 seconds pre 17.17 seconds (for the sre16 test, the xml file was converted to unicode before it was fed to the unmodified parser). for comparision, here's the results for a couple of fast pure-Python parsers: rex/pre 2.44 seconds rex/sre 0.59 seconds srex/sre 0.16 seconds (rex is a shallow XML parser, based on code by Robert Cameron. srex is an even simpler shallow parser, using sre's template mode). -------------------------------------------------------------------- parsing python: running tokenize.py on Tkinter.py (156k): sre8 3.23 seconds pre 7.57 seconds -------------------------------------------------------------------- searching for literal text: searching for "spam" in a string padded with "spaz" (1000 bytes on each side of the target): string.find 0.112 ms sre8.search 0.059 pre.search 0.122 unicode.find 0.130 sre16.search 0.065 (yes, regular expressions can run faster than optimized C code -- as long as we don't take compilation time into account ;-) same test, without any false matches: string.find 0.035 ms sre8.search 0.050 pre.search 0.116 unicode.find 0.031 sre16.search 0.055 -------------------------------------------------------------------- compiling regular expressions compiling the 480 tests in the standard test suite: sre 1.22 seconds pre 0.05 seconds or in other words, pre (using a compiler written in C) can compile just under 10,000 patterns per second. sre can only compile about 400 pattern per second. do we care? ;-) (footnote: sre's pattern cache stores 100 patterns. pre's cache hold 20 patterns, iirc). -------------------------------------------------------------------- benchmark suite to round off this report, here's a couple of "micro benchmarks". all times are in milliseconds. n= 0 5 50 250 1000 5000 ----- ----- ----- ----- ----- ----- ----- pattern 'Python|Perl', string '-'*n+'Perl'+'-'*n sre8 0.014 0.013 0.013 0.016 0.027 0.079 sre16 0.014 0.014 0.015 0.018 0.025 0.076 pre 0.107 0.109 0.114 0.116 0.135 0.259 regex 0.011 0.011 0.012 0.016 0.033 0.122 pattern 'Python|Perl', string 'P'*n+'Perl'+'P'*n sre8 0.013 0.016 0.030 0.100 0.358 1.716 sre16 0.014 0.015 0.030 0.094 0.347 1.649 pre 0.115 0.112 0.158 0.351 1.085 5.002 regex 0.010 0.016 0.060 0.271 1.022 5.162 (false matches causes problems for pre and regex) pattern '(Python|Perl)', string '-'*n+'Perl'+'-'*n sre8 0.014 0.016 0.030 0.099 0.362 1.684 sre16 0.015 0.016 0.030 0.094 0.340 1.623 pre 0.110 0.111 0.112 0.119 0.143 0.267 regex 0.012 0.012 0.013 0.017 0.034 0.124 (in 0.9.8, sre's optimizer doesn't grok named groups, and it doesn't realize that this pattern has to start with a "P") pattern '(?:Python|Perl)', string '-'*n+'Perl'+'-'*n sre8 0.013 0.013 0.014 0.016 0.027 0.079 sre16 0.015 0.014 0.016 0.018 0.026 0.075 pre 0.108 0.135 0.113 0.137 0.140 0.275 regex skip (anonymous groups work better) pattern 'Python', string '-'*n+'Python'+'-'*n sre8 0.013 0.013 0.014 0.019 0.039 0.148 sre16 0.013 0.013 0.014 0.020 0.043 0.187 pre 0.129 0.105 0.109 0.117 0.191 0.277 regex 0.011 0.025 0.018 0.016 0.037 0.127 pattern 'Python', string 'P'*n+'Python'+'P'*n sre8 0.040 0.012 0.021 0.026 0.080 0.248 sre16 0.012 0.013 0.015 0.025 0.061 0.283 pre 0.110 0.148 0.153 0.338 0.925 4.355 regex 0.013 0.013 0.041 0.155 0.535 2.628 (as we saw in the string.find test, sre is very fast when there are lots of false matches) pattern '.*Python', string '-'*n+'Python'+'-'*n sre8 0.016 0.017 0.026 0.067 0.217 1.039 sre16 0.016 0.017 0.026 0.067 0.218 1.076 pre 0.111 0.112 0.124 0.180 0.386 1.494 regex 0.015 0.022 0.073 0.408 1.669 8.489 pattern '.*Python.*', string '-'*n+'Python'+'-'*n sre8 0.016 0.017 0.030 0.089 0.315 1.499 sre16 0.016 0.018 0.032 0.090 0.314 1.537 pre 0.112 0.113 0.129 0.186 0.413 1.605 regex 0.016 0.023 0.076 0.387 1.674 8.519 pattern '.*(Python)', string '-'*n+'Python'+'-'*n sre8 0.020 0.021 0.044 0.147 0.542 2.630 sre16 0.019 0.021 0.044 0.154 0.541 2.681 pre 0.115 0.117 0.141 0.245 0.636 2.690 regex 0.019 0.026 0.097 0.467 2.007 10.264 pattern '.*(?:Python)', string '-'*n+'Python'+'-'*n sre8 0.016 0.017 0.027 0.065 0.220 1.037 sre16 0.016 0.017 0.026 0.070 0.221 1.066 pre 0.112 0.119 0.136 0.223 0.566 2.377 regex skip pattern 'Python|Perl|Tcl', string '-'*n+'Perl'+'-'*n sre8 0.013 0.015 0.034 0.114 0.407 1.985 sre16 0.014 0.016 0.034 0.109 0.392 1.915 pre 0.107 0.108 0.117 0.124 0.167 0.393 regex 0.012 0.012 0.013 0.017 0.033 0.123 (here's another sre compiler problem: it fails to realize that this pattern starts with characters from a given set [PT]. pre and regex both use bitmaps...) pattern 'Python|Perl|Tcl', string 'P'*n+'Perl'+'P'*n sre8 0.013 0.018 0.055 0.228 0.847 4.165 sre16 0.015 0.027 0.055 0.218 0.821 4.061 pre 0.111 0.116 0.172 0.415 1.354 6.302 regex 0.011 0.019 0.085 0.374 1.467 7.261 (but when there are lots of false matches, sre is faster anyway. interesting...) pattern '(Python|Perl|Tcl)', string '-'*n+'Perl'+'-'*n sre8 0.014 0.018 0.042 0.152 0.575 2.798 sre16 0.015 0.019 0.042 0.148 0.556 2.715 pre 0.112 0.111 0.116 0.129 0.172 0.408 regex 0.012 0.013 0.014 0.018 0.035 0.124 pattern '(?:Python|Perl|Tcl)', string '-'*n+'Perl'+'-'*n sre8 0.014 0.016 0.034 0.113 0.405 1.987 sre16 0.016 0.016 0.033 0.112 0.393 1.918 pre 0.109 0.109 0.112 0.128 0.177 0.397 regex skip pattern '(Python)\\1', string '-'*n+'PythonPython'+'-'*n sre8 0.014 0.018 0.030 0.096 0.342 1.673 sre16 0.015 0.016 0.031 0.094 0.330 1.625 pre 0.112 0.111 0.112 0.119 0.141 0.268 regex 0.011 0.012 0.013 0.017 0.033 0.123 pattern '(Python)\\1', string 'P'*n+'PythonPython'+'P'*n sre8 0.013 0.016 0.035 0.111 0.411 1.976 sre16 0.015 0.016 0.034 0.112 0.416 1.992 pre 0.110 0.116 0.160 0.355 1.051 4.797 regex 0.011 0.017 0.047 0.200 0.737 3.680 pattern '([0a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n sre8 0.084 0.091 0.143 0.371 1.160 6.165 sre16 0.086 0.090 0.142 0.470 1.258 7.827 pre 0.155 0.140 0.185 0.200 0.280 0.523 regex 0.018 0.018 0.020 0.024 0.137 0.240 (again, sre's lack of "fastmap" is rather costly) pattern '(?:[0a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n sre8 0.028 0.033 0.077 0.303 1.433 7.140 sre16 0.021 0.027 0.073 0.277 1.031 5.053 pre 0.131 0.131 0.174 0.183 0.227 0.461 regex skip pattern '([a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n sre8 0.032 0.038 0.083 0.288 1.109 5.404 sre16 0.033 0.038 0.083 0.292 1.035 5.802 pre 0.195 0.135 0.176 0.187 0.233 0.468 regex 0.018 0.018 0.019 0.023 0.041 0.131 pattern '(?:[a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n sre8 0.022 0.025 0.067 0.302 1.011 8.245 sre16 0.021 0.026 0.066 0.302 1.103 5.372 pre 0.262 0.397 0.178 0.193 0.250 0.817 regex skip pattern '.*P.*y.*t.*h.*o.*n.*', string '-'*n+'Python'+'-'*n sre8 0.021 0.084 0.118 0.251 0.965 5.414 sre16 0.021 0.025 0.063 0.366 1.192 4.639 pre 0.123 0.147 0.225 0.568 1.899 9.336 regex 0.028 0.060 0.258 1.269 5.497 28.334 --------------------------------------------------------------------
On Wed, Aug 02, 2000 at 11:37:52PM +0200, Fredrik Lundh wrote:
-- SRE is usually faster than the old RE module (PRE).
Once the compiler is translated to C, it might be worth considering making SRE available as a standalone library for use outside of Python. Most other regex libraries either don't do Perl's extensions, or they don't do Unicode. Bonus points if you can get the Perl6 team interested in it. Hmm... here's an old problem that's returned (recursion on repeated group matches, I expect):
p=re.compile('(x)*') p <SRE_Pattern object at 0x8127048> p.match(500000*'x') Segmentation fault (core dumped)
--amk
Guido asked me to update my old SRE benchmarks, and post them to python-dev.
Thanks, Fredrik! This (plus the fact that SRE now passes all PRE tests) makes me very happy with using SRE as the regular expression engine of choice for 1.6 and 2.0. --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
Hmm... here's an old problem that's returned (recursion on repeated group matches, I expect):
p=re.compile('(x)*') p <SRE_Pattern object at 0x8127048> p.match(500000*'x') Segmentation fault (core dumped)
Ouch. Andrew, would you mind adding a test case for that to the re test suite? It's important that this doesn't come back! --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
andrew wrote:
-- SRE is usually faster than the old RE module (PRE).
Once the compiler is translated to C, it might be worth considering making SRE available as a standalone library for use outside of Python.
if it will ever be translated, that is...
Hmm... here's an old problem that's returned (recursion on repeated group matches, I expect):
p=re.compile('(x)*') p <SRE_Pattern object at 0x8127048> p.match(500000*'x') Segmentation fault (core dumped)
fwiw, that pattern isn't portable: $ jpython test.py File "test.py", line 3, in ? java.lang.StackOverflowError and neither is: def nest(level): if level: nest(level-1) nest(500000) ...but sure, I will fix that in 0.9.9 (SRE, not Python -- Christian has already taken care of the other one ;-). but 0.9.9 won't be out before the 1.6b1 release... (and to avoid scaring the hell out of the beta testers, it's probably better to leave the test out of the regression suite until the bug is fixed...) </F>
searching for literal text:
searching for "spam" in a string padded with "spaz" (1000 bytes on each side of the target):
string.find 0.112 ms sre8.search 0.059 pre.search 0.122
unicode.find 0.130 sre16.search 0.065
(yes, regular expressions can run faster than optimized C code -- as long as we don't take compilation time into account ;-)
same test, without any false matches:
string.find 0.035 ms sre8.search 0.050 pre.search 0.116
unicode.find 0.031 sre16.search 0.055
Those results are probably due to the fact that string.find does a brute force search. If it would do a last match char first search or even Boyer-Moore (this only pays off for long search targets) then it should be a lot faster than [s|p]re. Just for compares: would you mind running the search routines in mxTextTools on the same machine ? import TextTools TextTools.find(text, what) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
mal wrote:
Just for compares: would you mind running the search routines in mxTextTools on the same machine ?
searching for "spam" in a string padded with "spaz" (1000 bytes on each side of the target):
string.find 0.112 ms
texttools.find 0.080 ms
sre8.search 0.059 pre.search 0.122
unicode.find 0.130 sre16.search 0.065
same test, without any false matches (padded with "-"):
string.find 0.035 ms
texttools.find 0.083 ms
sre8.search 0.050 pre.search 0.116
unicode.find 0.031 sre16.search 0.055
Those results are probably due to the fact that string.find does a brute force search. If it would do a last match char first search or even Boyer-Moore (this only pays off for long search targets) then it should be a lot faster than [s|p]re.
does the TextTools algorithm work with arbitrary character set sizes, btw? </F>
On Thu, Aug 03, 2000 at 10:27:39AM +0200, Fredrik Lundh wrote:
if it will ever be translated, that is...
I'll agree to take a shot at it (which carries no implication of actually finisihing :) ) post-2.0. It's silly for all of Tcl, Python, Perl to grow their own implementations, when a common implementation could benefit from having 3x the number of eyes looking at it and optimizing it.
fwiw, that pattern isn't portable:
No, it isn't; the straightforward implementation of repeated groups is recursive, and fixing this requires jumping through hoops to make it nonrecursive (or adopting Python's solution and only recursing up to some upper limit). re had to get this right because regex didn't crash on this pattern, and neither do recent Perls. The vast bulk of my patches to PCRE were to fix this problem. --amk
andrew wrote:
Hmm... here's an old problem that's returned (recursion on repeated group matches, I expect):
p=re.compile('(x)*') p <SRE_Pattern object at 0x8127048> p.match(500000*'x') Segmentation fault (core dumped)
Effbot:
fwiw, that pattern isn't portable:
Who cares -- it shouldn't dump core!
...but sure, I will fix that in 0.9.9 (SRE, not Python -- Christian has already taken care of the other one ;-). but 0.9.9 won't be out before the 1.6b1 release...
I assume you are planning to put the backtracking stack back in, as you mentioned in the checkin message?
(and to avoid scaring the hell out of the beta testers, it's probably better to leave the test out of the regression suite until the bug is fixed...)
Even better, is it possible to put a limit on the recursion level before 1.6b1 is released (tomorrow if we get final agreement on the license) so at least it won't dump core? Otherwise you'll get reports of this from people who write this by accident... --Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
Fredrik Lundh wrote:
mal wrote:
Just for compares: would you mind running the search routines in mxTextTools on the same machine ?
searching for "spam" in a string padded with "spaz" (1000 bytes on each side of the target):
string.find 0.112 ms
texttools.find 0.080 ms
sre8.search 0.059 pre.search 0.122
unicode.find 0.130 sre16.search 0.065
same test, without any false matches (padded with "-"):
string.find 0.035 ms
texttools.find 0.083 ms
sre8.search 0.050 pre.search 0.116
unicode.find 0.031 sre16.search 0.055
Those results are probably due to the fact that string.find does a brute force search. If it would do a last match char first search or even Boyer-Moore (this only pays off for long search targets) then it should be a lot faster than [s|p]re.
does the TextTools algorithm work with arbitrary character set sizes, btw?
The find function creates a Boyer-Moore search object for the search string (on every call). It compares 1-1 or using a translation table which is applied to the searched text prior to comparing it to the search string (this enables things like case insensitive search and character sets, but is about 45% slower). Real-life usage would be to create the search objects once per process and then reuse them. The Boyer-Moore table calcuation takes some time... But to answer your question: mxTextTools is 8-bit throughout. A Unicode aware version will follow by the end of this year. Thanks for checking, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
guido wrote:
...but sure, I will fix that in 0.9.9 (SRE, not Python -- Christian has already taken care of the other one ;-). but 0.9.9 won't be out before the 1.6b1 release...
I assume you are planning to put the backtracking stack back in, as you mentioned in the checkin message?
yup -- but that'll have to wait a few more days...
(and to avoid scaring the hell out of the beta testers, it's probably better to leave the test out of the regression suite until the bug is fixed...)
Even better, is it possible to put a limit on the recursion level before 1.6b1 is released (tomorrow if we get final agreement on the license) so at least it won't dump core?
shouldn't be too hard, given that I added a "recursion level counter" in _sre.c revision 2.30. I just added the necessary if-statement. </F>
participants (5)
-
Andrew Kuchling -
Andrew Kuchling -
Fredrik Lundh -
Guido van Rossum -
M.-A. Lemburg