[Python-Dev] SRE 0.9.8 benchmarks

Fredrik Lundh Fredrik Lundh" <effbot@telia.com
Wed, 2 Aug 2000 23:37:52 +0200


Guido asked me to update my old SRE benchmarks, and
post them to python-dev.

Summary:

-- SRE is usually faster than the old RE module (PRE).

-- SRE is faster than REGEX on anything but very trivial
   patterns and short target strings.  And in some cases,
   it's even faster than a corresponding string.find...

-- on real-life benchmarks like XML parsing and Python
   tokenizing, SRE is 2-3 times faster than PRE.

-- using Unicode strings instead of 8-bit strings doesn't hurt
   performance (for some tests, the Unicode version is 30-40%
   faster on my machine.  Go figure...)

-- PRE is still faster for some patterns, especially when using
   long target strings.  I know why, and I plan to fix that before
   2.0 final.

enjoy /F

--------------------------------------------------------------------
These tests were made on a P3/233 MHz running Windows 95,
using a local build of the 0.9.8 release (this will go into 1.6b1,
I suppose).

--------------------------------------------------------------------
parsing xml:

running xmllib.py on hamlet.xml (280k):

sre8             7.14 seconds
sre16            7.82 seconds
pre             17.17 seconds

(for the sre16 test, the xml file was converted to unicode before
it was fed to the unmodified parser).

for comparision, here's the results for a couple of fast pure-Python
parsers:

rex/pre          2.44 seconds
rex/sre          0.59 seconds
srex/sre         0.16 seconds

(rex is a shallow XML parser, based on code by Robert Cameron.  srex
is an even simpler shallow parser, using sre's template mode).

--------------------------------------------------------------------
parsing python:

running tokenize.py on Tkinter.py (156k):

sre8             3.23 seconds
pre              7.57 seconds

--------------------------------------------------------------------
searching for literal text:

searching for "spam" in a string padded with "spaz" (1000 bytes on
each side of the target):

string.find     0.112 ms
sre8.search     0.059
pre.search      0.122

unicode.find    0.130
sre16.search    0.065

(yes, regular expressions can run faster than optimized C code -- as
long as we don't take compilation time into account ;-)

same test, without any false matches:

string.find     0.035 ms
sre8.search     0.050
pre.search      0.116

unicode.find    0.031
sre16.search    0.055

--------------------------------------------------------------------
compiling regular expressions

compiling the 480 tests in the standard test suite:

sre             1.22 seconds
pre             0.05 seconds

or in other words, pre (using a compiler written in C) can
compile just under 10,000 patterns per second.  sre can only
compile about 400 pattern per second.  do we care? ;-)

(footnote: sre's pattern cache stores 100 patterns.  pre's
cache hold 20 patterns, iirc).

--------------------------------------------------------------------
benchmark suite

to round off this report, here's a couple of "micro benchmarks".
all times are in milliseconds.

n=3D        0     5    50   250  1000  5000
----- ----- ----- ----- ----- ----- -----

pattern 'Python|Perl', string '-'*n+'Perl'+'-'*n
sre8  0.014 0.013 0.013 0.016 0.027 0.079
sre16 0.014 0.014 0.015 0.018 0.025 0.076
pre   0.107 0.109 0.114 0.116 0.135 0.259
regex 0.011 0.011 0.012 0.016 0.033 0.122

pattern 'Python|Perl', string 'P'*n+'Perl'+'P'*n
sre8  0.013 0.016 0.030 0.100 0.358 1.716
sre16 0.014 0.015 0.030 0.094 0.347 1.649
pre   0.115 0.112 0.158 0.351 1.085 5.002
regex 0.010 0.016 0.060 0.271 1.022 5.162

(false matches causes problems for pre and regex)

pattern '(Python|Perl)', string '-'*n+'Perl'+'-'*n
sre8  0.014 0.016 0.030 0.099 0.362 1.684
sre16 0.015 0.016 0.030 0.094 0.340 1.623
pre   0.110 0.111 0.112 0.119 0.143 0.267
regex 0.012 0.012 0.013 0.017 0.034 0.124

(in 0.9.8, sre's optimizer doesn't grok named groups, and
it doesn't realize that this pattern has to start with a "P")

pattern '(?:Python|Perl)', string '-'*n+'Perl'+'-'*n
sre8  0.013 0.013 0.014 0.016 0.027 0.079
sre16 0.015 0.014 0.016 0.018 0.026 0.075
pre   0.108 0.135 0.113 0.137 0.140 0.275
regex skip

(anonymous groups work better)

pattern 'Python', string '-'*n+'Python'+'-'*n
sre8  0.013 0.013 0.014 0.019 0.039 0.148
sre16 0.013 0.013 0.014 0.020 0.043 0.187
pre   0.129 0.105 0.109 0.117 0.191 0.277
regex 0.011 0.025 0.018 0.016 0.037 0.127

pattern 'Python', string 'P'*n+'Python'+'P'*n
sre8  0.040 0.012 0.021 0.026 0.080 0.248
sre16 0.012 0.013 0.015 0.025 0.061 0.283
pre   0.110 0.148 0.153 0.338 0.925 4.355
regex 0.013 0.013 0.041 0.155 0.535 2.628

(as we saw in the string.find test, sre is very fast when
there are lots of false matches)

pattern '.*Python', string '-'*n+'Python'+'-'*n
sre8  0.016 0.017 0.026 0.067 0.217 1.039
sre16 0.016 0.017 0.026 0.067 0.218 1.076
pre   0.111 0.112 0.124 0.180 0.386 1.494
regex 0.015 0.022 0.073 0.408 1.669 8.489

pattern '.*Python.*', string '-'*n+'Python'+'-'*n
sre8  0.016 0.017 0.030 0.089 0.315 1.499
sre16 0.016 0.018 0.032 0.090 0.314 1.537
pre   0.112 0.113 0.129 0.186 0.413 1.605
regex 0.016 0.023 0.076 0.387 1.674 8.519

pattern '.*(Python)', string '-'*n+'Python'+'-'*n
sre8  0.020 0.021 0.044 0.147 0.542 2.630
sre16 0.019 0.021 0.044 0.154 0.541 2.681
pre   0.115 0.117 0.141 0.245 0.636 2.690
regex 0.019 0.026 0.097 0.467 2.007 10.264

pattern '.*(?:Python)', string '-'*n+'Python'+'-'*n
sre8  0.016 0.017 0.027 0.065 0.220 1.037
sre16 0.016 0.017 0.026 0.070 0.221 1.066
pre   0.112 0.119 0.136 0.223 0.566 2.377
regex skip

pattern 'Python|Perl|Tcl', string '-'*n+'Perl'+'-'*n
sre8  0.013 0.015 0.034 0.114 0.407 1.985
sre16 0.014 0.016 0.034 0.109 0.392 1.915
pre   0.107 0.108 0.117 0.124 0.167 0.393
regex 0.012 0.012 0.013 0.017 0.033 0.123

(here's another sre compiler problem: it fails to realize
that this pattern starts with characters from a given set
[PT].  pre and regex both use bitmaps...)

pattern 'Python|Perl|Tcl', string 'P'*n+'Perl'+'P'*n
sre8  0.013 0.018 0.055 0.228 0.847 4.165
sre16 0.015 0.027 0.055 0.218 0.821 4.061
pre   0.111 0.116 0.172 0.415 1.354 6.302
regex 0.011 0.019 0.085 0.374 1.467 7.261

(but when there are lots of false matches, sre is faster
anyway.  interesting...)

pattern '(Python|Perl|Tcl)', string '-'*n+'Perl'+'-'*n
sre8  0.014 0.018 0.042 0.152 0.575 2.798
sre16 0.015 0.019 0.042 0.148 0.556 2.715
pre   0.112 0.111 0.116 0.129 0.172 0.408
regex 0.012 0.013 0.014 0.018 0.035 0.124

pattern '(?:Python|Perl|Tcl)', string '-'*n+'Perl'+'-'*n
sre8  0.014 0.016 0.034 0.113 0.405 1.987
sre16 0.016 0.016 0.033 0.112 0.393 1.918
pre   0.109 0.109 0.112 0.128 0.177 0.397
regex skip

pattern '(Python)\\1', string '-'*n+'PythonPython'+'-'*n
sre8  0.014 0.018 0.030 0.096 0.342 1.673
sre16 0.015 0.016 0.031 0.094 0.330 1.625
pre   0.112 0.111 0.112 0.119 0.141 0.268
regex 0.011 0.012 0.013 0.017 0.033 0.123

pattern '(Python)\\1', string 'P'*n+'PythonPython'+'P'*n
sre8  0.013 0.016 0.035 0.111 0.411 1.976
sre16 0.015 0.016 0.034 0.112 0.416 1.992
pre   0.110 0.116 0.160 0.355 1.051 4.797
regex 0.011 0.017 0.047 0.200 0.737 3.680

pattern '([0a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n
sre8  0.084 0.091 0.143 0.371 1.160 6.165
sre16 0.086 0.090 0.142 0.470 1.258 7.827
pre   0.155 0.140 0.185 0.200 0.280 0.523
regex 0.018 0.018 0.020 0.024 0.137 0.240

(again, sre's lack of "fastmap" is rather costly)

pattern '(?:[0a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n
sre8  0.028 0.033 0.077 0.303 1.433 7.140
sre16 0.021 0.027 0.073 0.277 1.031 5.053
pre   0.131 0.131 0.174 0.183 0.227 0.461
regex skip

pattern '([a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n
sre8  0.032 0.038 0.083 0.288 1.109 5.404
sre16 0.033 0.038 0.083 0.292 1.035 5.802
pre   0.195 0.135 0.176 0.187 0.233 0.468
regex 0.018 0.018 0.019 0.023 0.041 0.131

pattern '(?:[a-z][a-z0-9]*,)+', string '-'*n+'a5,b7,c9,'+'-'*n
sre8  0.022 0.025 0.067 0.302 1.011 8.245
sre16 0.021 0.026 0.066 0.302 1.103 5.372
pre   0.262 0.397 0.178 0.193 0.250 0.817
regex skip

pattern '.*P.*y.*t.*h.*o.*n.*', string '-'*n+'Python'+'-'*n
sre8  0.021 0.084 0.118 0.251 0.965 5.414
sre16 0.021 0.025 0.063 0.366 1.192 4.639
pre   0.123 0.147 0.225 0.568 1.899 9.336
regex 0.028 0.060 0.258 1.269 5.497 28.334

--------------------------------------------------------------------