Are you thinking about turning all of this into a benchmark for the benchmark suite?

On Sat, 5 Mar 2016 at 11:15 Serhiy Storchaka <storchaka@gmail.com> wrote:
I have wrote a benchmark for comparing different regular expression engines available in Python. It uses tests and data from [1], that were itself inspired by Boost's benchmark [2].

Tested engines are:

* re, standard regular expression module
* regex, alternative regular expression module [3]
* re2, Python wrapper for Google's RE2 [4]
* pcre, Python PCRE bindings [5]

Running tests for all 20MB text file takes too long time, here are results (time in millisecons) for 2MB chunk (6000000:8000000):

                                               re  regex    re2   pcre str.find

Twain                                   5   2.866  2.118  12.47  3.911   2.72
(?i)Twain                              10   84.42  4.366  24.76  17.12
[a-z]shing                            165     125  5.466  27.78  180.6
Huck[a-zA-Z]+|Saw[a-zA-Z]+             52   57.11  72.16  23.87    234
\b\w+nn\b                              32   239.5  427.6  23.18  251.9
[a-q][^u-z]{13}x                      445   381.8  5.537   5843  224.9
Tom|Sawyer|Huckleberry|Finn           314   52.73  58.45  24.39  422.5
(?i)Tom|Sawyer|Huckleberry|Finn       477   445.6  522.1  27.73  415.4
.{0,2}(Tom|Sawyer|Huckleberry|Finn)   314   451.2   1113  24.38   1497
.{2,4}(Tom|Sawyer|Huckleberry|Finn)   237   450.1   1000   24.3   1549
Tom.{10,25}river|river.{10,25}Tom       1   61.55  58.11  24.97  233.8
[a-zA-Z]+ing                        10079   189.4  350.3  47.41  357.6
\s[a-zA-Z]{0,12}ing\s                7160   115.7  23.65  37.74  237.6
([A-Za-z]awyer|[A-Za-z]inn)\s          50   153.7  430.4  27.86  425.3
["'][^"']{0,30}[?!\.]["']            1618   83.12  77.39  26.96  157.6

There is no absolute leader. All engines have its weak spots. For re these are case-insensitive search and search a pattern that starts with a set.

pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000) results are 1-2 orders slower:

                                               re  regex    re2   pcre str.find

Twain                                  33   2.671  2.209   16.6  413.6   2.75
(?i)Twain                              35   90.21   4.36  27.65  459.4
[a-z]shing                            120   112.7  2.667  30.94   1895
Huck[a-zA-Z]+|Saw[a-zA-Z]+             61   57.12   49.9  26.76   1152
\b\w+nn\b                              33     238  401.4  26.93  763.7
[a-q][^u-z]{13}x                      481   387.7  5.694   5915   6979
Tom|Sawyer|Huckleberry|Finn           845   52.89  59.61  28.42 1.228e+04
(?i)Tom|Sawyer|Huckleberry|Finn       988   452.3  523.4  32.15 1.426e+04
.{0,2}(Tom|Sawyer|Huckleberry|Finn)   845   421.1   1105  29.01 1.343e+04
.{2,4}(Tom|Sawyer|Huckleberry|Finn)   625   398.6  985.6  29.19   9878
Tom.{10,25}river|river.{10,25}Tom       1    61.6  58.33  26.59  254.1
[a-zA-Z]+ing                        10109   194.5  349.7  50.85 1.445e+05
\s[a-zA-Z]{0,12}ing\s                7286   120.1  23.73  42.04 1.051e+05
([A-Za-z]awyer|[A-Za-z]inn)\s          43   170.6  402.9  30.84   1119
["'][^"']{0,30}[?!\.]["']            1686    86.5  110.2  30.62 2.369e+04

[1] http://sljit.sourceforge.net/regex_perf.html
[2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html
[3] https://pypi.python.org/pypi/regex/2016.03.02
[4] https://pypi.python.org/pypi/re2/0.2.22
[5] https://pypi.python.org/pypi/python-pcre/0.7
_______________________________________________
Speed mailing list
Speed@python.org
https://mail.python.org/mailman/listinfo/speed