[Speed] Performance comparison of regular expression engines
Serhiy Storchaka
storchaka at gmail.com
Sat Mar 5 13:35:22 EST 2016
I have wrote a benchmark for comparing different regular expression engines available in Python. It uses tests and data from [1], that were itself inspired by Boost's benchmark [2].
Tested engines are:
* re, standard regular expression module
* regex, alternative regular expression module [3]
* re2, Python wrapper for Google's RE2 [4]
* pcre, Python PCRE bindings [5]
Running tests for all 20MB text file takes too long time, here are results (time in millisecons) for 2MB chunk (6000000:8000000):
re regex re2 pcre str.find
Twain 5 2.866 2.118 12.47 3.911 2.72
(?i)Twain 10 84.42 4.366 24.76 17.12
[a-z]shing 165 125 5.466 27.78 180.6
Huck[a-zA-Z]+|Saw[a-zA-Z]+ 52 57.11 72.16 23.87 234
\b\w+nn\b 32 239.5 427.6 23.18 251.9
[a-q][^u-z]{13}x 445 381.8 5.537 5843 224.9
Tom|Sawyer|Huckleberry|Finn 314 52.73 58.45 24.39 422.5
(?i)Tom|Sawyer|Huckleberry|Finn 477 445.6 522.1 27.73 415.4
.{0,2}(Tom|Sawyer|Huckleberry|Finn) 314 451.2 1113 24.38 1497
.{2,4}(Tom|Sawyer|Huckleberry|Finn) 237 450.1 1000 24.3 1549
Tom.{10,25}river|river.{10,25}Tom 1 61.55 58.11 24.97 233.8
[a-zA-Z]+ing 10079 189.4 350.3 47.41 357.6
\s[a-zA-Z]{0,12}ing\s 7160 115.7 23.65 37.74 237.6
([A-Za-z]awyer|[A-Za-z]inn)\s 50 153.7 430.4 27.86 425.3
["'][^"']{0,30}[?!\.]["'] 1618 83.12 77.39 26.96 157.6
There is no absolute leader. All engines have its weak spots. For re these are case-insensitive search and search a pattern that starts with a set.
pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000) results are 1-2 orders slower:
re regex re2 pcre str.find
Twain 33 2.671 2.209 16.6 413.6 2.75
(?i)Twain 35 90.21 4.36 27.65 459.4
[a-z]shing 120 112.7 2.667 30.94 1895
Huck[a-zA-Z]+|Saw[a-zA-Z]+ 61 57.12 49.9 26.76 1152
\b\w+nn\b 33 238 401.4 26.93 763.7
[a-q][^u-z]{13}x 481 387.7 5.694 5915 6979
Tom|Sawyer|Huckleberry|Finn 845 52.89 59.61 28.42 1.228e+04
(?i)Tom|Sawyer|Huckleberry|Finn 988 452.3 523.4 32.15 1.426e+04
.{0,2}(Tom|Sawyer|Huckleberry|Finn) 845 421.1 1105 29.01 1.343e+04
.{2,4}(Tom|Sawyer|Huckleberry|Finn) 625 398.6 985.6 29.19 9878
Tom.{10,25}river|river.{10,25}Tom 1 61.6 58.33 26.59 254.1
[a-zA-Z]+ing 10109 194.5 349.7 50.85 1.445e+05
\s[a-zA-Z]{0,12}ing\s 7286 120.1 23.73 42.04 1.051e+05
([A-Za-z]awyer|[A-Za-z]inn)\s 43 170.6 402.9 30.84 1119
["'][^"']{0,30}[?!\.]["'] 1686 86.5 110.2 30.62 2.369e+04
[1] http://sljit.sourceforge.net/regex_perf.html
[2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html
[3] https://pypi.python.org/pypi/regex/2016.03.02
[4] https://pypi.python.org/pypi/re2/0.2.22
[5] https://pypi.python.org/pypi/python-pcre/0.7
-------------- next part --------------
A non-text attachment was scrubbed...
Name: regex_bench.py
Type: text/x-python
Size: 3405 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/speed/attachments/20160305/a89e2435/attachment.py>
More information about the Speed
mailing list