Performance comparison of regular expression engines
I have wrote a benchmark for comparing different regular expression engines available in Python. It uses tests and data from [1], that were itself inspired by Boost's benchmark [2].
Tested engines are:
- re, standard regular expression module
- regex, alternative regular expression module [3]
- re2, Python wrapper for Google's RE2 [4]
- pcre, Python PCRE bindings [5]
Running tests for all 20MB text file takes too long time, here are results (time in millisecons) for 2MB chunk (6000000:8000000):
re regex re2 pcre str.find
Twain 5 2.866 2.118 12.47 3.911 2.72 (?i)Twain 10 84.42 4.366 24.76 17.12 [a-z]shing 165 125 5.466 27.78 180.6 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 52 57.11 72.16 23.87 234 \b\w+nn\b 32 239.5 427.6 23.18 251.9 [a-q][^u-z]{13}x 445 381.8 5.537 5843 224.9 Tom|Sawyer|Huckleberry|Finn 314 52.73 58.45 24.39 422.5 (?i)Tom|Sawyer|Huckleberry|Finn 477 445.6 522.1 27.73 415.4 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 314 451.2 1113 24.38 1497 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 237 450.1 1000 24.3 1549 Tom.{10,25}river|river.{10,25}Tom 1 61.55 58.11 24.97 233.8 [a-zA-Z]+ing 10079 189.4 350.3 47.41 357.6 \s[a-zA-Z]{0,12}ing\s 7160 115.7 23.65 37.74 237.6 ([A-Za-z]awyer|[A-Za-z]inn)\s 50 153.7 430.4 27.86 425.3 ["'][^"']{0,30}[?!\.]["'] 1618 83.12 77.39 26.96 157.6
There is no absolute leader. All engines have its weak spots. For re these are case-insensitive search and search a pattern that starts with a set.
pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000) results are 1-2 orders slower:
re regex re2 pcre str.find
Twain 33 2.671 2.209 16.6 413.6 2.75 (?i)Twain 35 90.21 4.36 27.65 459.4 [a-z]shing 120 112.7 2.667 30.94 1895 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 61 57.12 49.9 26.76 1152 \b\w+nn\b 33 238 401.4 26.93 763.7 [a-q][^u-z]{13}x 481 387.7 5.694 5915 6979 Tom|Sawyer|Huckleberry|Finn 845 52.89 59.61 28.42 1.228e+04 (?i)Tom|Sawyer|Huckleberry|Finn 988 452.3 523.4 32.15 1.426e+04 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 845 421.1 1105 29.01 1.343e+04 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 625 398.6 985.6 29.19 9878 Tom.{10,25}river|river.{10,25}Tom 1 61.6 58.33 26.59 254.1 [a-zA-Z]+ing 10109 194.5 349.7 50.85 1.445e+05 \s[a-zA-Z]{0,12}ing\s 7286 120.1 23.73 42.04 1.051e+05 ([A-Za-z]awyer|[A-Za-z]inn)\s 43 170.6 402.9 30.84 1119 ["'][^"']{0,30}[?!\.]["'] 1686 86.5 110.2 30.62 2.369e+04
[1] http://sljit.sourceforge.net/regex_perf.html [2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html [3] https://pypi.python.org/pypi/regex/2016.03.02 [4] https://pypi.python.org/pypi/re2/0.2.22 [5] https://pypi.python.org/pypi/python-pcre/0.7
Hi serhiy
Any chance you can rerun this on pypy?
On Sat, Mar 5, 2016 at 8:35 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
I have wrote a benchmark for comparing different regular expression engines available in Python. It uses tests and data from [1], that were itself inspired by Boost's benchmark [2].
Tested engines are:
- re, standard regular expression module
- regex, alternative regular expression module [3]
- re2, Python wrapper for Google's RE2 [4]
- pcre, Python PCRE bindings [5]
Running tests for all 20MB text file takes too long time, here are results (time in millisecons) for 2MB chunk (6000000:8000000):
re regex re2 pcre str.find
Twain 5 2.866 2.118 12.47 3.911 2.72 (?i)Twain 10 84.42 4.366 24.76 17.12 [a-z]shing 165 125 5.466 27.78 180.6 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 52 57.11 72.16 23.87 234 \b\w+nn\b 32 239.5 427.6 23.18 251.9 [a-q][^u-z]{13}x 445 381.8 5.537 5843 224.9 Tom|Sawyer|Huckleberry|Finn 314 52.73 58.45 24.39 422.5 (?i)Tom|Sawyer|Huckleberry|Finn 477 445.6 522.1 27.73 415.4 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 314 451.2 1113 24.38 1497 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 237 450.1 1000 24.3 1549 Tom.{10,25}river|river.{10,25}Tom 1 61.55 58.11 24.97 233.8 [a-zA-Z]+ing 10079 189.4 350.3 47.41 357.6 \s[a-zA-Z]{0,12}ing\s 7160 115.7 23.65 37.74 237.6 ([A-Za-z]awyer|[A-Za-z]inn)\s 50 153.7 430.4 27.86 425.3 ["'][^"']{0,30}[?!\.]["'] 1618 83.12 77.39 26.96 157.6
There is no absolute leader. All engines have its weak spots. For re these are case-insensitive search and search a pattern that starts with a set.
pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000) results are 1-2 orders slower:
re regex re2 pcre str.find
Twain 33 2.671 2.209 16.6 413.6 2.75 (?i)Twain 35 90.21 4.36 27.65 459.4 [a-z]shing 120 112.7 2.667 30.94 1895 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 61 57.12 49.9 26.76 1152 \b\w+nn\b 33 238 401.4 26.93 763.7 [a-q][^u-z]{13}x 481 387.7 5.694 5915 6979 Tom|Sawyer|Huckleberry|Finn 845 52.89 59.61 28.42 1.228e+04 (?i)Tom|Sawyer|Huckleberry|Finn 988 452.3 523.4 32.15 1.426e+04 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 845 421.1 1105 29.01 1.343e+04 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 625 398.6 985.6 29.19 9878 Tom.{10,25}river|river.{10,25}Tom 1 61.6 58.33 26.59 254.1 [a-zA-Z]+ing 10109 194.5 349.7 50.85 1.445e+05 \s[a-zA-Z]{0,12}ing\s 7286 120.1 23.73 42.04 1.051e+05 ([A-Za-z]awyer|[A-Za-z]inn)\s 43 170.6 402.9 30.84 1119 ["'][^"']{0,30}[?!\.]["'] 1686 86.5 110.2 30.62 2.369e+04
[1] http://sljit.sourceforge.net/regex_perf.html [2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html [3] https://pypi.python.org/pypi/regex/2016.03.02 [4] https://pypi.python.org/pypi/re2/0.2.22 [5] https://pypi.python.org/pypi/python-pcre/0.7
Speed mailing list Speed@python.org https://mail.python.org/mailman/listinfo/speed
Are you thinking about turning all of this into a benchmark for the benchmark suite?
On Sat, 5 Mar 2016 at 11:15 Serhiy Storchaka <storchaka@gmail.com> wrote:
I have wrote a benchmark for comparing different regular expression engines available in Python. It uses tests and data from [1], that were itself inspired by Boost's benchmark [2].
Tested engines are:
- re, standard regular expression module
- regex, alternative regular expression module [3]
- re2, Python wrapper for Google's RE2 [4]
- pcre, Python PCRE bindings [5]
Running tests for all 20MB text file takes too long time, here are results (time in millisecons) for 2MB chunk (6000000:8000000):
re regex re2 pcre
str.find
Twain 5 2.866 2.118 12.47 3.911 2.72 (?i)Twain 10 84.42 4.366 24.76 17.12 [a-z]shing 165 125 5.466 27.78 180.6 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 52 57.11 72.16 23.87 234 \b\w+nn\b 32 239.5 427.6 23.18 251.9 [a-q][^u-z]{13}x 445 381.8 5.537 5843 224.9 Tom|Sawyer|Huckleberry|Finn 314 52.73 58.45 24.39 422.5 (?i)Tom|Sawyer|Huckleberry|Finn 477 445.6 522.1 27.73 415.4 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 314 451.2 1113 24.38 1497 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 237 450.1 1000 24.3 1549 Tom.{10,25}river|river.{10,25}Tom 1 61.55 58.11 24.97 233.8 [a-zA-Z]+ing 10079 189.4 350.3 47.41 357.6 \s[a-zA-Z]{0,12}ing\s 7160 115.7 23.65 37.74 237.6 ([A-Za-z]awyer|[A-Za-z]inn)\s 50 153.7 430.4 27.86 425.3 ["'][^"']{0,30}[?!\.]["'] 1618 83.12 77.39 26.96 157.6
There is no absolute leader. All engines have its weak spots. For re these are case-insensitive search and search a pattern that starts with a set.
pcre is very data-sensitive. For other 2Mb chunk (8000000:10000000) results are 1-2 orders slower:
re regex re2 pcre
str.find
Twain 33 2.671 2.209 16.6 413.6 2.75 (?i)Twain 35 90.21 4.36 27.65 459.4 [a-z]shing 120 112.7 2.667 30.94 1895 Huck[a-zA-Z]+|Saw[a-zA-Z]+ 61 57.12 49.9 26.76 1152 \b\w+nn\b 33 238 401.4 26.93 763.7 [a-q][^u-z]{13}x 481 387.7 5.694 5915 6979 Tom|Sawyer|Huckleberry|Finn 845 52.89 59.61 28.42 1.228e+04 (?i)Tom|Sawyer|Huckleberry|Finn 988 452.3 523.4 32.15 1.426e+04 .{0,2}(Tom|Sawyer|Huckleberry|Finn) 845 421.1 1105 29.01 1.343e+04 .{2,4}(Tom|Sawyer|Huckleberry|Finn) 625 398.6 985.6 29.19 9878 Tom.{10,25}river|river.{10,25}Tom 1 61.6 58.33 26.59 254.1 [a-zA-Z]+ing 10109 194.5 349.7 50.85 1.445e+05 \s[a-zA-Z]{0,12}ing\s 7286 120.1 23.73 42.04 1.051e+05 ([A-Za-z]awyer|[A-Za-z]inn)\s 43 170.6 402.9 30.84 1119 ["'][^"']{0,30}[?!\.]["'] 1686 86.5 110.2 30.62 2.369e+04
[1] http://sljit.sourceforge.net/regex_perf.html [2] http://www.boost.org/doc/libs/1_36_0/libs/regex/doc/vc71-performance.html [3] https://pypi.python.org/pypi/regex/2016.03.02 [4] https://pypi.python.org/pypi/re2/0.2.22 [5] https://pypi.python.org/pypi/python-pcre/0.7
Speed mailing list Speed@python.org https://mail.python.org/mailman/listinfo/speed
participants (3)
-
Brett Cannon
-
Maciej Fijalkowski
-
Serhiy Storchaka