On 06.03.16 11:30, Maciej Fijalkowski wrote:
> this is really difficult to read, can you tell me which column am I looking at?
The first column is the searched pattern. The second column is the
number of found matches (as a sanity check, it should be the same for
all engines and versions). The third column, under the "re" header, is
the time in milliseconds. The column under the "str.find" header is the
time of searching without regular expressions.
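For reference, one row of such a table can be reproduced with a few lines of Python. This is only an illustrative sketch; the text and pattern below are placeholders, not the actual 20MB corpus or the benchmark's pattern set:

```python
# Sketch of producing one table row: pattern, match count, re time, str.find time.
import re
import timeit

text = "some large text " * 10000   # stand-in for the real 20 MB text
pattern = "large"                   # an example plain-string pattern

rx = re.compile(re.escape(pattern))
n_matches = len(rx.findall(text))   # second column: number of matches

# Third column: search time with re; fourth column: plain str.find.
t_re = timeit.timeit(lambda: rx.search(text), number=10)
t_find = timeit.timeit(lambda: text.find(pattern), number=10)

print(pattern, n_matches, t_re * 1000, t_find * 1000)
```

Since `re.search` and `str.find` both stop at the first occurrence, the match count serves only as the control column; the timings are what differ between engines.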
PyPy 2.2 is usually significantly faster than CPython 2.7, except when
searching for a plain string with a regular expression. But thanks to
the Flexible String Representation, searching for a plain string, with
or without a regular expression, is faster on CPython 3.6.
Members of this list who are in the UK during April may be interested in
this free workshop in central London. If you have any questions please
feel free to email me directly.
Best Practices in Software Benchmarking 2016 (#bench16)
Wednesday April 20 2016
King's College London
For computer scientists and software engineers, benchmarking (evaluating the
running time of a piece of software, or the performance of a piece of hardware)
is a common method for evaluating new techniques. However, there is little
agreement on how benchmarking should be carried out, how to control for
confounding variables, how to analyse latency data, or how to aid the
repeatability of experiments. This free workshop will be a venue for computer
scientists and research software engineers to discuss their current best
practices and future directions.
For further information and free registration please visit:
Jan Vitek (Northeastern University)
Joe Parker (The Jodrell Laboratory, Royal Botanic Gardens)
Simon Taylor (University of Lancaster)
Tomas Kalibera (Northeastern University)
James Davenport (University of Bath)
Edd Barrett (King's College London)
Jeremy Bennett (Embecosm)
Sarah Mount & Laurence Tratt (King's College London)
On Sun, 13 Mar 2016 17:44:10 +0000
Brett Cannon <brett(a)python.org> wrote:
> > 2. One iteration of all searches on full text takes 29 seconds on my
> > computer. Isn't this too long? In any case I want first optimize some
> > bottlenecks in the re module.
> I don't think we have established a "too long" time. We do have some
> benchmarks like spectral_norm that don't run unless you use rigorous mode
> and this could be one of them.
> > 3. Do we need one benchmark that gives an accumulated time of all
> > searches, or separate microbenchmarks for every pattern?
> I don't care either way. Obviously it depends on whether you want to
> measure overall re perf and have people aim to improve that or let people
> target specific workload types.
This is a more general latent issue with our current benchmarking
philosophy. We have built something which aims to be a general-purpose
benchmark suite, but in some domains a more comprehensive set of
benchmarks may be desirable. Obviously we don't want to have 10 JSON
benchmarks, 10 re benchmarks, 10 I/O benchmarks, etc. in the default
benchmark run, so what do we do for such cases? Do we tell people
domain-specific benchmarks should be developed independently? Do we
include some facilities to create such subsuites without them being
part of the default bunch?
(Note that a couple of domain-specific benchmarks -- iobench,
stringbench, etc. -- are currently maintained separately.)
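One possible shape for such subsuites, sketched here as a hypothetical group table; the group and benchmark names are invented for illustration, and perf.py's actual selection mechanism may differ:

```python
# Hypothetical grouping of benchmarks into named suites, so that
# domain-specific suites exist without bloating the default run.
BENCH_GROUPS = {
    "default": ["json_load", "re_compile", "io_seek"],
    "re": ["re_compile", "re_plain", "re_alternation", "re_backref"],
}

def select_benchmarks(requested):
    """Expand group names (and plain benchmark names) into a flat list."""
    selected = []
    for name in requested:
        # A name that is not a known group is treated as a single benchmark.
        selected.extend(BENCH_GROUPS.get(name, [name]))
    # Preserve order, drop duplicates (dict preserves insertion order).
    return list(dict.fromkeys(selected))

print(select_benchmarks(["default", "re"]))
```

Under this scheme the default run stays small, while `--benchmarks re` (or similar) would pull in the full domain-specific set.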
On 07.03.16 19:19, Brett Cannon wrote:
> Are you thinking about turning all of this into a benchmark for the
> benchmark suite?
That was my purpose. I first wrote a benchmark for the benchmark
suite, then became interested in more detailed results and in a
comparison with alternative engines.
There are several questions about a benchmark for the benchmark suite.
1. The input data is a public 20 MB text (8 MB in a ZIP file). Should we
download it every time (maybe with caching) or add it to the repository?
2. One iteration of all searches on the full text takes 29 seconds on my
computer. Isn't this too long? In any case I want to first optimize some
bottlenecks in the re module.
3. Do we need one benchmark that gives an accumulated time of all
searches, or separate microbenchmarks for every pattern?
4. It would be nice to use the same benchmark for comparing different
regular expression engines. This requires changing perf.py. Maybe we
could use the same interface to compare ElementTree with lxml and json with
5. The patterns are ASCII-only and the text is mostly ASCII. It would be
nice to add non-ASCII patterns and non-ASCII text. But this will
increase the run time.
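On question 3, the two options need not be exclusive: the harness could time each pattern separately and also report the accumulated total. A rough sketch of that approach, using placeholder patterns and text rather than the real benchmark inputs:

```python
# Time each pattern individually and also report the accumulated total.
import re
import time

TEXT = "Tom Sawyer and Huckleberry Finn " * 50000
PATTERNS = ["Tom|Sawyer|Huckleberry", "Tom.{0,30}Sawyer", "[a-q][^u-z]{13}x"]

def bench(patterns, text):
    results = {}
    for p in patterns:
        rx = re.compile(p)
        start = time.perf_counter()
        count = sum(1 for _ in rx.finditer(text))  # count all matches
        results[p] = (count, time.perf_counter() - start)
    total = sum(t for _, t in results.values())
    return results, total

results, total = bench(PATTERNS, TEXT)
for p, (count, t) in results.items():
    print("%-30s %8d %8.3f ms" % (p, count, t * 1000))
print("total: %.3f ms" % (total * 1000))
```

Reporting both views would let people track overall re performance while still seeing which workload types regress or improve.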