On 23.07.2010 12:20, Nick Coghlan wrote:
On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic <hrvoje.niksic@avl.com> wrote:
The performance trade-off should make regex slower with a sufficiently small compiled regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure whether there is a difference in performance with an even larger cache?
That's not quite accurate. The figure that matters is "matches per compilation", and that is ultimately governed by the nature of the application once the cache is sufficiently large (e.g. an application that compiles 10 different regular expressions, but also only matches each one against a single string is going to be slowed down significantly by a switch to regex no matter how big the compilation cache may be).
The total runtime for a particular re matching exercise is "(average compilation time)*(number of times compiled) + (average match time)*(number of matches performed)"
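That formula can be sketched directly. This is a rough, hypothetical benchmark (the pattern and text are made up for illustration); `re.purge()` is called before each timed compile so we measure a real compilation rather than a hit on re's internal pattern cache:

```python
import re
import timeit

# Hypothetical pattern/text pair, purely for illustration.
PATTERN = r"(\d{4})-(\d{2})-(\d{2})"
TEXT = "released on 2010-07-23, updated 2010-08-01"

def compile_uncached():
    re.purge()                  # clear re's internal cache first,
    return re.compile(PATTERN)  # so this times an actual compilation

N = 1000
avg_compile = timeit.timeit(compile_uncached, number=N) / N

compiled = re.compile(PATTERN)
avg_match = timeit.timeit(lambda: compiled.search(TEXT), number=N) / N

def total_runtime(compilations, matches):
    """(avg compile time)*(compilations) + (avg match time)*(matches)."""
    return avg_compile * compilations + avg_match * matches
```

Plugging in an application's actual counts of compilations and matches then gives the figure that the cache size only indirectly influences.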
We know that the new regex module speeds up matching at the cost of slowing down compilation. Whether that is a net improvement for the application as a whole depends on four things:
X: the average speed change in compilation for the application
Y: the average speed change in matching for the application
A: the proportion of time the application currently spends compiling regular expressions
B: the proportion of time the application currently spends matching regular expressions
The overall proportional speed change for re operations in a given application is then just X*A + Y*B. For regex to be an improvement for that application, the resulting figure needs to be less than 1.
For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on average, but compilation also takes twice as long), then the overall speed change would be 2*A + 0.5*B. For that to be a wash, the original application needs to spend 1/3 of its expression matching time compiling the expressions and 2/3 actually matching them (the application with the new engine would then spend 2/3 of its time compiling and only 1/3 matching).
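The arithmetic in that example can be checked mechanically. This is my own sketch of the model above, not code from either engine:

```python
# X, Y: per-operation speed multipliers for compilation and matching;
# A: proportion of current re time spent compiling (B = 1 - A matching).
def overall_change(x, y, a):
    b = 1.0 - a
    return x * a + y * b

X, Y = 2.0, 0.5   # compilation twice as slow, matching twice as fast
A = 1.0 / 3.0     # break-even proportion of time spent compiling

# At A = 1/3 the switch is a wash (overall change == 1) ...
assert abs(overall_change(X, Y, A) - 1.0) < 1e-9

# ... and of the (unchanged) total, the new engine spends X*A = 2/3
# of its time compiling and Y*B = 1/3 matching.
assert abs(X * A - 2.0 / 3.0) < 1e-9
assert abs(Y * (1.0 - A) - 1.0 / 3.0) < 1e-9
```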
That's going to be a tough assessment to make in practice, but I think Georg's pygments test suite example shows that there is real world code out there that already spends a lot of time on compilation.
And for Pygments, this is not only the case for the test suite: many uses consist of highlighting a single file or piece of code by invoking it as a command-line program. When using Pygments as a library, regexes are compiled only once per lexer class, so e.g. in a web application that highlights files (like a repository browser), the additional cost is negligible.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.