[Python-Dev] New regex module for 3.2?

Georg Brandl g.brandl at gmx.net
Fri Jul 23 15:44:10 CEST 2010


On 23.07.2010 12:20, Nick Coghlan wrote:
> On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic <hrvoje.niksic at avl.com> wrote:
>> The performance trade-off should make regex slower with a sufficiently
>> small compiled regex cache, when a lot of time is wasted on compilation.  But as
>> the cache gets larger (and, for fairness, of the same size in both
>> implementations), regex should outperform re.  Georg, would you care to
>> measure if there is a difference in performance with an even larger cache?
> 
> That's not quite accurate. The figure that matters is "matches per
> compilation", and that is ultimately governed by the nature of the
> application once the cache is sufficiently large (e.g. an application
> that compiles 10 different regular expressions, but also only matches
> each one against a single string is going to be slowed down
> significantly by a switch to regex no matter how big the compilation
> cache may be).
> 
> The total runtime for a particular re matching exercise is "(average
> compilation time)*(number of times compiled) + (average match
> time)*(number of matches performed)"
> 
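In plain Python, that cost model is just this (a sketch; the names are
illustrative, not part of re or regex):

    def total_runtime(avg_compile, n_compiles, avg_match, n_matches):
        # Time spent compiling plus time spent matching, as above.
        return avg_compile * n_compiles + avg_match * n_matches
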
> We know that the new regex module speeds up matching at the cost of
> slowing down compilation. Whether that is a net increase for the
> application as a whole depends on 4 things:
> 
> X: the average speed change in compilation for the application
> Y: the average speed change in matching for the application
> A: the proportion of time the application currently spends compiling
> regular expressions
> B: the proportion of time the application currently spends matching
> regular expressions
> 
> The overall proportional speed change for re operations in a given
> application is then just X*A + Y*B. For regex to be an improvement for
> that application, the resulting figure needs to be less than 1.
> 
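That decision rule, again as a sketch with illustrative names:

    def speed_change(x_compile, y_match, a_compile_share, b_match_share):
        # Overall proportional change in time spent on re operations.
        # A result below 1.0 means switching to regex is a net win.
        assert abs(a_compile_share + b_match_share - 1.0) < 1e-9
        return x_compile * a_compile_share + y_match * b_match_share
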
> For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on
> average, but compilation also takes twice as long), then the overall
> speed change would be 2*A + 0.5*B. For that to be a wash, the original
> application needs to spend 1/3 of its total regular expression time
> compiling the expressions and 2/3 actually matching them (the
> application with the new engine would then spend 2/3 of its time
> compiling and only 1/3 matching).
> 
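Plugging those numbers into the speed_change() sketch above confirms the
break-even point:

    print(speed_change(2.0, 0.5, 1/3, 2/3))  # 1.0  -> a wash
    print(speed_change(2.0, 0.5, 0.5, 0.5))  # 1.25 -> net slowdown
    print(speed_change(2.0, 0.5, 0.1, 0.9))  # 0.65 -> net speedup
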
> That's going to be a tough assessment to make in practice, but I think
> Georg's pygments test suite example shows that there is real world
> code out there that already spends a lot of time on compilation.

And for Pygments, that is not only the case for the test suite: many uses
consist of highlighting a single file or piece of code by invoking Pygments
as a command-line program, so every run pays the full compilation cost.

When using Pygments as a library, regexes are only compiled once per lexer
class, so e.g. in a web application that highlights files (like a repository
browser), the additional cost is negligible.
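The pattern is roughly this (a minimal sketch of the compile-once idea,
not Pygments' actual machinery):

    import re

    class DummyLexer:
        # Compiled once, when the class body is executed; all instances
        # share it, so each request only pays the matching cost.
        _token_re = re.compile(r'\w+|\s+|.')

        def tokenize(self, text):
            return self._token_re.findall(text)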

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.


