[Python-ideas] re.compile_lazy - on first use compiled regexes

Stefan Behnel stefan_ml at behnel.de
Sat Mar 23 20:02:41 CET 2013


Masklinn, 23.03.2013 14:26:
> On 2013-03-23, at 03:00 , Nick Coghlan wrote:
>> On Fri, Mar 22, 2013 at 3:42 PM, Gregory P. Smith wrote:
>>> On Fri, Mar 22, 2013 at 3:31 PM, Ronny Pfannschmidt wrote:
>>>> While reviewing urllib.parse I noticed a pretty ugly pattern:
>>>> many functions had an attached global, and in their own code they would
>>>> compile a regex on first use and assign it to that global.
>>>>
>>>> It's clear that compiling a regex is expensive, so having them compiled
>>>> later, at first use, would be of some benefit.
>>>
>>> It isn't expensive to do, it is expensive to do repeatedly for no reason.
>>> Thus the use of compiled regexes.  Code like this would be better off
>>> refactored to reference a precompiled global rather than conditionally check
>>> if it needs compiling every time it is called.
>>
>> Alternatively, if there are a lot of different regexes, it may be
>> better to rely on the implicit cache inside the re module.
> 
> Wouldn't it be better if there are *few* different regexes? Since the
> module itself caches 512 expressions (100 in Python 2) and does not use
> an LRU or other "smart" cache (it just clears the whole cache dict once
> the limit is breached as far as I can see), *and* any explicit call to
> re.compile will *still* use the internal cache (meaning even going
> through re.compile will count against the _MAXCACHE limit), all regex
> uses throughout the application (including the standard library et al.)
> will count against the built-in cache and increase the chance of the
> regex we want cached being thrown out, no?
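
The claim that an explicit re.compile() call also goes through the module's internal cache can be checked with a small sketch. It assumes CPython's current behaviour, and the pattern string is only an illustration, not something from urllib.parse:

```python
import re

pattern = r"\d+-\d+"  # arbitrary example pattern

# Two explicit re.compile() calls with the same pattern and flags.
first = re.compile(pattern)
second = re.compile(pattern)

# In CPython the second call is answered from the re module's internal
# cache, so both names refer to the same compiled pattern object.
same_object = first is second

# re.purge() empties the internal cache; compiling again afterwards has
# to build a new pattern object.
re.purge()
third = re.compile(pattern)
rebuilt = third is not first

print(same_object, rebuilt)
```

So a regex that falls out of the cache is not lost for good. It is simply compiled once more the next time it is requested.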

Remember that any precompiled regex that gets thrown out of the cache will
be rebuilt as soon as it is used again. So the problem only ever arises when
you really have more than _MAXCACHE different regexes that are all being
used within the same loop, and even then, they'd have to be used in
(mostly) the same order to render the cache completely useless. That's a
very rare case, IMHO. In all other cases, whenever the number of different
regexes used within a loop is lower than _MAXCACHE, the cache will
immediately bring a substantial net win. And if a regex is not being used
in a loop, then it's really unlikely that its compilation time will
dominate the runtime of your application (assuming that your application
is doing more than just compiling regexes...).
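
To illustrate the argument above, here is a toy simulation of a cache that is emptied completely once it is full, as described earlier in the thread. It is only a sketch: simulate() and MAX_CACHE are made-up names, the limit is deliberately tiny, and it models the described policy rather than the actual re module code.

```python
def simulate(accesses, max_cache):
    """Count hits for a cache that is cleared entirely once it is full."""
    cache = set()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            continue
        if len(cache) >= max_cache:
            cache.clear()  # no LRU: drop everything at the limit
        cache.add(key)
    return hits


MAX_CACHE = 4  # toy stand-in for the real limit
ROUNDS = 10    # number of loop iterations

# Fewer distinct "regexes" than the limit: only the first pass misses.
few = list(range(3)) * ROUNDS

# One more distinct "regex" than the limit, always used in the same order.
many = list(range(MAX_CACHE + 1)) * ROUNDS

hits_few = simulate(few, MAX_CACHE)
hits_many = simulate(many, MAX_CACHE)

print(hits_few, "hits out of", len(few))    # 27 hits out of 30
print(hits_many, "hits out of", len(many))  # 0 hits out of 50
```

The second case is the pathological one: more distinct keys than the limit, cycled in a fixed order, so every lookup misses. Anything below the limit gets hits on every pass after the first.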

Stefan




