[Python-ideas] re.compile_lazy - on first use compiled regexes

Sat Mar 23 16:30:39 CET 2013

On 2013-03-23, at 15:46 , Antoine Pitrou wrote:

> On Sat, 23 Mar 2013 15:35:18 +0100
> Masklinn <masklinn at masklinn.net> wrote:
>> 
>> On 2013-03-23, at 14:34 , Antoine Pitrou wrote:
>> 
>>> On Sat, 23 Mar 2013 14:26:30 +0100
>>> Masklinn <masklinn at masklinn.net> wrote:
>>>> 
>>>> Wouldn't it be better if there are *few* different regexes? Since the
>>>> module itself caches 512 expressions (100 in Python 2) and does not use
>>>> an LRU or other "smart" cache (it just clears the whole cache dict once
>>>> the limit is breached as far as I can see), *and* any explicit call to
>>>> re.compile will *still* use the internal cache (meaning even going
>>>> through re.compile will count against the _MAXCACHE limit), all regex
>>>> uses throughout the application (including standard library &al) will
>>>> count against the built-in cache and increase the chance of the regex
>>>> we want cached to be thrown out no?
>>> 
>>> Well, it mostly sounds like the re cache should be made a bit smarter.
>> 
>> It should, but even with that I think it makes sense to explicitly cache
>> regexps in the application, the re cache feels like an optimization more
>> than semantics.
> 
> Well, of course it is. A cache *is* an optimization.
> 
>> Either that, or the re module should provide an instantiable cache object
>> for lazy compilation and caching of regexps e.g.
>> re.local_cache(maxsize=None) which would return an lru-caching proxy to
>> re. Thus the caching of a module's regexps would be under the control of
>> the module using them if desired (and important)
> 
> IMO that's the wrong way to think about it. The whole point of a cache
> is that the higher levels don't have to think about it. Your CPU has
> L1, L2 and sometimes L3 caches so that you don't have to allocate your
> critical data structures in separate "faster" memory areas.
> 
> That said, if you really want to manage your own cache, it should
> already be easy to do so using functools.lru_cache() (or any
> implementation of your choice). The re module doesn't have to provide a
> dedicated caching primitive.
> 
> But, really, the point of a cache is to optimize performance *without*
> you tinkering with it.

Right, but in this case, while I called it a cache, the semantics are
really those of a lazy singleton: only create the regex object when it's
needed, but keep it around once it's been created.
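
To make the "lazy singleton" idea concrete, here is a rough sketch of
what I have in mind. The LazyPattern name and its attributes are made up
for illustration, nothing like it exists in the re module; it only
relies on re.compile itself.

```python
import re


class LazyPattern:
    """Hypothetical helper: compile the regex on first use, then keep it."""

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None  # nothing is compiled at construction time

    def _get(self):
        # Compile once; every later call reuses the same pattern object.
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return self._compiled

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself, so
        # match(), search(), sub(), etc. are forwarded to the real pattern.
        return getattr(self._get(), name)
```

A module would then define NUMBER = LazyPattern(r"\d+") at import time
and pay the compilation cost only on the first NUMBER.search(...) call.
Because the module holds the reference itself, nothing else in the
application can evict it.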

The issue with a "proper cache" is that it works via heuristics and may
or may not actually improve performance, as the heuristics will never
give the ideal behavior for all possible programs. If it's known that we
want to keep the compiled regexps used by a module memoized (which is
what the current urllib.parse code does/assumes), cache semantics don't
really work: depending on the rest of the application, the cached module
regexps may get evicted, unless the cache has an unlimited size.