re.compile_lazy - compile regexes on first use

Hi, while reviewing urllib.parse I noticed a pretty ugly pattern: many functions had an attached global, and in their own code they would compile a regex on first use and assign it to that global. It's clear that compiling a regex is expensive, so having them be compiled later, at first use, is of some benefit. But instead of all that repetitive code there should be an alternative to re.compile that defers compilation until first use. -- Ronny

On Fri, Mar 22, 2013 at 3:31 PM, Ronny Pfannschmidt < Ronny.Pfannschmidt@gmx.de> wrote:
It isn't expensive to do, it is expensive to do repeatedly for no reason. Thus the use of compiled regexes. Code like this would be better off refactored to reference a precompiled global rather than conditionally check if it needs compiling every time it is called. -gps

On 2013-03-23, at 03:00 , Nick Coghlan wrote:
Wouldn't it only be better if there are *few* different regexes? The module itself caches 512 expressions (100 in Python 2) and does not use an LRU or other "smart" cache (it just clears the whole cache dict once the limit is breached, as far as I can see), *and* any explicit call to re.compile will *still* use the internal cache (meaning even going through re.compile counts against the _MAXCACHE limit). So all regex uses throughout the application (including the standard library et al.) count against the built-in cache and increase the chance of the regex we want cached being thrown out, no?

On Sat, Mar 23, 2013 at 3:34 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
See http://bugs.python.org/issue17441. Best Regards, Ezio Melotti
Regards, Antoine.

On 2013-03-23, at 14:34 , Antoine Pitrou wrote:
It should, but even with that I think it makes sense to explicitly cache regexps in the application; the re cache feels like an optimization more than semantics. Either that, or the re module should provide an instantiable cache object for lazy compilation and caching of regexps, e.g. re.local_cache(maxsize=None), which would return an lru-caching proxy to re. Thus the caching of a module's regexps would be under the control of the module using them, if desired (and important), and urllib.parse could fix its existing "ugly" pattern by using

    import re
    re = re.local_cache()

and removing the conditional compile calls (or even the compile calls, using re-level functions instead). Optionally, the cache could take e.g. an *args of regexps to precompile at module load/cache creation.
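A minimal sketch of what such a per-module caching proxy might look like (local_cache, the proxy class, and its method set are assumptions drawn from this proposal, not an existing re API):

    import re
    from functools import lru_cache

    class _LocalReCache:
        # Proxy that memoizes compiled patterns independently of re's global cache.

        def __init__(self, maxsize=None):
            # Each proxy gets its own LRU cache wrapped around re.compile.
            self._compile = lru_cache(maxsize=maxsize)(re.compile)

        def compile(self, pattern, flags=0):
            return self._compile(pattern, flags)

        def match(self, pattern, string, flags=0):
            return self._compile(pattern, flags).match(string)

        def search(self, pattern, string, flags=0):
            return self._compile(pattern, flags).search(string)

        def sub(self, pattern, repl, string, count=0, flags=0):
            return self._compile(pattern, flags).sub(repl, string, count)

    def local_cache(maxsize=None):
        return _LocalReCache(maxsize)

    # Hypothetical use inside urllib.parse, as sketched above:
    # re = local_cache()
    # match = re.match('^([^/:]+):', url)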

On Sat, 23 Mar 2013 15:35:18 +0100 Masklinn <masklinn@masklinn.net> wrote:
Well, of course it is. A cache *is* an optimization.
IMO that's the wrong way to think about it. The whole point of a cache is that the higher levels don't have to think about it. Your CPU has L1, L2 and sometimes L3 caches so that you don't have to allocate your critical data structures in separate "faster" memory areas. That said, if you really want to manage your own cache, it should already be easy to do so using functools.lru_cache() (or any implementation of your choice). The re module doesn't have to provide a dedicated caching primitive. But, really, the point of a cache is to optimize performance *without* you tinkering with it. Regards Antoine.
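For example, a do-it-yourself variant along those lines can be as small as this (cached_compile and splittype_like are made-up names for illustration):

    import re
    from functools import lru_cache

    # An unbounded, application-controlled cache around re.compile.
    cached_compile = lru_cache(maxsize=None)(re.compile)

    def splittype_like(url):
        # Compiled on the first call, served from the lru_cache on every later call.
        return cached_compile('^([^/:]+):').match(url)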

On 2013-03-23, at 15:46 , Antoine Pitrou wrote:
Right, but in this case, while I called it a cache, the semantics are really those of a lazy singleton: only create the regex object when it's needed, but keep it around once it's been created. The issue with a "proper cache" is that it performs via heuristics and may or may not correctly improve performance, as the heuristics will never match all possible programs with the ideal behavior. If it's known that we want to keep compiled regexps used by a module memoized (which is what the current urllib.parse code does/assumes), cache semantics don't really work, as — depending on the rest of the application — the cached module regexps may get evicted, unless the cache has an unlimited size.

On Sat, Mar 23, 2013 at 8:33 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
As in something like:

    import re
    from functools import wraps

    def compute_once(f, *args, **kwds):
        value = not_called = object()
        @wraps(f)
        def compute_on_demand():
            nonlocal value
            if value is not_called:
                value = f(*args, **kwds)
            return value
        return compute_on_demand

    _pattern = compute_once(re.compile, r"my very long regex")

    def use_pattern(data):
        return _pattern().search(data)

Runtime overhead is then just the identity check against the initial sentinel value. The difference with both functools.partial and functools.lru_cache is that it wouldn't support customisation at call time - you have to fully define the operation up front; the only thing you're deferring is the actual call. That call will only happen once, with all subsequent calls returning the same value as the initial call. This is what allows the call time overhead to be stripped back to almost nothing. Seems like a reasonable justification for a dedicated tool to me. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Masklinn, 23.03.2013 14:26:
Remember that any precompiled regex that got thrown out of the cache will be rebuilt as soon as it's used again. So the problem only ever arises when you really have more than _MAXCACHE different regexes that are all being used within the same loop, and even then, they'd have to be used in (mostly) the same order to render the cache completely useless. That's a very rare case, IMHO. In all other cases, whenever the number of different regexes being used within a loop is lower than _MAXCACHE, the cache will immediately bring a substantial net win. And if a regex is not being used in a loop, then it's really unlikely that its compilation time will dominate the runtime of your application (assuming that your application is doing more than just compiling regexes...). Stefan

Hi, On Sat, Mar 23, 2013 at 12:42 AM, Gregory P. Smith <greg@krypto.org> wrote:
Sometimes it is, see e.g. http://bugs.python.org/issue11454. Best Regards, Ezio Melotti

On Fri, 22 Mar 2013 15:42:49 -0700 "Gregory P. Smith" <greg@krypto.org> wrote:
Precompiled regexes were a major contributor to Python startup time: http://bugs.python.org/issue13150 http://hg.python.org/cpython/rev/df950158dc33/ In the stdlib we should be rather careful about this. Third-party libraries can ignore such concerns if they aren't meant to be used in interactive applications. Regards Antoine.

On 23.03.2013 13:18, Antoine Pitrou wrote:
Wouldn't it make sense to add a way to pickle or marshal compiled REs? The precompiled REs could then be loaded directly from the pickle, avoiding the compiling overhead on startup. -- Marc-Andre Lemburg, eGenix.com

On 23.03.2013 13:43, Ronny Pfannschmidt wrote:
I wasn't thinking of making it part of the Python byte-code. It would suffice to add pickle/marshal support for the compiled RE code. This could then be loaded from a string embedded in the module code on startup. E.g.

    # rx = re.compile('.*')
    rx = pickle.loads('asdfsadfasdf')

It would also be possible to seed the re module cache with such pickle.loads calls, perhaps compiled at Python build time. This would avoid having to change code in the stdlib to load pickles. -- Marc-Andre Lemburg, eGenix.com

On Sat, Mar 23, 2013 at 11:52 PM, M.-A. Lemburg <mal@egenix.com> wrote:
What would that do to versioning? Currently, as I understand it, the compiled RE is a complete implementation detail; at any time, the re module can change how it stores it. Pickles (again, as I understand it - I may be wrong) should be readable on other versions of Python (forward-compatibly, at least), on other architectures, etc, etc; would this be a problem? Alternatively, at the expense of some storage space, there could be some kind of fallback. If the tag doesn't perfectly match the creating Python's tag, it ignores the dumped version and just compiles it as normal. Hmm. Here's a mad thought - a bit of latticed casementing, if you like. Could the compiled regexes be stored in the .pyc file? That already has version tagging done. All it'd take is some sort of extension mechanism that says "hey, here's some additional data that the pyc might want to make use of". Or would that overly complicate matters? ChrisA

On 23 March 2013 10:15, Chris Angelico <rosuav@gmail.com> wrote:
Please note that compiled regexps can already be pickled straightforwardly. Unfortunately, to avoid the version issues you mention, from looking at the pickled string it appears it just calls "re.compile" with the original regex on unpickle - so there would be no gain from the implementation as is. (I should stop being that lazy and check what unpickling a regexp actually does - ah, Ezio found it while I was at it.)
I can't see how this could be achieved except by adding a special syntax that would compile regexps at parsing time. Then we might as well use Perl instead :-) But maybe some custom serializing could go straight into the sre_code that would properly serialize its objects as Python bytecode, and then some helper functions to load them from a custom-made pyc file. These pre-generated pycs would be built at Python build time.

On 23 March 2013 11:09, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
There it is, straight in re.py:

    import copyreg

    def _pickle(p):
        return _compile, (p.pattern, p.flags)

    copyreg.pickle(_pattern_type, _pickle, _compile)

So, pickling regexps as they are now is definitely no speed-up.
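A quick illustration (not from the thread) of what that reducer means in practice: the pickle just carries the pattern string and flags, so loading it goes back through re._compile and recompiles from scratch.

    import pickle
    import re

    pat = re.compile(r'^([^/:]+):')
    data = pickle.dumps(pat)

    # The raw pattern text travels inside the pickle payload...
    print(b'^([^/:]+):' in data)                                         # True
    # ...and unpickling rebuilds an equivalent pattern by recompiling it.
    restored = pickle.loads(data)
    print(restored.pattern == pat.pattern, restored.flags == pat.flags)  # True True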

On Sun, Mar 24, 2013 at 1:09 AM, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
Yeah, that's the most obvious form - some kind of regex literal syntax. I was thinking, though, that there might be some sort of extension to the pyc format that lets any module add precompiled data to it; the trouble would then be figuring out how to recognize what ought to get dumped into the pyc. It'd effectively need to be something that gets added to the code like:

    foo = re.compile('fo+')
    bar = re.compile('ba+r')
    re.precompile(foo, bar)

That could then pre-populate some kind of cache that gets loaded with the pyc, and then when re.compile() gets a particular string, it looks it up in the cache and finds the precompiled version. Of course, this would quite possibly be more effort than it's worth. Complicating the pyc format in this way needs a lot of justification. ChrisA

On Sat, Mar 23, 2013 at 2:52 PM, M.-A. Lemburg <mal@egenix.com> wrote:
According to http://bugs.python.org/issue11454#msg170697, this would be twice as slow. Best Regards, Ezio Melotti

On 23.03.2013 14:53, Ezio Melotti wrote:
RE objects can already be pickled and that's also what was measured in that message. It doesn't actually pickle the RE "byte" code, though. Instead it just pickles the pattern and the flags and does a complete recompile when unpickling the RE object. I was talking about actually pickling the RE "byte" code that the re module generates to avoid the overhead of having to recompile the pattern. I'm pretty sure this would be faster :-)
On 23.03.2013 14:36, Oleg Broytman wrote:
The patterns used in the stdlib could be precompiled at Python build time and then stored away in a separate module, say _re_cache_preloader.py. This module would then be used to seed the re module cache. Since the cache works by looking at the patterns, no recompile would happen when calling re.compile() on these patterns. The approach is similar to the way the sysconfig module information is cached in a separate module to reduce startup time (something we also did in eGenix PyRun to improve startup time - and because we had to do it anyway, since there is no Makefile available to parse when running pyrun). -- Marc-Andre Lemburg, eGenix.com

On Sat, 23 Mar 2013 20:41:49 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
The problem is that for pickles to be durable, you would then need some kind of compatibility guarantee for the re bytecode. Otherwise you might add the bytecode version number to the pickle and then ignore the pickled bytecode when the current version number differs; but that would mean people would lose the benefit of caching without being warned, which would make performance more fickle. Regards Antoine.

On 23.03.2013 20:41, Antoine Pitrou wrote:
Hmm, I'm not following you. The patterns would get compiled once at Python build time when installing the stdlib. The bytecode version wouldn't change for those compiled patterns - unless, of course, you upgrade to a new Python version, but then you'd rebuild the bytecode versions of the REs :-) To make them generally useful, I agree, you would have to add an RE compiler version to the bytecode pickle, but AFAICS this should not affect the usefulness for the stdlib RE cache. The whole idea is really very similar to the Python VM bytecode caching Python is using to speed up imports of modules. Perhaps we could have a GSoC student give it a try and see whether it results in noticeable startup time speedups?! -- Marc-Andre Lemburg, eGenix.com

On Sat, 23 Mar 2013 22:19:02 +0100 "M.-A. Lemburg" <mal@egenix.com> wrote:
Ah, you're talking only about the stdlib. Well, sure, that would work, but we have to remember to regenerate those pickles by hand each time the re bytecode is updated (which doesn't happen often, admittedly). That's a bit of a maintenance burden.
The whole idea is really very similar to the Python VM bytecode caching Python is using to speed up imports of modules.
Except that the VM bytecode caching works automatically and transparently :-)
Perhaps we could have a GSoC student give it a try and see whether it results in noticeable startup time speedups?!
That's a rather smallish topic for a GSoC project, IMHO. Regards Antoine.

On 23.03.2013 22:20, Antoine Pitrou wrote:
No, that would happen at build time automatically. setup.py would create the module with the pickled RE bytecodes by scanning the stdlib modules for RE patterns, the re module would use this to seed its cache. That's the high-level idea. I'm sure there are a few pitfalls along the way :-)
Should be the same for the REs in the stdlib. The user wouldn't notice (except for the speedup hopefully). Code in the stdlib compiling the REs wouldn't need to be touched either, since the cache in the re module would simply reuse the compiled versions.
Well, you could extend it by adding some RE optimization tasks on top of it :-) -- Marc-Andre Lemburg, eGenix.com

On Sat, Mar 23, 2013 at 01:38:19PM +0100, "M.-A. Lemburg" <mal@egenix.com> wrote:
But with an overhead of opening files and unpickling. My wild guess [not backed by numbers] is that it would be just as slow. I suspect the only way to speed things up would be to precompile regexps at compile (i.e. .pyc generation) time and save the compiled regexps in the byte code file. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

To summarize:

- compiling regexes is slow, so applications frequently compile them once and save the result
- compiling all the regexes at startup slows down startup for regexes that may never be used
- a common pattern is to compile once at time of use, and it would be nice to optimize this pattern
- the regex library has a cache feature, which means that frequently it will be optimized automatically
- however, there's no guarantee that the regex you care about won't fall out of the cache.

I think this addresses all the issues better than compute_lazy:

    re.compile(r'...', keep=True)

When keep=True is specified, the regex library keeps the cached value for the lifetime of the process. The regex is compiled only once, on first use, and you don't need to create a place to store it. Furthermore, if you use the same regex in more than one place, once with keep=True, the other uses will automatically be optimized. --- Bruce Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security
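Something close to the proposed keep=True could already be emulated outside the re module with a process-lifetime dict; a rough sketch (compile_keep, _kept, and splittype_like are made-up names for illustration):

    import re

    _kept = {}   # (pattern, flags) -> compiled pattern, held for the process lifetime

    def compile_keep(pattern, flags=0):
        key = (pattern, flags)
        try:
            return _kept[key]
        except KeyError:
            compiled = _kept[key] = re.compile(pattern, flags)
            return compiled

    def splittype_like(url):
        # Compiled once on first call, then pinned regardless of re's own cache policy.
        return compile_keep('^([^/:]+):').match(url)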

keep=True defeats the purpose of a caching strategy. An re.compile call within some code somewhere is typically not in a position to know if it is going to be called a lot. I think the code as it is now, with dynamic construction at runtime based on a simple test, is the best of both worlds: it avoids the more complicated cost of calling re.compile and going through its cache logic. If the caching is ever improved in the future to be faster, the code can arguably be simplified to use re.search or re.match directly and rely solely on the caching. i.e.: don't change anything. On Sat, Mar 23, 2013 at 4:03 PM, Bruce Leban <bruce@leapyear.org> wrote:

On Sat, Mar 23, 2013 at 4:14 PM, Gregory P. Smith <greg@krypto.org> wrote:
Truth is people are currently doing caching themselves, by compiling and then keeping the compiled regex. Saying they're not in a position to know whether or not to do that isn't going to change that. Is it worthwhile having the regex library facilitate this manual caching? --- Bruce Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security

Gregory P. Smith, 24.03.2013 00:48:
+1 If I had been "more aware" of the re internal cache during the last years, I would have avoided at least a couple of re.compile() calls in my code, I guess. Maybe this is something that the documentation of re.compile() can help with, by telling people explicitly that this apparently cool feature of pre-compiling actually has a drawback in it (startup time + a bit of memory usage) and that they won't notice a runtime difference in most cases anyway. Stefan

On 28/03/13 10:06, Terry Reedy wrote:
On the contrary, I think that it is the cache which is an (unattractive) nuisance. Like any cache, performance is only indirectly under your control. You cannot know for sure whether re.match(some_pattern, text) will be a cheap cache hit or an expensive re-compilation. All you can do is keep increasing the size of the cache until the chance of a cache miss is "low enough", whatever that means for you, and hope. I cannot think of any object in the Python standard library where the recommended API is to repeatedly convert from strings each time you need the object. We do this:

    x = Decimal(some_string)
    y = x**3
    z = x.exp()

not this:

    y = Decimal(some_string)**3
    z = Decimal(some_string).exp()

hoping that the string will be in a cache and the conversion will be fast. So why do we do this?

    result = re.match(some_string, text)
    other_result = re.match(some_string, other_text)

-- Steven

On Thu, Mar 28, 2013 at 12:25 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Would it be better if, instead of: pat = re.compile(some_string) it were spelled: pat = re.RegExp(some_string) ? It'd match Decimal and so on, while still being exactly the same thing ultimately - you turn the textual regex into an object. ChrisA

On 28/03/13 12:38, Chris Angelico wrote:
No, you seem to have missed my point. I don't care whether we have a builder like re.compile() that turns a string into a regular expression object, or we use the RegExp type constructor directly. What I care about is that we don't recommend that people rely on the cache to handle that conversion, as Terry seems to be suggesting. -- Steven

On Thu, Mar 28, 2013 at 12:56 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, that's what I mean. If compile were renamed RegExp and the cache abolished, would people find this odd, or would they be happy to do their own compiled-regex retention? I suspect the latter. There could still be (non-caching) re.match and friends, but it'd be understood that they are less efficient for multiple usage. ChrisA

An alternative would be to use a lazy proxy. Given LazyProxy defined as below you can use

    pat = LazyProxy(re.compile, r"\w+")

instead of

    pat = re.compile(r"\w+")

This causes a minor slow down (~7%) if you use pat.search() in a loop.

    class LazyProxy(object):
        def __init__(self, func, *args):
            self.__func_args = (func, args)

        def __getattr__(self, name):
            try:
                obj = object.__getattribute__(self, '_LazyProxy__obj')
            except AttributeError:
                func, args = self.__func_args
                obj = self.__obj = func(*args)
            res = getattr(obj, name)
            setattr(self, name, res)
            return res

-- Richard

Terry Reedy writes:
With a decent re cache size, .compile seems more like an attractive nuisance than something useful.
In applications like text editors, you often have scads of ephemeral regexps (entered by users) but some you'd like to keep around (python_keyword_re, for example). I don't know if this is actually a performance hit in such applications, but like Steven d'A, I'm not comfortable in relying on a cache for performance-critical apps. It seems to me that the cache is a backstop to reduce the chance that an application repeatedly compiles a given regexp in an inner loop, but can't guarantee elimination of all such performance problems. I'm also not clear on why you consider it an attractive nuisance. It's annoying enough that people will only do it for regexps they consider important, so I doubt people will be compiling thousands of regexps that only get used once in a blue moon. What's the loss to having the facility available (aside from the overhead of maintenance and documentation -- "attractive nuisance" implies users with holes in their feet :-)?

On Sat, Mar 23, 2013 at 4:03 PM, Bruce Leban <bruce@leapyear.org> wrote:
Nice summary. The real problem is, I think, that many developers are not aware of the default caching done by the re module. I have a hunch that if this were better known, fewer manual optimization attempts would spring up. How about examining what the size of that re cache is, and how much memory it typically occupies? Perhaps this cache can be changed to fit more regexes? Eli
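A rough way to poke at those numbers is to inspect the module's private cache directly (re._cache and re._MAXCACHE are CPython implementation details of that era, not a public API, so treat this only as a sketch):

    import re
    import sys

    # Warm the cache with a few patterns.
    for p in ('^([^/:]+):', '^(.*)@(.*)$', '^(.*)#([^#]*)$'):
        re.compile(p)

    print("max entries:     ", re._MAXCACHE)
    print("currently cached:", len(re._cache))
    # Very crude estimate: the dict plus the shallow size of each cached object.
    total = sys.getsizeof(re._cache) + sum(sys.getsizeof(v) for v in re._cache.values())
    print("approx. memory:  ", total, "bytes")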

Eli Bendersky, 24.03.2013 04:39:
The problem is that there is no "default workload" to measure against. If a stupid dispatch engine automatically generates tons of regexes to compare, say, URLs against, you can make the cache as large as you want and it won't help. I doubt that that's a serious use case, though - combining all of them into a single regex (and then actually pre-compiling that statically instead of relying on the cache) would be way smarter. Apart from extreme cases like the above, a cache size of 512 sounds *plenty* large, more like a "we don't know, so make it large" kind of choice. Maybe measuring could actually help in making it *smaller*, but I guess that's simply not worth it. If the space is not used, it won't grow that large anyway. Stefan
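For the dispatch case, the "combine them into a single regex" idea could look roughly like this (the URL patterns are invented for illustration):

    import re

    # One statically pre-compiled pattern instead of many; lastgroup names the branch that matched.
    _DISPATCH = re.compile(
        r'^/users/(?P<user>\d+)$'
        r'|^/posts/(?P<post>\d+)$'
        r'|^/tags/(?P<tag>[\w-]+)$'
    )

    def dispatch(path):
        m = _DISPATCH.match(path)
        if m is None:
            return None
        return m.lastgroup, m.group(m.lastgroup)

    print(dispatch('/posts/42'))   # ('post', '42')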

On 23/03/13 09:31, Ronny Pfannschmidt wrote:
Since there are only eight of them, and none of them are particularly big or complicated regexes, I believe that they should be unconditionally compiled at startup. E.g. instead of this code:

    # from urllib.parse
    _typeprog = None
    def splittype(url):
        """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
        global _typeprog
        if _typeprog is None:
            import re
            _typeprog = re.compile('^([^/:]+):')
        match = _typeprog.match(url)
        [...]

something like this is much simpler and more obvious:

    _typeprog = re.compile('^([^/:]+):')

    def splittype(url):
        """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
        match = _typeprog.match(url)
        [...]

or we can get rid of the global altogether:

    def splittype(url, _typeprog=re.compile('^([^/:]+):')):
        """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
        match = _typeprog.match(url)
        [...]
It's clear that compiling a regex is expensive, so having them be compiled later, at first use, is of some benefit
Sounds like premature optimization to me. I've extracted out the regex patterns, and timed how long it takes to compile them. On my computer, it's a one-off cost of about 3 milliseconds. Here's the code I used to time it:

    # === start ===
    import re
    import timeit

    patterns = [
        ('^([^/:]+):', ),
        ('^//([^/?]*)(.*)$', ),
        ('^(.*)@(.*)$', ),
        ('^([^:]*):(.*)$', re.S),
        ('^(.*):([0-9]+)$', ),
        ('^(.*):(.*)$', ),
        ('^(.*)\?([^?]*)$', ),
        ('^(.*)#([^#]*)$', ),
        ('^([^=]*)=(.*)$', ),
    ]

    setup = "from __main__ import re, patterns"

    t1 = timeit.Timer("""
    re.purge()
    for pat in patterns:
        re.compile(*pat)
    """, setup)

    t2 = timeit.Timer("""
    re.purge()
    for pat in patterns:
        pass
    """, setup)

    print("compiling urllib.parse patterns:")
    a = min(t1.repeat(number=1000, repeat=7))
    b = min(t2.repeat(number=1000, repeat=7))
    print(a-b, "ms")
    # === end ===

Admittedly this would (probably) increase the time taken to import urllib.parse by about 50% (from 5-6ms to 8-9ms), but I don't see this as significant. It's not even a real cost -- you still have to compile the patterns at some point; the only question is whether they are done up front or as needed. -- Steven

participants (17):
- Antoine Pitrou
- Bruce Leban
- Chris Angelico
- Eli Bendersky
- Ezio Melotti
- Gregory P. Smith
- Joao S. O. Bueno
- M.-A. Lemburg
- Masklinn
- Nick Coghlan
- Oleg Broytman
- Richard Oudkerk
- Ronny Pfannschmidt
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy