[Python-ideas] re.compile_lazy - on first use compiled regexes
Steven D'Aprano
steve at pearwood.info
Tue Mar 26 00:55:01 CET 2013
On 23/03/13 09:31, Ronny Pfannschmidt wrote:
> Hi,
>
> while reviewing urllib.parse I noticed a pretty ugly pattern
>
> many functions had an attached global, and in their own code they would compile a regex on first use and assign it to that global
Since there are only nine of them, and none of them are particularly big or complicated regexes, I believe that they should be unconditionally compiled at startup. E.g. instead of this code:
# from urllib.parse
_typeprog = None
def splittype(url):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    global _typeprog
    if _typeprog is None:
        import re
        _typeprog = re.compile('^([^/:]+):')
    match = _typeprog.match(url)
    [...]
something like this is much simpler and more obvious:
_typeprog = re.compile('^([^/:]+):')

def splittype(url):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    match = _typeprog.match(url)
    [...]
or we can get rid of the global altogether:
def splittype(url, _typeprog=re.compile('^([^/:]+):')):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    match = _typeprog.match(url)
    [...]
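
For comparison, here is a rough sketch of the sort of thing the subject line asks for: a pattern object that defers re.compile until first use. This is only my own illustration of the idea (there is no re.compile_lazy in the re module, and LazyPattern is a name I've made up for this sketch), not a concrete proposal:

import re

class LazyPattern:
    """Wrap a pattern string; compile it only on first use."""
    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None
    def __getattr__(self, name):
        # Only reached for attributes not set in __init__, e.g. match/search/sub.
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return getattr(self._compiled, name)

_typeprog = LazyPattern('^([^/:]+):')

def splittype(url):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    match = _typeprog.match(url)  # first call triggers the compile
    [...]

That is noticeably more machinery than any of the three plain versions above, which is part of why the lazy approach looks like premature optimization to me.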
> it's clear that compiling a regex is expensive, so having them compiled later, on first use, would be of some benefit
Sounds like premature optimization to me. I've extracted the regex patterns and timed how long it takes to compile them. On my computer, it's a one-off cost of about 3 milliseconds. Here's the code I used to time it:
# === start ===
import re
import timeit
patterns = [
    ('^([^/:]+):', ),
    ('^//([^/?]*)(.*)$', ),
    ('^(.*)@(.*)$', ),
    ('^([^:]*):(.*)$', re.S),
    ('^(.*):([0-9]+)$', ),
    ('^(.*):(.*)$', ),
    ('^(.*)\?([^?]*)$', ),
    ('^(.*)#([^#]*)$', ),
    ('^([^=]*)=(.*)$', ),
]
setup = "from __main__ import re, patterns"
# time compiling all the patterns, purging the cache so each loop really recompiles
t1 = timeit.Timer("""
re.purge()
for pat in patterns:
    re.compile(*pat)
""", setup)

# the same loop without the compile, to subtract the purge and loop overhead
t2 = timeit.Timer("""
re.purge()
for pat in patterns:
    pass
""", setup)
print("compiling urllib.parse patterns:")
a = min(t1.repeat(number=1000, repeat=7))
b = min(t2.repeat(number=1000, repeat=7))
print(a-b, "ms")
# === end ===
Admittedly this would (probably) increase the time taken to import urllib.parse by about 50% (from 5-6ms to 8-9ms), but I don't see this as significant. It's not even a real cost -- you still have to compile the patterns at some point, the only question is whether they are done up front or as needed.
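
One rough way to check that import cost (a sketch only, not how I got the 5-6ms figure) is to time the import in a fresh interpreter each run, so urllib.parse isn't already cached in sys.modules:

# Sketch: time "import urllib.parse" in a fresh interpreter for each run.
import subprocess
import sys

snippet = (
    "import timeit; t0 = timeit.default_timer(); "
    "import urllib.parse; "
    "print((timeit.default_timer() - t0) * 1000)"
)
times = []
for _ in range(7):
    out = subprocess.check_output([sys.executable, "-c", snippet])
    times.append(float(out.decode()))
print("import urllib.parse:", min(times), "ms")

Results will vary with disk caching and with whether the .pyc files already exist, so treat the numbers as ballpark figures.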
--
Steven