[Python-ideas] re.compile_lazy - on first use compiled regexes

Tue Mar 26 00:55:01 CET 2013

On 23/03/13 09:31, Ronny Pfannschmidt wrote:
> Hi,
>
> while reviewing urllib.parse i noticed a pretty ugly pattern
>
> many functions had an attached global and in their own code they would compile an regex on first use and assign it to that global

Since there are only eight of them, and none of them are particularly big or complicated regexes, I believe that they should be unconditionally compiled at startup. E.g. instead of this code:

# from urllib.parse

_typeprog = None
def splittype(url):
     """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
     global _typeprog
     if _typeprog is None:
         import re
         _typeprog = re.compile('^([^/:]+):')
     match = _typeprog.match(url)
     [...]

something like this is much simpler and more obvious:

_typeprog = re.compile('^([^/:]+):')
def splittype(url):
     """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
     match = _typeprog.match(url)
     [...]

or we can get rid of the global altogether:

def splittype(url, _typeprog=re.compile('^([^/:]+):')):
     """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
     match = _typeprog.match(url)
     [...]

> its clear that compiling a regex is expensive, so having them be compiled later at first use would be of some benefit

Sounds like premature optimization to me. I've extracted out the regex patterns, and timed how long it takes to compile them. On my computer, it's a one-off cost of about 3 milliseconds. Here's the code I used to time it:

# === start ===
import re
import timeit

patterns = [
     ('^([^/:]+):', ),
     ('^//([^/?]*)(.*)$', ),
     ('^(.*)@(.*)$', ),
     ('^([^:]*):(.*)$', re.S),
     ('^(.*):([0-9]+)$', ),
     ('^(.*):(.*)$', ),
     ('^(.*)\?([^?]*)$', ),
     ('^(.*)#([^#]*)$', ),
     ('^([^=]*)=(.*)$', ),
     ]

setup = "from __main__ import re, patterns"

t1 = timeit.Timer("""
re.purge()
for pat in patterns:
     re.compile(*pat)
""", setup)

t2 = timeit.Timer("""
re.purge()
for pat in patterns:
     pass
""", setup)

print("compiling urllib.parse patterns:")
a = min(t1.repeat(number=1000, repeat=7))
b = min(t2.repeat(number=1000, repeat=7))
print(a-b, "ms")
# === end ===

Admittedly this would (probably) increase the time taken to import urllib.parse by about 50% (from 5-6ms to 8-9ms), but I don't see this as significant. It's not even a real cost -- you still have to compile the patterns at some point, the only question is whether they are done up front or as needed.

-- 
Steven