On 23/03/13 09:31, Ronny Pfannschmidt wrote:
Hi,
while reviewing urllib.parse i noticed a pretty ugly pattern
many functions had an attached global and in their own code they would compile an regex on first use and assign it to that global
Since there are only eight of them, and none of them are particularly big or complicated regexes, I believe that they should be unconditionally compiled at startup. E.g. instead of this code: # from urllib.parse _typeprog = None def splittype(url): """splittype('type:opaquestring') --> 'type', 'opaquestring'.""" global _typeprog if _typeprog is None: import re _typeprog = re.compile('^([^/:]+):') match = _typeprog.match(url) [...] something like this is much simpler and more obvious: _typeprog = re.compile('^([^/:]+):') def splittype(url): """splittype('type:opaquestring') --> 'type', 'opaquestring'.""" match = _typeprog.match(url) [...] or we can get rid of the global altogether: def splittype(url, _typeprog=re.compile('^([^/:]+):')): """splittype('type:opaquestring') --> 'type', 'opaquestring'.""" match = _typeprog.match(url) [...]
its clear that compiling a regex is expensive, so having them be compiled later at first use would be of some benefit
Sounds like premature optimization to me. I've extracted out the regex patterns, and timed how long it takes to compile them. On my computer, it's a one-off cost of about 3 milliseconds. Here's the code I used to time it: # === start === import re import timeit patterns = [ ('^([^/:]+):', ), ('^//([^/?]*)(.*)$', ), ('^(.*)@(.*)$', ), ('^([^:]*):(.*)$', re.S), ('^(.*):([0-9]+)$', ), ('^(.*):(.*)$', ), ('^(.*)\?([^?]*)$', ), ('^(.*)#([^#]*)$', ), ('^([^=]*)=(.*)$', ), ] setup = "from __main__ import re, patterns" t1 = timeit.Timer(""" re.purge() for pat in patterns: re.compile(*pat) """, setup) t2 = timeit.Timer(""" re.purge() for pat in patterns: pass """, setup) print("compiling urllib.parse patterns:") a = min(t1.repeat(number=1000, repeat=7)) b = min(t2.repeat(number=1000, repeat=7)) print(a-b, "ms") # === end === Admittedly this would (probably) increase the time taken to import urllib.parse by about 50% (from 5-6ms to 8-9ms), but I don't see this as significant. It's not even a real cost -- you still have to compile the patterns at some point, the only question is whether they are done up front or as needed. -- Steven