Inexplicable timings
MRAB
google at mrabarnett.plus.com
Fri Aug 1 19:43:53 EDT 2008
I'm looking at the implementation of regular expressions in Python and
wrote a script to test my changes. This is the script:
import re
import time
base = "abc"
final = "d"
for n in [100, 1000, 10000]:
for f in [final, ""]:
for r in ["+", "+?"]:
pattern = "(?:%s)%s%s" % (base, r, final)
text_format = "(?:%s){%d}%s" % (base, n, f)
regex = re.compile(pattern)
text = base * n + f
start_time = time.clock()
found = regex.search(text)
finish_time = time.clock()
duration = finish_time - start_time
print "%s\t:: %s\t:: %.3f" % (text_format, pattern,
duration)
I've found some puzzling behaviour that I can't explain. If I delete
sre_parse.pyc or sre_compile.pyc before running the script I get
timings like this:
(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.022
(?:abc){1000} :: (?:abc)+?d :: 0.020
(?:abc){10000}d :: (?:abc)+d :: 0.001
(?:abc){10000}d :: (?:abc)+?d :: 0.000
(?:abc){10000} :: (?:abc)+d :: 1.961
(?:abc){10000} :: (?:abc)+?d :: 1.756
Subsequent runs give timings like this:
(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.020
(?:abc){1000} :: (?:abc)+?d :: 0.020
(?:abc){10000}d :: (?:abc)+d :: 0.001
(?:abc){10000}d :: (?:abc)+?d :: 0.000
(?:abc){10000} :: (?:abc)+d :: 3.953
(?:abc){10000} :: (?:abc)+?d :: 1.763
The second-from-last takes twice as long!
This is repeatable.
Python 2.5.2, on the other hand, gives timings like this:
(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.044
(?:abc){1000} :: (?:abc)+?d :: 0.027
(?:abc){10000}d :: (?:abc)+d :: 0.002
(?:abc){10000}d :: (?:abc)+?d :: 0.002
(?:abc){10000} :: (?:abc)+d :: 4.547
(?:abc){10000} :: (?:abc)+?d :: 2.858
repeatably, irrespective of whether I delete the .pyc files first.
Why is only one timing affected, and why by so much? Python quits
after each run, so it can't be anything cumulative, surely. I've
considered things like caching, for example compiling the .pyc files
on the first run, but can't see why (or how) that would cause what I
see.
Can anyone suggest a factor that I might be missing? Any suggestions
welcome!
Matthew
More information about the Python-list
mailing list