[Python-Dev] Caching of builtins and globals
Samuele Pedroni
pedronis@bluewin.ch
Thu, 12 Jun 2003 15:23:12 +0200
I have played a bit with the notion of caching builtin and global values
(to be precise, the addresses of their dictionary entries) without changing
language semantics:
I have an experimental modification to the interpreter such that I get
these numbers: [bl_ is the baseline CVS code, ex_ is the modified interpreter]:
$ ./bl_python test.py
nop func
5.26905059814 ms
5.17392158508 ms
5.10895252228 ms
all-locals func
3537.45496273 ms
3536.81099415 ms
3537.69803047 ms
global/builtin-heavy func
4377.18105316 ms
4378.40998173 ms
4375.61702728 ms
sup ctr
57391.8089867 ms
57239.0290499 ms
57295.1930761 ms
$ ./ex_python test.py
nop func
5.01906871796 ms
5.34105300903 ms
5.30099868774 ms
all-locals func
3505.41305542 ms
3529.50799465 ms
3529.50108051 ms
global/builtin-heavy func
3750.57899952 ms
3752.10404396 ms
3725.95608234 ms
sup ctr
47528.2179117 ms
47529.8720598 ms
47295.0160503 ms
$
for the following test script:
<test.py>
R = 3
N = 5000
M = 1000

import time

ch = 'T'

def bot():
    pass

class d(dict):
    def __init__(self):
        dict.__init__(self, {1: 1})

def f():
    # all-locals: bind the builtin and the global to locals up front
    ord_ = ord
    ch_ = ch
    x = 0
    for i in range(M):
        x += ord_(ch_)

def g():
    # global/builtin-heavy: ord and ch are looked up on every iteration
    x = 0
    for i in range(M):
        x += ord(ch)

def h():
    # subclass constructor calling up to dict.__init__
    dic = None
    for i in range(M):
        dic = d()

def test(func, n):
    # note: n is effectively unused, every test runs func N times
    t = time.time()
    for i in xrange(N):
        func()
    print (time.time()-t)*1000, ' ms'

print "nop func"
for i in range(R):
    test(bot, N)
print "all-locals func"
for i in range(R):
    test(f, N)
print "global/builtin-heavy func"
for i in range(R):
    test(g, N)
print "sup ctr"
for i in range(R):
    test(h, 100)
</test.py>
[AMD Athlon 750MHz under RH Linux 7.2; pystone seems to get a 4-5%
improvement; nevertheless I expect slowdowns in some cases]
ideas/issues:
- the most "evil" thing I do is add a further pointer to all dicts to store
what I call a version token. Reshaping* mutation paths for dicts grow code
to update the version token; it is executed conditionally, only if the
version token pointer is non-null, i.e. the version token was asked for at
least once and so one was put in place. So mutations of normal dicts (not
used as globals/builtins) pay only for skipping that code with a
check-non-null-and-jump. [Version tokens are small objects with their own
ref counting logic; they are not version counters, both for reasons related
to overflow and so that they also capture the identity of their dict:
dic1 != dic2 => dic1.version != dic2.version.]
Using different dict kinds could avoid the all-pay problem. What does get
slowed down is, e.g.:

import mod
mod.foo = 1

if 'foo' doesn't already exist in mod, because a new version token must be
generated.
[*: empty entry to non-empty, the converse, and anything that changes the
key->entry (address) mapping (e.g. resizing)]
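In sketch form the token bookkeeping is something like this (illustrative
names only; the actual code differs in details, e.g. error handling is
elided):

    #include <stdlib.h>

    typedef struct {
        long refcnt;                /* tokens carry their own refcount */
    } VersionTag;

    static VersionTag *
    version_tag_new(void)
    {
        VersionTag *t = (VersionTag *)malloc(sizeof(VersionTag));
        if (t != NULL)
            t->refcnt = 1;
        return t;
    }

    static void
    version_tag_decref(VersionTag *t)
    {
        if (t != NULL && --t->refcnt == 0)
            free(t);
    }

    /* Imagine PyDictObject grown by one field:  VersionTag *version;
       it stays NULL until a LOAD_GLOBAL cache first asks for it.
       This is called from every reshaping mutation path; a dict never
       used as globals/builtins keeps version == NULL and pays only
       the check-and-jump. */
    static void
    dict_bump_version(VersionTag **slot)
    {
        if (*slot != NULL) {
            version_tag_decref(*slot);
            *slot = version_tag_new();  /* new token, new identity */
        }
    }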
- I didn't touch the compiler/bytecode. I allocate caches with
len(co_names) entries; LOAD_GLOBAL args range over [0, len(co_names)).
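Per code object that amounts to roughly (a sketch; PyDictEntry is the entry
struct from dictobject.h):

    /* one slot per name in co_names, indexed directly by the
       LOAD_GLOBAL oparg */
    int n = PyTuple_GET_SIZE(co->co_names);
    PyDictEntry **glocache =
        (PyDictEntry **)calloc(n, sizeof(PyDictEntry *));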
- the basic invariant is that as long as the version tokens of a pair of
globals/builtins dicts are unchanged, a LOAD_GLOBAL retrieves its value
*at the same* dict entry address. So the LOAD_GLOBAL fast path becomes:
case LOAD_GLOBAL:
    if (((PyDictObject *)f->f_globals)->version == glover &&
        ((PyDictObject *)f->f_builtins)->version == bltver) {
        if (glocache[oparg]) {
            TRC_GLOCACHE("using cache\n");
            x = glocache[oparg]->me_value;
            Py_INCREF(x);
            PUSH(x);
            continue;
        }
    } else
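where the else branch (and likewise an empty cache slot) falls back to the
normal lookup and refills the cache. Roughly, as a sketch (ask_version and
lookup_entry are made-up helper names; name is co_names[oparg]):

    /* miss path sketch: refresh the tokens (a fresh cache is installed
       first if the tokens changed, as described below), do the usual
       globals-then-builtins lookup, and remember the entry address;
       ask_version() returns the dict's current token, creating one on
       first request */
    PyDictEntry *ep;
    glover = ask_version((PyDictObject *)f->f_globals);
    bltver = ask_version((PyDictObject *)f->f_builtins);
    ep = lookup_entry((PyDictObject *)f->f_globals, name);
    if (ep == NULL || ep->me_value == NULL)
        ep = lookup_entry((PyDictObject *)f->f_builtins, name);
    if (ep != NULL && ep->me_value != NULL) {
        glocache[oparg] = ep;          /* cache the entry address */
        x = ep->me_value;
        Py_INCREF(x);
        PUSH(x);
        continue;
    }
    /* otherwise: NameError, as in the unmodified interpreter */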
- caches are wrapped in CObjects (a quick hack) to get proper collection.
They are stored and shared through code objects, together with the
globals/builtins version tokens they correspond to. The invariant, plus the
fact that a cache is *only* used for a given pair of version tokens, should
guarantee that sharing is safe. If the version tokens change during the run
of some code, a fresh cache is created and used; the assumption is that
this is rare after the "startup" phase. Code that pays a price is, e.g., a
module whose top level sets up some globals, calls a function, adds some
more globals, calls the function again, ... because each such call will
create and start populating a fresh cache.
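Conceptually, the record shared through a code object is (a sketch only;
the actual patch hides this behind the CObject):

    typedef struct {
        VersionTag   *glover;   /* globals token the cache was built for */
        VersionTag   *bltver;   /* builtins token */
        PyDictEntry **entries;  /* len(co_names) cached entry slots */
    } GlobalCache;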
- although I have tried to trigger DECREFs (which can run arbitrary code)
only from a consistent state, that's a delicate aspect. Getting version
tokens updated in a timely fashion is a similar problem.
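E.g. when swapping in a fresh cache the order matters (co_glocache is a
made-up field name):

    /* detach first, so the state is consistent again *before* any
       arbitrary code can run; DECREF last */
    PyObject *old_cache = co->co_glocache;
    co->co_glocache = new_cache;
    Py_XDECREF(old_cache);              /* may run __del__ etc. */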
- the ceval switch is indeed an unwieldy beast. For sure I didn't manage
(nor really try) to find the best code layout with respect to CPU
cache/code generation. The problem is that the slowdowns for non-fast
paths (and sadly even unrelated stuff) depend on this.
It's two days' work (and my C is a bit rusty), so some bugs/issues could be
lurking; OTOH the test suite passes. I have no more time to play with this
further. It's not really in shape for general consumption, but if someone
is interested I can post the diff (it's less than 1000 lines together with
context).

regards.