[Python-ideas] Fast global cacheless lookup
ntoronto at cs.byu.edu
Thu Nov 22 16:40:49 CET 2007
I have a hack coded up against r59068 in which LOAD_GLOBAL is even
faster than LOAD_FAST. It'll be the same with STORE_GLOBAL and the
*_NAME opcodes after I'm done with it, and it should be fully
transparent to Python code. (That is, you can go ahead and swap out
__builtins__ and crazy junk like that and everything should work as it
did before.) Regression tests all pass, except test_gc on functions -
I've got a refcount bug somewhere.
Here's the microbenchmark I've been using to test LOAD_GLOBAL and
LOAD_FAST (locals_test.py):

def test_local_get():
    x = 0
    x; x; x  # ... and 397 more of them

if __name__ == '__main__':
    from timeit import Timer
    print(Timer('test_local_get()',
                'from locals_test import test_local_get').timeit())
The globals test puts 'x' in module scope, and the builtins test changes
'x' to 'len' and doesn't assign it to 0.
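For concreteness, the globals variant would look something like this (the
file name globals_test.py and function name test_global_get are assumed
for illustration; the builtins variant just uses 'len' with no
module-level assignment):

```python
# globals_test.py (assumed name): 'x' lives at module scope, so every
# bare 'x' inside the function compiles to LOAD_GLOBAL instead of
# LOAD_FAST.
x = 0

def test_global_get():
    x; x; x  # ... and 397 more of them
```

It gets timed the same way, e.g.
Timer('test_global_get()', 'from globals_test import test_global_get').timeit().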
Output right now:
r59068 locals: 15.57 sec
myhack locals: 15.61 sec (increase is probably insignificant or random)
r59068 globals: 23.61 sec
myhack globals: 15.14 sec (!)
r59068 builtins: 28.08 sec
myhack builtins: 15.26 sec (!!)
Of course, it's no good if it slows everything else way the heck down.
So 10 rounds of pybench says:
r59068: mean 8.92, std 0.05
myhack: mean 8.99, std 0.04
From what I see in pybench, globals access is severely underrepresented
compared to real programs, so those numbers aren't representative of the
possible difference in real-life performance.
Jim Jewett gave me the idea here:
"Note that weakening the module.__dict__ promise to only meeting the
dict API would make it easier to implement the various speed-up-globals
proposals."
I didn't exactly do that, but it did get me thinking. The other
proposals for speeding up globals access seemed to do their darndest to
leave PyDictObject alone and ended up hideously complicated because of
it. Here's the main idea for this one: What if a frame could maintain an
array of pointers right into a dictionary's entry table? A global lookup
would then consist of a couple of pointer dereferences, and any value
change would show up immediately to the frame.
There was a dangerous dangling pointer problem inherent in that, so I
formalized an update mechanism using an observer pattern.
Here's how it works. Arbitrary objects can register themselves with a
dictionary as "entry observers". The dictionary keeps track of all the
registered observers, and for certain events, makes a call to each one
to tell them that something has changed. The entry observers get
pointers to entries via PyDict_GetEntry, which is just like
PyDict_GetItem, except it returns a PyDictEntry * right from the
dictionary's entry table.
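Since I'm glossing over the C-level details, here's a rough Python model
of the observer mechanism (ObservableDict, register_observer, and
entries_invalidated are invented names for illustration; the real hack
works on PyDictObject's entry table in C and hands out raw
PyDictEntry pointers):

```python
class ObservableDict(dict):
    """Toy model: a dict that tells its registered observers before any
    operation that can move or invalidate entries in its table."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._observers = []          # a NULL pointer in the C version

    def register_observer(self, observer):
        self._observers.append(observer)

    def _notify(self, event):
        for observer in self._observers:
            observer.entries_invalidated(self, event)

    # Only the entry-invalidating operations notify; __getitem__,
    # __setitem__, __contains__, iteration, etc. are untouched and
    # exactly as fast as before.
    def __delitem__(self, key):
        self._notify('delitem')
        super().__delitem__(key)

    def pop(self, *args):
        self._notify('pop')
        return super().pop(*args)

    def popitem(self):
        self._notify('popitem')
        return super().popitem()

    def clear(self):
        self._notify('clear')
        super().clear()


class LoggingObserver:
    """Records which invalidation events it was told about."""
    def __init__(self):
        self.events = []

    def entries_invalidated(self, d, event):
        self.events.append(event)
```

A plain __setitem__ never notifies anyone, which is the whole point:
the common operations pay nothing.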
The dict notifies its observers on delitem, pop, popitem, resize and
clear. Nothing else is necessary - nothing else will change the address
of or invalidate an entry. There are very, very few changes in
PyDictObject. In the general case, the pointer to the list of observers
is NULL, and the only additional slowdown is when delitem, pop, popitem,
resize and clear check that and move on - but those aren't called often.
So get, set, iter, contains, etc., are all exactly as fast as they were
before. The biggest performance hit is when a highly-observed dict like
__builtin__.__dict__ resizes, but that's rare enough to not worry about.
To speed up globals access, an auxiliary object attached to functions
and frames registers itself as an observer of func_globals and __builtins__. It
makes an array of PyDictEntry pointers corresponding to
func_code.co_names. PyEval_EvalFrameEx indexes that array first for
global values, and updates it if there's one it couldn't find when the
function was created.
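As a loose sketch of that auxiliary object (GlobalsCache is an invented
name; Python can't hold raw PyDictEntry pointers, so this sketch caches
which namespace each co_names entry resolved in, where the real hack
caches the entry pointer itself and dereferences it directly):

```python
class GlobalsCache:
    """Toy stand-in for the per-function cache: one slot per name in
    co_names, filled lazily, wiped whenever an observed dict reports
    an entry-invalidating event."""

    def __init__(self, names, globals_dict, builtins_dict):
        self.names = list(names)         # models func_code.co_names
        self.globals = globals_dict      # models func_globals
        self.builtins = builtins_dict    # models __builtins__
        self.slots = [None] * len(self.names)

    def entries_invalidated(self, d, event):
        # Observer callback: entries may have moved, so drop the cache.
        self.slots = [None] * len(self.names)

    def load_global(self, i):
        ns = self.slots[i]
        if ns is None:                   # slow path: the normal lookup chain
            name = self.names[i]
            if name in self.globals:
                ns = self.globals
            elif name in self.builtins:
                ns = self.builtins
            else:
                raise NameError(name)
            self.slots[i] = ns
        # Fast path: skips the globals-then-builtins chain. In the C
        # version this is a direct pointer dereference, not a hash lookup.
        return ns[self.names[i]]
```

Because the slot names a live namespace rather than a snapshot, a value
change in the dict shows up immediately on the next load, matching the
behavior described above.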
That's pretty much it. There are corner cases I still have to address,
like what happens if someone replaces or deletes __builtins__, but it
should be fairly easy to monitor that.
I'd love to hear your comments, everyone. I've glossed over a lot of
implementation details, but I've tried to make the main ideas clear.