[Python-ideas] Fast global cacheless lookup

Nov. 22, 2007

      I have a hack coded up against r59068 in which LOAD_GLOBAL is even 
faster than LOAD_FAST. It'll be the same with STORE_GLOBAL and the 
*_NAME opcodes after I'm done with it, and it should be fully 
transparent to Python code. (That is, you can go ahead and swap out 
__builtins__ and crazy junk like that and everything should work as it 
did before.) Regression tests all pass, except test_gc on functions - 
I've got a refcount bug somewhere.

Here's the microbenchmark I've been using to test LOAD_GLOBAL and LOAD_FAST:

import timeit
import dis

def test_local_get():
     x = 0
     x; x; x; #... and 397 more of them

if __name__ == '__main__':
     print dis.dis(test_local_get.func_code)
     print timeit.Timer('test_local_get()',
             'from locals_test import test_local_get').timeit()

The globals test puts 'x' in module scope, and the builtins test changes 
'x' to 'len' and doesn't assign it to 0.

Output right now:

r59068 locals: 15.57 sec
myhack locals: 15.61 sec (increase is probably insignificant or random)

r59068 globals: 23.61 sec
myhack globals: 15.14 sec (!)

r59068 builtins: 28.08 sec
myhack builtins: 15.26 sec (!!)

Of course, it's no good if it slows everything else way the heck down. 
So 10 rounds of pybench says:

r59068: mean 8.92, std 0.05
myhack: mean 8.99, std 0.04

 From what I see in pybench, globals access is severely underrepresented 
compared to real programs, so those numbers aren't representative of the 
possible difference in real-life performance.

Jim Jewett gave me the idea here:

http://mail.python.org/pipermail/python-ideas/2007-November/001207.html

"Note that weakening the module.__dict__ promise to only meeting the 
dict API would make it easier to implement the various speed-up-globals 
suggestions."

I didn't exactly do that, but it did get me thinking. The other 
proposals for speeding up globals access seemed to do their darndest to 
leave PyDictObject alone and ended up hideously complicated because of 
it. Here's the main idea for this one: What if a frame could maintain an 
array of pointers right into a dictionary's entry table? A global lookup 
would then consist of a couple of pointer dereferences, and any value 
change would show up immediately to the frame.

There was a dangerous dangling pointer problem inherent in that, so I 
formalized an update mechanism using an observer pattern.

Here's how it works. Arbitrary objects can register themselves with a 
dictionary as "entry observers". The dictionary keeps track of all the 
registered observers, and for certain events, makes a call to each one 
to tell them that something has changed. The entry observers get 
pointers to entries via PyDict_GetEntry, which is just like 
PyDict_GetItem, except it returns a PyDictEntry * right from the 
dictionary's entry table.

The dict notifies its observers on delitem, pop, popitem, resize and 
clear. Nothing else is necessary - nothing else will change the address 
of or invalidate an entry. There are very, very few changes in 
PyDictObject. In the general case, the pointer to the list of observers 
is NULL, and the only additional slowdown is when delitem, pop, popitem, 
resize and clear check that and move on - but those aren't called often.

So get, set, iter, contains, etc., are all exactly as fast as they were 
before. The biggest performance hit is when a highly-observed dict like 
__builtin__.__dict__ resizes, but that's rare enough to not worry about.

To speed up globals access, an auxiliary object to functions and frames 
registers itself as an observer to func_globals and __builtins__. It 
makes an array of PyDictEntry pointers corresponding to 
func_code.co_names. PyEval_EvalFrameEx indexes that array first for 
global values, and updates it if there's one it couldn't find when the 
function was created.

That's pretty much it. There are corner cases I still have to address, 
like what happens if someone replaces or deletes __builtins__, but it 
should be fairly easy to monitor that.

I'd love to hear your comments, everyone. I've glossed over a lot of 
implementation details, but I've tried to make the main ideas clear.

Neil

[Python-ideas] Fast global cacheless lookup

Neil Toronto