
I have a hack coded up against r59068 in which LOAD_GLOBAL is even faster than LOAD_FAST. It'll be the same with STORE_GLOBAL and the *_NAME opcodes after I'm done with it, and it should be fully transparent to Python code. (That is, you can go ahead and swap out __builtins__ and crazy junk like that and everything should work as it did before.) Regression tests all pass, except test_gc on functions - I've got a refcount bug somewhere.

Here's the microbenchmark I've been using to test LOAD_GLOBAL and LOAD_FAST:

    import timeit
    import dis

    def test_local_get():
        x = 0
        x; x; x; #... and 397 more of them

    if __name__ == '__main__':
        print dis.dis(test_local_get.func_code)
        print timeit.Timer('test_local_get()', 'from locals_test import test_local_get').timeit()

The globals test puts 'x' in module scope, and the builtins test changes 'x' to 'len' and doesn't assign it to 0.

Output right now:

    r59068 locals:    15.57 sec
    myhack locals:    15.61 sec (increase is probably insignificant or random)

    r59068 globals:   23.61 sec
    myhack globals:   15.14 sec (!)

    r59068 builtins:  28.08 sec
    myhack builtins:  15.26 sec (!!)

Of course, it's no good if it slows everything else way the heck down. So 10 rounds of pybench says:

    r59068: mean 8.92, std 0.05
    myhack: mean 8.99, std 0.04

From what I see in pybench, globals access is severely underrepresented compared to real programs, so those numbers aren't representative of the possible difference in real-life performance.

Jim Jewett gave me the idea here:

http://mail.python.org/pipermail/python-ideas/2007-November/001207.html

"Note that weakening the module.__dict__ promise to only meeting the dict API would make it easier to implement the various speed-up-globals suggestions."

I didn't exactly do that, but it did get me thinking. The other proposals for speeding up globals access seemed to do their darndest to leave PyDictObject alone and ended up hideously complicated because of it.

Here's the main idea for this one: What if a frame could maintain an array of pointers right into a dictionary's entry table? A global lookup would then consist of a couple of pointer dereferences, and any value change would show up immediately to the frame.

There was a dangerous dangling pointer problem inherent in that, so I formalized an update mechanism using an observer pattern.

Here's how it works. Arbitrary objects can register themselves with a dictionary as "entry observers". The dictionary keeps track of all the registered observers, and for certain events, makes a call to each one to tell them that something has changed. The entry observers get pointers to entries via PyDict_GetEntry, which is just like PyDict_GetItem, except it returns a PyDictEntry * right from the dictionary's entry table.

The dict notifies its observers on delitem, pop, popitem, resize and clear. Nothing else is necessary - nothing else will change the address of or invalidate an entry.

There are very, very few changes in PyDictObject. In the general case, the pointer to the list of observers is NULL, and the only additional slowdown is when delitem, pop, popitem, resize and clear check that and move on - but those aren't called often. So get, set, iter, contains, etc., are all exactly as fast as they were before. The biggest performance hit is when a highly-observed dict like __builtin__.__dict__ resizes, but that's rare enough to not worry about.

To speed up globals access, an auxiliary object to functions and frames registers itself as an observer to func_globals and __builtins__. It makes an array of PyDictEntry pointers corresponding to func_code.co_names. PyEval_EvalFrameEx indexes that array first for global values, and updates it if there's one it couldn't find when the function was created.

That's pretty much it. There are corner cases I still have to address, like what happens if someone replaces or deletes __builtins__, but it should be fairly easy to monitor that.

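To make the observer mechanism concrete, here's a toy model of it in pure Python. It's only a sketch under stated assumptions: the actual hack lives in C, and the names Entry, ObservedDict, GlobalsCache, get_entry, entries_invalidated and load_global are invented for illustration, not taken from the patch. One deliberate simplification: the toy also notifies observers when a new key is inserted (standing in for the resize event, and so that a global newly shadowing a builtin gets noticed); how the real code deals with that case isn't covered above.

```python
class Entry(object):
    """Stands in for a PyDictEntry: a slot whose value can change in place."""
    __slots__ = ("key", "value")

    def __init__(self, key, value):
        self.key = key
        self.value = value


class ObservedDict(object):
    """Toy dict that hands out entries and tells observers when they go stale."""

    def __init__(self):
        self._entries = {}       # key -> Entry
        self._observers = None   # stays None in the common, unobserved case

    def add_observer(self, observer):
        if self._observers is None:
            self._observers = []
        self._observers.append(observer)

    def get_entry(self, key):
        # Rough analogue of PyDict_GetEntry; returns None if the key is absent.
        return self._entries.get(key)

    def __getitem__(self, key):
        return self._entries[key].value

    def __setitem__(self, key, value):
        entry = self._entries.get(key)
        if entry is not None:
            entry.value = value  # in-place change: observers see it for free
        else:
            self._entries[key] = Entry(key, value)
            self._notify()       # assumption of this sketch, see the note above

    def __delitem__(self, key):
        del self._entries[key]
        self._notify()

    def clear(self):
        self._entries.clear()
        self._notify()

    def _notify(self):
        if self._observers is not None:
            for observer in self._observers:
                observer.entries_invalidated(self)


class GlobalsCache(object):
    """Per-function array of entries, indexed like co_names."""

    def __init__(self, names, globals_dict, builtins_dict):
        self.names = list(names)
        self.globals_dict = globals_dict
        self.builtins_dict = builtins_dict
        globals_dict.add_observer(self)
        builtins_dict.add_observer(self)
        self._slots = [self._find(name) for name in self.names]

    def _find(self, name):
        # Globals shadow builtins, as in the normal lookup order.
        entry = self.globals_dict.get_entry(name)
        if entry is None:
            entry = self.builtins_dict.get_entry(name)
        return entry

    def entries_invalidated(self, source):
        # Drop every cached entry; each is looked up again on next use.
        self._slots = [None] * len(self.names)

    def load_global(self, index):
        entry = self._slots[index]
        if entry is None:
            entry = self._slots[index] = self._find(self.names[index])
            if entry is None:
                raise NameError("global name %r is not defined" % self.names[index])
        return entry.value


g, b = ObservedDict(), ObservedDict()
b["len"] = len
g["x"] = 0
cache = GlobalsCache(["x", "len"], g, b)
g["x"] = 42                    # existing key: value changes in place
print(cache.load_global(0))    # 42
```

The fast path in load_global is one list index plus one attribute read, which is the Python-level counterpart of the "couple of pointer dereferences" described above. Throwing away the whole array on any notification is the bluntest possible policy; a real implementation could refresh only the affected entries.
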
I'd love to hear your comments, everyone. I've glossed over a lot of implementation details, but I've tried to make the main ideas clear.

Neil