Piling more complexity on is likely to slow down the common case though.
Not at all. The common case (the macro) will not have a single instruction added. The rare case (the fallback function) will be slowed down by two instructions: decrement a global counter, jump if zero. The reshuffle will be triggered every 100th occurrence of the rare case (which may be every 10000th occurrence of the common case).
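In pseudo-Python the scheme looks something like this (a minimal sketch only; the patch does it in C inside the dict fallback path, and the names REFRESH_INTERVAL and reshuffle are made up here for illustration):

    REFRESH_INTERVAL = 100
    _countdown = REFRESH_INTERVAL

    def slow_lookup(namespace, name):
        # The rare fallback path: it pays only for one decrement
        # and one test per call.
        global _countdown
        _countdown -= 1
        if _countdown == 0:                 # every 100th rare case...
            _countdown = REFRESH_INTERVAL
            reshuffle(namespace)            # ...re-optimize the dict
        return namespace[name]

    def reshuffle(namespace):
        # Stand-in for the expensive re-optimization step.
        pass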
It's the same approach used when designing a CPU memory cache: it's OK if a cache miss is made slower by loading an entire cache line, as long as it improves the hit rate enough to justify it.
OK, OK. I timed pystone, and it's about 1.5% faster. If that's the best we can do (and I think that any of the PEPs trying to optimize global and builtin access probably offer about the same improvement as your code), I'm not sure we should worry about this. I'd rather try to make method calls and instance attribute access (both methods and instance variables) faster...
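(If you want to reproduce that number, one way -- assuming the stdlib test package is installed -- is to run the bundled benchmark under both interpreters and compare the reported rates:)

    # Run under both the stock and the patched interpreter;
    # higher pystones/second is better:
    from test import pystone
    pystone.main()    # prints the benchmark time and pystones/second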
For each dictionary, the number of negative entries is bounded by the number of global and builtin names in the co_names of all code objects that get executed with that dictionary as their namespace.
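To make that concrete, here is a rough way to compute the bound for a module from Python itself (a sketch, not part of the patch; it walks only top-level functions via the Python 2 func_code attribute, so it understates the true bound):

    def negative_entry_bound(module):
        # Upper bound on negative entries in module's dict: every
        # name in co_names of code executed against that namespace.
        names = {}
        for obj in vars(module).values():
            code = getattr(obj, 'func_code', None)   # __code__ in Py3
            if code is not None:
                for name in code.co_names:
                    names[name] = True
        return len(names)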
And that's my worry. Suppose a program has only one global but touches every builtin. Will the dictionary properly get resized to accommodate all those negative entries?
    >>> len(dir(__builtins__)) * 24    # sizeof(PyDictEntry) on i386
    2736
Caching string hashes also costs memory; speed optimizations often do. These costs need to be weighed against complexity, performance, and even code size: other alternatives may use up as much code space as this one uses data in this worst-case scenario.
I wasn't worried about size; I was worried about correctness. But it appears that your code is fine.

BTW, here's a benchmarking tip. Instead of timing this:

    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

you should time the for loop in this code:

    x = [0]*10000000
    for i in x:
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

The xrange iterator allocates (and implicitly deallocates) an int per iteration, and you don't use that int at all. (10 million pointers is 40 MB -- if that's too much, it's OK to test with 1 million iterations.)

Quick outcome:

    CVS python 2.3:    builtin: 8.480   global: 6.780   local: 5.080   fast: 2.890
    With your patch:   builtin: 4.660   global: 4.160   local: 4.020   fast: 3.000

Hm, this reports:

    0 inline dictionary lookups
    1568 fast dictionary lookups
    121 slow dictionary lookups
    0.00% inline
    7.16% slow
    created 308 string dicts
    converted 32 to normal dicts
    10.39% conversion rate

Somehow the counts seem to be off by several orders of magnitude. What *do* these numbers report?

Also note that your patch slows down fast access (by 3%)! How can it? Adding more code to the interpreter's inner loop changes the cache behavior, etc. Tim Peters can tell you more about this.

--Guido van Rossum (home page: http://www.python.org/~guido/)
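For completeness, the same xrange-vs-list comparison can be scripted with the stdlib timeit module (new in 2.3); this is only a sketch of the builtin-lookup case, not the harness that produced the numbers above:

    import timeit

    line = "    hex ; hex ; hex ; hex ; hex"
    body = "\n".join([line] * 4)            # 20 builtin lookups per pass

    xrange_loop = "for i in xrange(1000000):\n" + body
    list_loop   = "for i in x:\n" + body

    t1 = timeit.Timer(xrange_loop)
    t2 = timeit.Timer(list_loop, setup="x = [0]*1000000")
    print "xrange loop:", t1.timeit(1)
    print "list loop:  ", t2.timeit(1)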