Hi, I have attached a patch at: https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1739789...
A common optimization tip for Python code is to use locals rather than globals. This converts dictionary lookups of (interned) strings to tuple indexing. I have created a patch that achieves this speed benefit "automatically" for all globals and builtins, by adding a feature to dictobjects.
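To make the tip concrete, here is a minimal sketch (not part of the patch) showing how CPython compiles the same expression differently depending on whether a name is global or local. The names `scale`, `use_global`, `use_local` and `opnames` are made up for illustration, and the exact opcode names vary a little between CPython versions, so the check only looks at the `LOAD_FAST` prefix.

```python
import dis

scale = 3  # hypothetical module-level (global) name


def use_global(x):
    # 'scale' is resolved through the globals dict (LOAD_GLOBAL).
    return x * scale


def use_local(x, scale=scale):
    # 'scale' is now a local, resolved by index (a LOAD_FAST-family opcode).
    return x * scale


def opnames(func):
    """Return the opcode names the compiler emitted for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    print(opnames(use_global))
    print(opnames(use_local))
```

Binding the global to a default argument is the usual manual form of this optimization; the patch aims to get a similar effect without touching the source.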
Additionally, this patch lays down the infrastructure needed to accelerate virtually all attribute accesses in the same way (with some extra work, of course).
I have suggested this before, but I got the impression that the spirit of the replies was "talk is cheap, show us the code/benchmarks". So I wrote some code.
Getting the changes to work was not easy, and required learning about the nuances of dictobjects, their re-entrancy issues, etc. These changes do slow down dictobjects, but it seems that this slowdown is more than offset by the speed increase of builtins/globals access.
A set of benchmarks that repeatedly perform:
A. Global reads
B. Global writes
C. Builtin reads
with little overhead (just repeatedly issuing global/builtin access bytecodes, many times per loop iteration to minimize the loop overhead), yields a 30% time decrease (~42% speed increase).
Regression tests take ~62 seconds (of user+sys time) with Python 2.6 trunk.
Regression tests take ~65 seconds (of user+sys time) with the patch.
So the regression tests are about 4.5% slower. (Though the regression tests are probably not a good benchmark: they spread their running time over a lot more code than other programs do, spending more time instantiating function objects and less time executing them.)
pystone seems to be improved by about 5%.
My conclusions: The LOAD_GLOBAL/STORE_GLOBAL opcodes are considerably faster. Dict accesses, or perhaps the general extra activity around them, seem to be only insignificantly slower, or at least the slowdown roughly cancels out against the speed benefits in the regression tests.
The next step I am going to try is to replace the PyObject_GetAttr call with code that:
* Calls PyObject_GetAttr only if GenericGetAttr is not the object's handler, so as to still allow modifying the behaviour.
* Otherwise, remembers, for each attribute-accessing opcode, the last type from which the attribute was accessed. A single pointer comparison can check whether the attribute access is using the same type. If it is, the opcode can use a stored exported key from the type dictionary [or from an mro cache dictionary for that type, if that is added], rather than a dict lookup.
If it yields the same speed benefit, it could make attribute access opcodes up to 42% faster as well, when used on the same types (which is probably the common case, particularly in inner loops).
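The per-opcode caching idea can be modelled in pure Python, purely as an illustration. This is a sketch under assumptions, not the planned C implementation: `AttrSite`, `lookup_on_type` and the demo class `Point` are hypothetical names, the cached object merely stands in for the "exported key", and the sketch assumes types are not mutated after the first access (the real mechanism would have to track that).

```python
_MISSING = object()


def lookup_on_type(tp, name):
    """Type-side lookup: walk the MRO and return the raw class attribute."""
    for klass in tp.__mro__:
        if name in vars(klass):
            return vars(klass)[name]
    return _MISSING


class AttrSite:
    """Hypothetical model of one attribute-accessing opcode with a one-entry cache."""

    def __init__(self, name):
        self.name = name
        self.last_type = None
        self.cached = _MISSING
        self.hits = 0
        self.misses = 0

    def load(self, obj):
        tp = type(obj)
        # Custom attribute handler: take the fully general path so that
        # user-defined behaviour is still honoured.
        if tp.__getattribute__ is not object.__getattribute__:
            return getattr(obj, self.name)
        if tp is self.last_type:          # the single identity comparison
            self.hits += 1
        else:
            self.misses += 1
            self.last_type = tp
            self.cached = lookup_on_type(tp, self.name)
        found = self.cached
        if found is _MISSING:
            # Not on the type: instance attribute or error, use the general path.
            return getattr(obj, self.name)
        ftype = type(found)
        is_data = hasattr(ftype, "__set__") or hasattr(ftype, "__delete__")
        inst_dict = getattr(obj, "__dict__", {})
        if not is_data and self.name in inst_dict:
            return inst_dict[self.name]   # instance attribute shadows the class one
        if hasattr(ftype, "__get__"):
            return ftype.__get__(found, obj, tp)  # e.g. bind a method
        return found


class Point:
    def __init__(self, x):
        self.x = x

    def double(self):
        return 2 * self.x


if __name__ == "__main__":
    site = AttrSite("double")
    print([site.load(Point(i))() for i in range(5)], site.hits, site.misses)
```

When every object passing through the site has the same type, only the first access pays for the MRO walk; later accesses reuse the cached result after one identity check, which is the case the proposal is betting on.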
This will allow, in combination with __slots__, eliminating all dict lookups for most instance-side accesses as well.
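As a small illustration of why __slots__ matters here (the class `Vec` and the variable names are made up for this sketch): an instance of a slotted class has no per-instance dict, and each slot is a descriptor stored on the type, so a cached type-side lookup is enough to reach the value.

```python
class Vec:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


v = Vec(1, 2)

# No per-instance dictionary exists for slotted instances.
has_instance_dict = hasattr(v, "__dict__")

# The slot is a descriptor stored in the type's dictionary.
slot_descr = type(v).__dict__["x"]

# Reading through the descriptor fetches the value from the instance's storage.
value = slot_descr.__get__(v, Vec)

if __name__ == "__main__":
    print(has_instance_dict, type(slot_descr).__name__, value)
```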
P.S.: I discovered a lot of code duplication (and "went along" and duplicated my code in the same spirit). I was wondering whether a patch would be accepted that used C's preprocessor heavily to prevent code duplication in CPython's code, and trusted the "inline" keyword to avoid having thousands of lines in the same function (ceval.c's opcode switch).