
Hi, I have attached a patch at: https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1739789&group_id=5470

A common optimization tip for Python code is to use locals rather than globals, which converts dictionary lookups of (interned) strings into tuple indexing. I have created a patch that achieves this speed benefit "automatically" for all globals and builtins, by adding a feature to dictobjects. Additionally, the patch lays down the infrastructure needed to also accelerate virtually all attribute accesses in the same way (with some extra work, of course).

I have suggested this before, but I got the impression that the spirit of the replies was "talk is cheap, show us the code/benchmarks". So I wrote some code. Getting the changes to work was not easy, and required learning about the nuances of dictobjects, their re-entrancy issues, etc. These changes do slow dictobjects down, but it seems that this slowdown is more than offset by the speed increase of builtins/globals access.

Benchmarks: a set of benchmarks that repeatedly perform

A. global reads
B. global writes
C. builtin reads

with little overhead (just repeatedly issuing global/builtin access bytecodes, many times per loop iteration, to minimize the loop overhead) shows a 30% decrease in time (~42% speed increase).

The regression tests take ~62 seconds (user+sys time) with the Python 2.6 trunk and ~65 seconds with the patch, i.e. about 4.5% slower. (The regression tests exercise a lot more code than typical programs do, spending more time instantiating function objects and less time executing them, so they are probably not a good benchmark.) pystone seems to improve by about 5%.

My conclusions: the LOAD_GLOBAL/STORE_GLOBAL opcodes are considerably faster. Dict accesses, or perhaps the extra activity around them in general, seem to be insignificantly slower, or at least the cost cancels out against the speed benefits in the regression tests.

The next step I am going to try is to replace the PyObject_GetAttr call with code that:

* calls PyObject_GetAttr only if GenericGetAttr is not the object's handler, so that types which override attribute access keep their behaviour;
* otherwise, remembers for each attribute-accessing opcode the last type from which the attribute was accessed. A single pointer comparison can then check whether the access is on the same type; if it is, the opcode can use a stored exported key from the type dictionary [or from an mro cache dictionary for that type, if that is added], rather than a dict lookup.

If this yields the same speed benefit, it could make attribute-access opcodes up to 42% faster as well when they keep seeing the same types (which is probably the common case, particularly in inner loops). Combined with __slots__, this would eliminate dict lookups for most instance-side accesses as well.

P.S.: I discovered a lot of code duplication (and "went along", duplicating my own code in the same spirit), but I was wondering whether a patch that used the C preprocessor heavily to prevent code duplication in CPython, trusting the "inline" keyword to avoid thousands of lines in the same function (ceval.c's opcode switch), would be accepted.
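[Editorial illustration, not part of the patch or the original mail: the manual form of the locals-vs-globals trick referred to above binds a builtin to a local name, so the inner loop does a fast indexed local access instead of a dict lookup in globals/builtins; the function and names below are made up for the example.]

    # Manual locals-vs-globals trick (illustrative only; the patch aims to
    # make this unnecessary).  Binding the builtin `len` to a local name
    # turns each lookup from LOAD_GLOBAL (a dict lookup in globals, then
    # builtins) into LOAD_FAST (an indexed access into the frame's locals).
    def total_length(items, _len=len):    # _len is bound once, at definition time
        total = 0
        for item in items:
            total += _len(item)            # fast local access, no dict lookup
        return total

    print(total_length(["a", "bb", "ccc"]))   # 6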
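[Editorial illustration: a hypothetical Python-level sketch of the "remember the last type" cache described in the second-to-last paragraph above. It is a simplification I am assuming for clarity: it caches the looked-up value rather than an exported dict key, and it ignores instance dicts and descriptors; the actual work would happen at the C level in ceval.c.]

    # One cache per attribute-accessing opcode/call site; a single identity
    # comparison decides whether the previous lookup result can be reused.
    class AttributeSiteCache(object):
        def __init__(self, name):
            self.name = name
            self.last_type = None        # last type seen at this site
            self.cached = None           # class attribute found for that type

        def load(self, obj):
            tp = type(obj)
            if tp is not self.last_type:               # one pointer comparison
                self.cached = getattr(tp, self.name)   # slow path: walk the mro
                self.last_type = tp
            return self.cached

    class Greeter(object):
        def hello(self):
            return "hello"

    site = AttributeSiteCache('hello')
    g = Greeter()
    print(site.load(g)(g))   # first call fills the cache; repeats reuse it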

On 19/06/2007 17.13, Eyal Lotem wrote:
How does it compare with this patch: https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1616125&group_id=5470 ? -- Giovanni Bajo

I haven't compared benchmarks, but I strongly suspect that my patch is not as fast for "real" programs: in real programs, globals/builtins are almost exclusively read, and almost never written to. His patch accelerates global reads by as much as mine does, without making function object creation as expensive, and the overhead it adds to dicts is probably smaller. My patch also accelerates writes, but, as I said, that will normally go nearly unnoticed.

My patch is not the end, but a means to an end. If the purpose is only to accelerate globals/builtins access, then the patch you linked to is simpler and better. The purpose I aim for, however, is to later use the same technique to also accelerate access to the dicts in the type, or even in the instance, by specializing them in function objects. That would get rid of almost all attribute lookups in dicts. Combined with the use of __slots__ in all classes, no dict lookups would be needed for attributes at all, except in "getattr" calls. Combined with an mro cache, this should put Python very close to C in terms of attribute access speed (a simple direct access).

Eyal
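[Editorial illustration, not taken from either patch: a minimal sketch of the __slots__ point above. With __slots__, instances have no per-instance dict, so instance attribute access needs no dict lookup at all.]

    class Point(object):
        __slots__ = ('x', 'y')    # instances get no __dict__, only these fixed slots

        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1, 2)
    print(p.x)     # slot access: a direct offset read, no dict lookup
    # p.z = 3 would raise AttributeError, since there is no __dict__ to fall back on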
