[Patches] [ python-Patches-709744 ] CALL_ATTR opcode

Tue, 22 Apr 2003 05:20:23 -0700

Patches item #709744, was opened at 2003-03-25 18:16
Message generated for change (Comment added) made by gvanrossum
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=709744&group_id=5470

Category: Core (C code)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Thomas Wouters (twouters)
Assigned to: Nobody/Anonymous (nobody)
Summary: CALL_ATTR opcode

Initial Comment:
The result of the PyCore sprint by Brett and me: the CALL_ATTR opcode (LOAD_ATTR and CALL_FUNCTION combined), which skips the PyMethod creation and destruction for classic classes (but not newstyle classes, yet).

The code is still somewhat rough; it needs commenting, some renaming, and most importantly testing. It seems to work, however, and provides a speedup of between 5% and 35% (5% in 'average' code, up to 35% in instance method calls and instance creation alone). It also needs to be updated to include newstyle classes. I will likely work on this on the flight home.
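
As a rough illustration of the temporary object the opcode is meant to skip (a sketch in present-day Python, where every class is new-style; the names X, spam and x are made up for the example): each attribute access on an instance builds a fresh bound-method wrapper around the plain function stored in the class, and that wrapper is discarded once the call returns.

```python
class X:
    def spam(self):
        return "spam"

x = X()

# Two attribute accesses build two separate bound-method objects.
m1 = x.spam
m2 = x.spam
distinct_wrappers = m1 is not m2

# Both wrappers point at the same plain function in the class dict.
func = X.__dict__["spam"]
same_function = m1.__func__ is func and m2.__func__ is func

# Calling the function directly with the instance as first argument
# gives the same result as going through the wrapper.
direct_result = func(x)
wrapped_result = x.spam()
```

Calling `func(x)` directly is, in spirit, what a combined load-and-call can do when it knows the attribute is a plain function.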

----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2003-04-22 08:20

Message:
Logged In: YES 
user_id=6380

OK, I did a more realistic benchmark: startup time of Zope3.
With more or less current CVS Python 2.3 (but not Raymond
H's bytecode optimizations), it took 3.52 seconds. With your
patch (and all .pyc files rebuilt) it took 3.47 seconds. 
That's about a percent and a half. (With Python 2.2 it took
4.08 seconds.)
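
For reference, the relative differences behind these timings can be worked out as below (a small sketch; the helper name percent_faster is made up here). The patch comes out at roughly 1.4 percent, and the unpatched 2.3 at a bit under 14 percent faster than 2.2.

```python
# Zope3 startup times quoted above, in seconds.
baseline_23 = 3.52   # more or less current CVS Python 2.3
patched_23 = 3.47    # same, with the CALL_ATTR patch applied
python_22 = 4.08     # Python 2.2

def percent_faster(old, new):
    """Relative reduction in run time, as a percentage of the old time."""
    return (old - new) / old * 100.0

patch_gain = percent_faster(baseline_23, patched_23)
gain_over_22 = percent_faster(python_22, baseline_23)
```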

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-04-21 18:48

Message:
Logged In: YES 
user_id=34209

Pystone is not likely to show much speedup, as it contains
exactly 2 instances of CALL_ATTR, only barely in the main
loop. However, it should not slow down because of CALL_ATTR
either; the two CALL_ATTRs are of the most optimized sort,
old-style instance methods, and none of the other code paths
have changed *at all* (in the fast-and-ugly mode of the
patch, which is the default.)

There are only two reasons I can think of that explain a
slower pystone: code cache and the switch statement layout.
This apparently does not influence my (somewhat
high-end-ish) test machines, but does yours (and others.)
Both are more or less outside my control. They might be
fixed by switch reordering or rearranging of the code so the
compiler optimizes it better, but that's very platform
specific and lacking a proper test-bed for that situation, I
can't do it.

Alternatively, there may be some funk with regard to
bytecode version. If bytecode wasn't properly regenerated, I
can imagine not seeing *any* speedup. Have you tried
something as simple as 

./python timeit.py -s 'class X:' -s '  def spam(self): pass' \
    -s 'x = X()' 'x.spam()'

? This gives a good 30% speedup on my home PC. Bytecode
problems wouldn't influence pystone though.
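
The same measurement can also be sketched through the timeit module's Python interface (an illustrative variant, not part of the patch; the repeat and number values are arbitrary, and on a current Python the class below is new-style rather than old-style):

```python
import timeit

# Setup mirrors the -s arguments of the command line above.
setup = """
class X:
    def spam(self):
        pass
x = X()
"""

timer = timeit.Timer("x.spam()", setup=setup)

# Take the best of three runs of 100000 calls each.
number = 100000
runs = timer.repeat(repeat=3, number=number)
best = min(runs)
usec_per_call = best / number * 1e6
```

Running this once on an unpatched and once on a patched interpreter and comparing usec_per_call gives the kind of figure quoted here.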

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-04-19 14:42

Message:
Logged In: YES 
user_id=6380

Alas, like others, I see a tiny slowdown on pystone (maybe 3
percent).

This is with the default version of patch version 4 (fixed).  

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-04-18 18:48

Message:
Logged In: YES 
user_id=34209

Oops, that patch contained a C++-ism, ceval.c:3504 and 3505
needed to be swapped. Uploaded a new version.

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-04-18 18:22

Message:
Logged In: YES 
user_id=34209

Alright, here is a re-worked patch, with a toggle to choose
between a blatant copy-paste and some refactoring; see below.

The patch works by creating a new opcode, CALL_ATTR, which
is used for all <expression>.<name>(<args>) occurrences. What
<expression> and <args> are is not important; they are
compiled separately.
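
To see which bytecode a given interpreter emits for such an expression, the dis module can be used. This is only a sketch: the opcode names differ between Python versions (an unpatched 2.3 uses LOAD_ATTR followed by CALL_FUNCTION), so the code below just collects whatever attribute-loading and call opcodes appear.

```python
import dis

# Compile an <expression>.<name>(<args>) call without running it.
code = compile("x.spam(1, 2)", "<example>", "eval")

opnames = [ins.opname for ins in dis.get_instructions(code)]

# The attribute load and the call are separate instructions here;
# the exact names depend on the interpreter version.
attr_ops = [name for name in opnames if name in ("LOAD_ATTR", "LOAD_METHOD")]
call_ops = [name for name in opnames if name.startswith("CALL")]
```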

The CALL_ATTR opcode implementation is optimized for two
special cases: one where <expression> resulted in an
(old-style) instance, and one where <expression> resulted in
an instance of a new-style type of which the tp_getattro is
PyObject_GenericGetAttr.

The PyInstance part is done by savagely short-cutting the
usual getattr dance for instances; if it sees anything but a
PyFunction, it will fall back to a slow path. The rationale
is that if X in 'X.spam(' is an old-style class, and that
expression is not going to raise an exception, it is very
rare for 'spam' to be anything but a PyFunction. Trying to
cope with all the idiosyncrasies would slow down the
common case too much.

The PyObject_GenericGetAttr version uses a slightly modified
version of PyObject_GenericGetAttr that, when finding a
descr of the desired name, doesn't call the 'descr_get'
method but returns a status flag. The caller (call_attr)
then decides based on the type of the descr whether to call
descr_get or not. It currently only special-cases
PyFunctions. PyCFunctions, PyStaticMethods and
PyClassMethods are tricky to special-case and/or need some of
the massaging that descr_get would do. I have not yet looked
at other callable descrs.
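
The idea can be modelled in pure Python roughly as follows. This is a hypothetical sketch, not the C code from the patch: lookup_unbound, call_attr, the status constants and the Greeter class are all invented for illustration, and the model only approximates the lookup order of PyObject_GenericGetAttr.

```python
import types

FOUND_NOTHING = 0
FOUND_FUNCTION = 1   # plain Python function: caller may skip binding
FOUND_OTHER = 2      # anything else: caller takes the normal path

def lookup_unbound(obj, name):
    """Return (status, value) without invoking the descriptor's __get__."""
    descr = None
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            descr = klass.__dict__[name]
            break
    # Data descriptors take priority over the instance dict.
    if descr is not None and (hasattr(type(descr), "__set__")
                              or hasattr(type(descr), "__delete__")):
        return FOUND_OTHER, descr
    inst_dict = getattr(obj, "__dict__", None)
    if inst_dict is not None and name in inst_dict:
        return FOUND_OTHER, inst_dict[name]
    if isinstance(descr, types.FunctionType):
        return FOUND_FUNCTION, descr
    if descr is not None:
        return FOUND_OTHER, descr
    return FOUND_NOTHING, None

def call_attr(obj, name, *args):
    status, value = lookup_unbound(obj, name)
    if status == FOUND_FUNCTION:
        # Fast path: pass the instance explicitly, no bound method built.
        return value(obj, *args)
    # Slow path: ordinary getattr (which repeats the lookup), then call.
    return getattr(obj, name)(*args)

class Greeter:
    def hello(self, who):
        return "hello " + who

    @staticmethod
    def shout(who):
        return who.upper()

g = Greeter()
fast = call_attr(g, "hello", "world")   # plain function: fast path
slow = call_attr(g, "shout", "hi")      # staticmethod: falls back
```

In this model a staticmethod is found but not special-cased, so the fallback performs the lookup a second time.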

I had initially rewritten PyObject_GenericGetAttr() to use
the modified version, but this appears to be a significant
performance hit in normal attribute retrieval (quite common,
on newstyle classes.) Likewise, Brett and I had refactored
the call_function part of call_attr and call_function into a
separate function, but that, too, was a big hit in the
common function-calling case. Unfortunately, not doing that
refactoring means a lot of copied code, so I included both
in the patch. It may be that the slow path can be optimized
by simplifying the refactored parts so that the compiler
understands how to inline them (e.g. the stackpointer
fudging call_function/call_callable does.) 

The default is the ugly-but-fast way; define
CALL_ATTR_SLOW_BUT_PRETTY_PATH to use the slow(er) path. The
slow(er) path is enough slower to nullify the benefit of the
patch in most of the benchmarks I ran; the fast path is only
slightly slower in some areas (probably due to cache
dynamics) but faster in every other situation, including
unexpected areas (that's not cache dynamics, of course,
that's just coder brilliance. :-)

However, finding a good benchmark is near impossible. I
added some newstyle-classes tests to PyBench, but even
normal tests were giving bizarrely negative results.
Checking those results with small scripts of timeit.py
showed entirely different results. And when pybench reported
a total 2% slowdown in the 'slow path' new code, it managed
to report that about 5% faster, consistently. timeit.py is
more consistent, and helped me determine the 'slow path' was
really slowing things down. Calling an empty method with no
arguments is about 20% faster for newstyle classes and about
30% for oldstyle classes, according to timeit.py.

Still no test for call_attr though.

I would love for people to test the code, both paths, and
give me input. I also welcome ideas on handling more
descr's, I may have missed a few unwritten rules about them.

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-04-17 20:42

Message:
Logged In: YES 
user_id=34209

Well, currently, the neutered getattr functions just bail
out and return NULL whenever they don't get what they
expect. I guess they go back to being 'full' getattr's, with
the exception that they 'return' two values: the retrieved
object, and a status indicator (to indicate found-method and
found-thing-but-not-method). Maybe the real getattr functions
should be implemented in terms of the neutered version then,
that would at least solve some maintenance issues :-)

But time enough for that tomorrow or this weekend (if the
weather doesn't keep being so terribly sunny and warm.)

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-04-17 20:29

Message:
Logged In: YES 
user_id=6380

I would think that you should avoid the double lookups
somehow...  How difficult is that?

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-04-17 19:41

Message:
Logged In: YES 
user_id=34209

Revised patch that includes avoiding the wrapping of
(Python) methods on newstyle classes as well as oldstyle
classes. The patch checks to see if a particular type uses
PyObject_GenericGetAttr, and if so, uses a near-copy of that
function to get an unwrapped PyFunction object.
PyCFunctionObject objects are not magically treated, and
fall back to the slow path... PyCFunction's descr's don't
have direct access to an 'unwrapped' version, they create
PyCFunctionObjects on demand based on a PyCFunction -- a
function pointer.

Some simple testing (using timeit.py) suggests about a 20%
increase in speed for 'x.foo()' where 'x' is a newstyle
class instance. However, a real-world-ish benchmark (very
hard to find, for newstyle classes) suggests otherwise:
running 'timeit.py' on "pass" consistently takes about 3%
longer.

I'm certain the problem lies in the fact that the patch
doesn't consider PyCFunctions, which forces part of the slow
MRO lookup and instance-dict checking to be re-done for
C-functions on newstyle types (of which there are a heck of
a lot.) Either handling PyMethodDescrs the same way as
PyFunctionObjects, or circumventing the slow path in some
way (by returning non-method-but-found objects) will
probably fix that.

----------------------------------------------------------------------

Comment By: Thomas Wouters (twouters)
Date: 2003-03-25 18:18

Message:
Logged In: YES 
user_id=34209

attaching patch.

----------------------------------------------------------------------
