Brett C. wrote:
> Ralf W. Grosse-Kunstleve wrote:
> > Brett C. wrote:
> >
> > >I have been mulling over this proposal and I think I am finally
> > >settling on +0 as long as it can be shown that it does not hurt
> > >performance at all (yes, it shouldn't but this is the eval loop we are
> > >talking about and that thing is a touchy beast).
> >
> >
> > Is there a standard benchmark I could use to demonstrate the
> > impact of the two new additions on the runtime?
> > Thanks!
>
> =) Parrotbench and PyBench would be enough for me to sign off on any
> performance hit. There is also pystone.
I have done my homework now, based on parrotbench. Timings are below.
The most important conclusion from my viewpoint is that my original
goal can be achieved with a small patch and with a runtime penalty that
is in the noise. However, mysterious runtime behavior of
Objects/longobject.c lead me to change my original patch, and to the
unanswered question "could small rearrangements in longobject.c
potentially boost performance by 45% on some platforms?"
Please don't be put off by the size of this message. My patch
is essentially just one line in ceval.c (_Py_TickerAccumulation++)!
To restate my original goal:
I am looking for a simple way to answer the question: How much of a
speedup can I expect if I reimplement a piece of Python code in C or
C++?
Note that I am not asking how much time is spent in Python including
all the extensions that come with it, but how much time is spent in the
Python bytecode interpreter loop. To clarify, look at the timings for
two tests in parrotbench:
b1 5.8287 time 30049474 ticks 0.1940 us/tick
b2 1.5944 time 482584 ticks 3.3039 us/tick
Each of the tests was run with range(10) in b.py to increase accuracy.
The first column shows time.time() differences, the second the number
of bytecodes processed in ceval.c, and the last column shows
micro-seconds per tick. Why is the us/tick value for b2 17 times larger
than that for b1?
The answer is that b2 makes heavy use of long integers, and that a lot
of time is spent in Objects/longobject.c doing long-integer arithmetic.
Compared to b1, relatively little time is spent interpreting bytecodes.
Back to the original question: how much sense does it make to
reimplement b1 or b2 in C? Simply looking at the us/tick values is
telling me that I can expect much more of a performance boost by
reimplementing b1 rather than b2. This is because b2 spends a lot of
time in C already. A similar situation arises if extensions like
Numeric are used. Computationally intensive operations like matrix
multiplication and matrix inversion are already done at C speed.
The us/tick let us quickly estimate how disappointed we would be after
moving Python code to C or C++. We get this estimate without having to
do detailed source code analysis, and before we have wasted our time
doing the recoding (we have been through this a couple of times ...).
The timings below show that the patched Python which counts the number
of interpreted bytecodes is in the worst case 0.6% slower than the
original Python, but some tests even run faster at face value:
$gcc296_compiled/original/python -O b_timings.py | grep reporter
b0 16.8831 time
b1 5.9553 time
b2 1.5914 time
b3 10.8149 time
b4 5.2528 time
b5 11.6437 time
b6 11.0248 time
b7 27.6960 time
all 90.9276 time
$gcc296_compiled/work/python -O b_timings.py | grep reporter
b0 16.9193 time 50979585 ticks 0.3319 us/tick
b1 5.8287 time 30049474 ticks 0.1940 us/tick
b2 1.5944 time 482584 ticks 3.3039 us/tick
b3 10.7931 time 43577404 ticks 0.2477 us/tick
b4 5.2865 time 18463044 ticks 0.2863 us/tick
b5 11.6086 time 17065750 ticks 0.6802 us/tick
b6 10.0992 time 60000464 ticks 0.1683 us/tick
b7 27.6830 time 1928644 ticks 14.3536 us/tick
all 89.8760 time 222548399 ticks 0.4038 us/tick
The b7 test is the same as b2 but with xrange(10000) instead of
xrange(1000), and with all print statements removed.
Ratios (rounded to 3 decimals):
16.9193/16.8831=1.002
5.8287/5.9553 =0.979
1.5944/1.5914 =1.002
10.7931/10.8149=0.998
5.2865/5.2528 =1.006
11.6086/11.6437=0.997
10.0992/11.0248=0.916
27.6830/27.6960=1.000
Therefore I'd argue that the runtime penalty for the one additional
long long increment in ceval.c (_Py_TickerAccumulation++) is in the
noise.
The timings were collected on a 2.8GHz Dual Xeon, Redhat WS 3.
Python was compiled under Red Hat 7.3 with gcc 2.96. See below why.
My latest patch can be found at the end of this posting.
It can be applied to the Python-2.4 distribution like this:
tar zxf Python-2.4.tgz
patch --strip=1 -d Python-2.4 < patch_file
------------------------------------------------------------------------------
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Now to the mysterious runtime behavior of Objects/longobject.c
In the original Python 2.4 sources _Py_Ticker is decremented in
two places:
In the interpreter loop in Python/ceval.c:
if (--_Py_Ticker < 0) {
In Objects/longobject.c:
#define SIGCHECK(PyTryBlock) \
if (--_Py_Ticker < 0) { \
_Py_Ticker = _Py_CheckInterval; \
if (PyErr_CheckSignals()) { PyTryBlock; } \
}
This macro is used in four places.
My initial patch was to insert updates of a new, global
_Py_TickerAccumulation variable in these two places:
Python/ceval.c:
a try: finally: block uninterruptable. */
goto fast_next_opcode;
}
+ _Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker;
_Py_Ticker = _Py_CheckInterval;
tstate->tick_counter++;
Objects/longobject.c:
#define SIGCHECK(PyTryBlock) \
if (--_Py_Ticker < 0) { \
+ _Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; \
_Py_Ticker = _Py_CheckInterval; \
if (PyErr_CheckSignals()) { PyTryBlock; } \
}
This lead to the timings below. All timings were collected on the
same machine, but the original and the patched Pythons were each
compiled with two different compilers, gcc 2.96 and gcc 3.2.3. The
important timings to look at are those for b2 and b7 again.
$gcc296_compiled/original/python -O b_timings.py | grep reporter
b0 16.8831 time
b1 5.9553 time
b2 1.5914 time
b3 10.8149 time
b4 5.2528 time
b5 11.6437 time
b6 11.0248 time
b7 27.6960 time
all 90.9276 time
$gcc323_compiled/original/python -O b_timings.py | grep reporter
b0 18.6390 time
b1 6.4681 time
b2 2.2588 time
b3 11.0215 time
b4 5.6490 time
b5 12.3022 time
b6 10.3815 time
b7 40.2735 time
all 107.0636 time
This shows that the gcc 2.96 optimizer is generally a little bit
better than the gcc 3.2.3 optimizer, but not by very much. Except
for the b2 and b7 tests. E.g. 40.2735/27.6960 = 1.454!
Now with the patches in both ceval.c and longobject.c:
$gcc296_compiled/ticker/python -O b_timings.py | grep reporter
b0 17.1190 time 60792625 ticks 0.2816 us/tick
b1 6.1171 time 30049474 ticks 0.2036 us/tick
b2 1.8764 time 705754 ticks 2.6587 us/tick
b3 10.7070 time 43577404 ticks 0.2457 us/tick
b4 5.2677 time 22944694 ticks 0.2296 us/tick
b5 11.7448 time 17433190 ticks 0.6737 us/tick
b6 10.9603 time 60000504 ticks 0.1827 us/tick
b7 33.2320 time 2837124 ticks 11.7133 us/tick
all 97.0893 time 238342235 ticks 0.4074 us/tick
$gcc323_compiled/ticker/python -O b_timings.py | grep reporter
b0 18.3115 time 60792625 ticks 0.3012 us/tick
b1 6.1193 time 30049474 ticks 0.2036 us/tick
b2 2.2522 time 705754 ticks 3.1912 us/tick
b3 11.2202 time 43577404 ticks 0.2575 us/tick
b4 5.4596 time 22944694 ticks 0.2379 us/tick
b5 11.8197 time 17433190 ticks 0.6780 us/tick
b6 10.7407 time 60000504 ticks 0.1790 us/tick
b7 40.2397 time 2837124 ticks 14.1833 us/tick
all 106.2283 time 238342235 ticks 0.4457 us/tick
The gcc 3.2.3 timings for b7 are hardly affected by my patch in
longobject.c, but the gcc 2.96 timing shoots up from 27.6960 to
33.2320. Still not as bad as the gcc 3.2.3 timing, but why?
Interestingly, if I declare _Py_TickerAccumulation as long instead
of PY_LONG_LONG, the gcc 2.96 timing is also hardly affected by
my patch in longobject.c.
Even more interestingly, if I change longobject.c like this...
#define SIGCHECK(PyTryBlock) \
{ \
if (PyErr_CheckSignals()) { PyTryBlock; } \
}
the runtime is also hardly affected (times not shown in this posting)
even though PyErr_CheckSignals() is executed 100 times more often.
Curious to learn how other compilers deal with longobject.c
I collected timings with this compiler:
Compaq C V6.3-028 on Compaq Tru64 UNIX V5.1 (Rev. 732)
$compaq_compiled/original/python -O b_timings.py | grep reporter
b1 19.4492 time
b2 4.9395 time
b3 23.6787 time
b5 28.2568 time
b6 21.6230 time
b7 86.5283 time
all 192.0498 time
$compaq_compiled/ticker/python -O b_timings.py | grep reporter
b1 19.4775 time 30049474 ticks 0.6482 us/tick
b2 4.9971 time 704784 ticks 7.0902 us/tick
b3 24.2559 time 43577404 ticks 0.5566 us/tick
b5 26.3320 time 17420746 ticks 1.5115 us/tick
b6 21.7227 time 60000464 ticks 0.3620 us/tick
b7 86.9395 time 2836154 ticks 30.6540 us/tick
all 190.8936 time 165555937 ticks 1.1530 us/tick
The b0 and b4 timings are missing because of exceptions that I didn't
investigate. As with the gcc 3.2.3 compilation, my longobject.c
patch hardly makes a difference.
What does all this mean? Does the gcc 2.96 surprise tell us that small
rearrangements in longobject.c could potentially boost performance by
45% on some platforms?
Since I am not interested in optimizing longobject.c this is where I
stopped. Eventually I decided that the problematic patch in
longobject.c is not helping me with my original goal as stated near the
beginning of this posting. I am only interested in counting the
iterations of the interpreter loop. However, the _Py_Ticker decrements
in longobject.c are not inside the interpreter loop, but in C loops!
This means _Py_Ticker is useless for my purposes. Therefore I decoupled
_Py_Ticker and _Py_TickerAccumulation.
BTW: I think strictly speaking this documentation is incorrect because
of the _Py_Ticker manipulations in longobject.c:
sys.setcheckinterval.__doc__:
Tell the Python interpreter to check for asynchronous events every
n instructions.
I feel very dizzy now after dealing with all the timings and the nasty
details. A lot of fallout caused by the simple idea to add one innocent
line to ceval.c. I hope it makes sense to somebody!
Cheers,
Ralf
------------------------------------------------------------------------------
------------------------------------------------------------------------------
------------------------------------------------------------------------------
diff -r -u original/Include/ceval.h work/Include/ceval.h
--- original/Include/ceval.h 2004-10-10 19:40:35.000000000 -0700
+++ work/Include/ceval.h 2004-11-30 23:27:11.000000000 -0800
@@ -68,6 +68,11 @@
/* this used to be handled on a per-thread basis - now just two globals */
PyAPI_DATA(volatile int) _Py_Ticker;
+#ifndef HAVE_LONG_LONG
+PyAPI_DATA(volatile long) _Py_TickerAccumulation;
+#else
+PyAPI_DATA(volatile PY_LONG_LONG) _Py_TickerAccumulation;
+#endif
PyAPI_DATA(int) _Py_CheckInterval;
/* Interface for threads.
diff -r -u original/PC/os2emx/python24.def work/PC/os2emx/python24.def
--- original/PC/os2emx/python24.def 2004-10-10 19:40:50.000000000 -0700
+++ work/PC/os2emx/python24.def 2004-11-30 23:27:11.000000000 -0800
@@ -743,6 +743,7 @@
"_Py_CheckRecursionLimit"
"_Py_CheckInterval"
"_Py_Ticker"
+ "_Py_TickerAccumulation"
; From python24_s.lib(compile)
"PyCode_New"
diff -r -u original/Python/ceval.c work/Python/ceval.c
--- original/Python/ceval.c 2004-11-23 10:06:08.000000000 -0800
+++ work/Python/ceval.c 2004-12-03 19:39:36.000000000 -0800
@@ -373,6 +373,7 @@
pendinglast = j;
_Py_Ticker = 0;
+ _Py_TickerAccumulation = 0;
things_to_do = 1; /* Signal main loop */
busy = 0;
/* XXX End critical section */
@@ -476,6 +477,11 @@
per thread, now just a pair o' globals */
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 100;
+#ifndef HAVE_LONG_LONG
+volatile long _Py_TickerAccumulation = 0;
+#else
+volatile PY_LONG_LONG _Py_TickerAccumulation = 0;
+#endif
PyObject *
PyEval_EvalCode(PyCodeObject *co, PyObject *globals, PyObject *locals)
@@ -770,6 +776,7 @@
async I/O handler); see Py_AddPendingCall() and
Py_MakePendingCalls() above. */
+ _Py_TickerAccumulation++;
if (--_Py_Ticker < 0) {
if (*next_instr == SETUP_FINALLY) {
/* Make the last opcode before
diff -r -u original/Python/sysmodule.c work/Python/sysmodule.c
--- original/Python/sysmodule.c 2004-08-12 11:19:17.000000000 -0700
+++ work/Python/sysmodule.c 2004-12-03 19:36:52.000000000 -0800
@@ -442,6 +442,20 @@
"getcheckinterval() -> current check interval; see setcheckinterval()."
);
+static PyObject *
+sys_gettickeraccumulation(PyObject *self, PyObject *args)
+{
+#ifndef HAVE_LONG_LONG
+ return PyInt_FromLong(_Py_TickerAccumulation);
+#else
+ return PyLong_FromLongLong(_Py_TickerAccumulation);
+#endif
+}
+
+PyDoc_STRVAR(gettickeraccumulation_doc,
+"gettickeraccumulation() -> current count of bytecodes processed by the interpreter."
+);
+
#ifdef WITH_TSC
static PyObject *
sys_settscdump(PyObject *self, PyObject *args)
@@ -763,6 +777,8 @@
setcheckinterval_doc},
{"getcheckinterval", sys_getcheckinterval, METH_NOARGS,
getcheckinterval_doc},
+ {"gettickeraccumulation", sys_gettickeraccumulation, METH_NOARGS,
+ gettickeraccumulation_doc},
#ifdef HAVE_DLOPEN
{"setdlopenflags", sys_setdlopenflags, METH_VARARGS,
setdlopenflags_doc},