Proposal:
sys.gettickeraccumulation.__doc__ 'getgettickeraccumulation() -> current count of bytecodes processed by the interpreter.'
Target environments: Python-based systems with (many) extension modules. Motivation: A fast, easy, and non-intrusive method for estimating how much time is spent in Python code and how much time in extension modules. E.g. if numarray is used it is often difficult to estimate how much time is spent in the numarray extension and how much in application-specific Python code. Is it worth investing time reimplementing the application-specific code in C or C++? If most of the time is already spent in the numarray extension the answer will be no. Method: Determination of "micro-seconds per tick:" time.time() / sys.gettickeraccumulation() * 1.e6 A "tick" is the processing of one "Python virtual instruction." See also "setcheckinterval" here: http://www.python.org/doc/current/lib/module-sys.html A pure Python program will spent the bulk of the time interpreting bytecode. This will lead to a small value for "micro-seconds per tick." In contrast, if almost all the time is spent in extension modules, a calculation may run a long time without increasing sys.gettickeraccumulation(). For example: time ticks time/tick in micro-seconds 0.00% Python 0.008 20 405.908 10.00% Python 0.072 444680 0.163 20.00% Python 0.138 889340 0.155 30.00% Python 0.203 1334000 0.152 ... 100.00% Python 0.659 4446620 0.148 With higher resolution: time ticks time/tick in micro-seconds 0.00% Python 6.758 20 337893.748 0.10% Python 0.728 444680 1.638 0.20% Python 0.795 889340 0.894 0.30% Python 0.837 1334000 0.627 0.40% Python 0.923 1778660 0.519 0.50% Python 0.978 2223320 0.440 0.60% Python 1.091 2667980 0.409 0.70% Python 1.111 3112640 0.357 0.80% Python 1.191 3557300 0.335 0.90% Python 1.236 4001960 0.309 1.00% Python 1.315 4446620 0.296 In real applications we observed times/tick of around 10 on the same platform, indicating that significant performance increases could only be achieved through extensive recoding in a compiled language. On the other hand if we see values smaller than 1 we know that significant performance increases are achievable with a reasonable effort. Implementation: A full patch based on Python-2.4b2 is attached. The essence is this additional line in Python/ceval.c: + _Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; Note that _Py_Ticker and _Py_CheckInterval exist already in the Python distribution. The impact of the additional code on the runtime performance of a pure Python program is minute. On Xeon/Linux the factor is smaller than 1.00005. The factor is even smaller if extension modules are used. The full patch, the patched files, and a complete Python distribution including the patched files can be found here: http://cci.lbl.gov/~rwgk/python/Python-2.4b2_ticker_patch http://cci.lbl.gov/~rwgk/python/Python-2.4b2_ticker_patched_files.tar.gz http://cci.lbl.gov/~rwgk/python/Python-2.4b2_ticker.tar.gz diff -u -r Python-2.4b2/Include/ceval.h Python-2.4b2_ticker/Include/ceval.h --- Python-2.4b2/Include/ceval.h 2004-10-10 19:40:35.000000000 -0700 +++ Python-2.4b2_ticker/Include/ceval.h 2004-11-12 18:16:28.000000000 -0800 @@ -68,6 +68,11 @@ /* this used to be handled on a per-thread basis - now just two globals */ PyAPI_DATA(volatile int) _Py_Ticker; +#ifndef HAVE_LONG_LONG +PyAPI_DATA(volatile long) _Py_TickerAccumulation; +#else +PyAPI_DATA(volatile PY_LONG_LONG) _Py_TickerAccumulation; +#endif PyAPI_DATA(int) _Py_CheckInterval; /* Interface for threads. diff -u -r Python-2.4b2/Objects/longobject.c Python-2.4b2_ticker/Objects/longobject.c --- Python-2.4b2/Objects/longobject.c 2004-09-19 23:14:54.000000000 -0700 +++ Python-2.4b2_ticker/Objects/longobject.c 2004-11-12 18:20:53.000000000 -0800 @@ -38,6 +38,7 @@ #define SIGCHECK(PyTryBlock) \ if (--_Py_Ticker < 0) { \ + _Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; \ _Py_Ticker = _Py_CheckInterval; \ if (PyErr_CheckSignals()) { PyTryBlock; } \ } diff -u -r Python-2.4b2/PC/os2emx/python24.def Python-2.4b2_ticker/PC/os2emx/python24.def --- Python-2.4b2/PC/os2emx/python24.def 2004-10-10 19:40:50.000000000 -0700 +++ Python-2.4b2_ticker/PC/os2emx/python24.def 2004-11-12 18:16:47.000000000 -0800 @@ -743,6 +743,7 @@ "_Py_CheckRecursionLimit" "_Py_CheckInterval" "_Py_Ticker" + "_Py_TickerAccumulation" ; From python24_s.lib(compile) "PyCode_New" diff -u -r Python-2.4b2/Python/ceval.c Python-2.4b2_ticker/Python/ceval.c --- Python-2.4b2/Python/ceval.c 2004-10-10 19:40:50.000000000 -0700 +++ Python-2.4b2_ticker/Python/ceval.c 2004-11-12 18:22:55.000000000 -0800 @@ -373,6 +373,7 @@ pendinglast = j; _Py_Ticker = 0; + _Py_TickerAccumulation = 0; things_to_do = 1; /* Signal main loop */ busy = 0; /* XXX End critical section */ @@ -476,6 +477,11 @@ per thread, now just a pair o' globals */ int _Py_CheckInterval = 100; volatile int _Py_Ticker = 100; +#ifndef HAVE_LONG_LONG +volatile long _Py_TickerAccumulation = 0; +#else +volatile PY_LONG_LONG _Py_TickerAccumulation = 0; +#endif PyObject * PyEval_EvalCode(PyCodeObject *co, PyObject *globals, PyObject *locals) @@ -776,6 +782,7 @@ a try: finally: block uninterruptable. */ goto fast_next_opcode; } + _Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; _Py_Ticker = _Py_CheckInterval; tstate->tick_counter++; #ifdef WITH_TSC diff -u -r Python-2.4b2/Python/sysmodule.c Python-2.4b2_ticker/Python/sysmodule.c --- Python-2.4b2/Python/sysmodule.c 2004-08-12 11:19:17.000000000 -0700 +++ Python-2.4b2_ticker/Python/sysmodule.c 2004-11-12 18:51:14.000000000 -0800 @@ -442,6 +442,20 @@ "getcheckinterval() -> current check interval; see setcheckinterval()." ); +static PyObject * +sys_gettickeraccumulation(PyObject *self, PyObject *args) +{ +#ifndef HAVE_LONG_LONG + return PyInt_FromLong(_Py_TickerAccumulation + _Py_CheckInterval - _Py_Ticker); +#else + return PyLong_FromLongLong(_Py_TickerAccumulation + _Py_CheckInterval - _Py_Ticker); +#endif +} + +PyDoc_STRVAR(gettickeraccumulation_doc, +"gettickeraccumulation() -> current count of bytecodes processed by the interpreter." +); + #ifdef WITH_TSC static PyObject * sys_settscdump(PyObject *self, PyObject *args) @@ -763,6 +777,8 @@ setcheckinterval_doc}, {"getcheckinterval", sys_getcheckinterval, METH_NOARGS, getcheckinterval_doc}, + {"gettickeraccumulation", sys_gettickeraccumulation, METH_NOARGS, + gettickeraccumulation_doc}, #ifdef HAVE_DLOPEN {"setdlopenflags", sys_setdlopenflags, METH_VARARGS, setdlopenflags_doc},
Ralf W. Grosse-Kunstleve wrote:
Proposal:
sys.gettickeraccumulation.__doc__ 'getgettickeraccumulation() -> current count of bytecodes processed by the interpreter.'
This doesn't offer enough to be worth rolling into the 2.4 code.
Anthony
--
Anthony Baxter
"Ralf W. Grosse-Kunstleve"
Proposal:
sys.gettickeraccumulation.__doc__ 'getgettickeraccumulation() -> current count of bytecodes processed by the interpreter.'
You know about the 'tick_counter' field of PyThreadState? I don't think it's accessible to Python currently, but it seems to be intended for similar purposes... Cheers, mwh -- Hiro dicks about with his computer, naturally. Being stranded on a life raft in the Pacific is a perfect venue for a hacker. -- Snow Crash, Neal Stephenson
Ralf W. Grosse-Kunstleve wrote:
A pure Python program will spent the bulk of the time interpreting bytecode.
Perhaps, perhaps not. A heck of a lot of the Python core isn't written in Python - time spent executing builtins, or running methods of builtin objects usually doesn't involve the main interpreter loop (we're generally in pure C-code at that point). I'm curious how the suggested feature can provide any information that is actually useful for optimisation purposes. Just because a low proportion of time is spent in Python code, doesn't mean the Python code isn't at fault for poor performance. As an example, in CPython 2.3 and earlier, this: result = "" for x in strings: result += x is a lot worse performance-wise than: result = "".join(strings) The first version does spend more time in Python code, but the performance killer is actually in the string concatenation C code. So the time is spent in the C code, but the fault lies in the Python code (In Python 2.4, the latter version is still faster, but the difference isn't as dramatic as it used to be). Knowing "I'm spending x% of the time executing Python code" just isn't really all that interesting, unless you know *what* Python code is taking all the time. (e.g., if decimal.py spends all its time creating new instances of Decimal when running a benchmark, then you know to either speed up Decimal construction, or to look for temporary objects that can be eliminated, or both. Then you run the benchmark again to determine if the changes actually improved the situation) I'd rather encourage people to write appropriate benchmark scripts and execute them using "python -m profile <benchmark>", rather than lead them up the garden path with a global "Python/non-Python" percentage estimation utility. Cheers, Nick. -- Nick Coghlan | Brisbane, Australia Email: ncoghlan@email.com | Mobile: +61 409 573 268
participants (4)
-
Anthony Baxter
-
Michael Hudson
-
Nick Coghlan
-
Ralf W. Grosse-Kunstleve