Re: [Python-Dev] speeding up PyObject_GetItem
Nick Coghlan wrote:
Many of the routines in abstract.c check their parameters for NULL, as a sanity check, and raise SystemError if a NULL is detected. I'm tempted to suggest performing these checks only when Py_DEBUG is defined, but I suspect that if you wanted it that way, you would have changed it already. ;)
Assuming you really want the NULL checks in production Python, there are 5 checks for NULL even though there are only two parameters. Seems like overkill?
The main problem is that many of these methods are not only used internally, but are *also* part of the public C API made available to extension modules. We want misuse of the latter to trigger exceptions, not segfault the interpreter.
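For readers without abstract.c at hand, the pattern in question looks roughly like this (a simplified sketch, not the literal CPython source; checked_getitem is an illustrative name):

    /* Sketch of the defensive pattern: a NULL argument from a
       misbehaving extension module raises SystemError instead of
       dereferencing NULL and segfaulting the interpreter. */
    #include <Python.h>

    static PyObject *
    checked_getitem(PyObject *o, PyObject *key)  /* hypothetical name */
    {
        if (o == NULL || key == NULL) {
            PyErr_SetString(PyExc_SystemError,
                            "null argument to internal routine");
            return NULL;
        }
        return PyObject_GetItem(o, key);  /* defer to the real API */
    }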
Agreed, and more importantly, I have yet to be convinced that those NULL checks introduce a measurable slowdown. Daniel, have you tried measuring the performance difference with only the NULL checks removed?
Hrvoje Niksic <hrvoje.niksic <at> avl.com> writes:
Agreed, and more importantly, I have yet to be convinced that those NULL checks introduce a measurable slowdown. Daniel, have you tried measuring the performance difference with only the NULL checks removed?
I think it would be fine to add a special case for lists (*) and dicts in PyObject_GetItem(), provided it does make a significant difference in performance for these two types.

(*) only for integers, not for slices

Regards

Antoine.
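A minimal sketch of what the list half of that special case might look like at the top of PyObject_GetItem (hypothetical code, not a tested patch; the dict case would be analogous, and out-of-range indices fall through so the generic path can raise the usual IndexError):

    /* Hypothetical fast path for exact lists indexed by exact ints. */
    if (PyList_CheckExact(o) && PyLong_CheckExact(key)) {
        Py_ssize_t i = PyLong_AsSsize_t(key);
        if (i == -1 && PyErr_Occurred())
            return NULL;                 /* e.g. the index overflowed */
        if (i < 0)
            i += PyList_GET_SIZE(o);
        if (i >= 0 && i < PyList_GET_SIZE(o)) {
            PyObject *v = PyList_GET_ITEM(o, i);
            Py_INCREF(v);
            return v;
        }
        /* out of range: fall through to the generic path below */
    }
    /* ... generic lookup via tp_as_mapping / tp_as_sequence ... */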
Agreed, and more importantly, I have yet to be convinced that those NULL checks introduce a measurable slowdown. Daniel, have you tried measuring the performance difference with only the NULL checks removed?
I think it highly unlikely that there is a performance difference. These tend to branch the same way every time, so the processor's branch prediction will tend to reduce the check time to near zero.
I think it would be fine to add a special case for lists (*) and dicts in PyObject_GetItem(), provided it does make a significant difference in performance for these two types.
-1

The API confusion and clutter isn't worth the micro-optimization. Also, the checks probably do have some value in early detection of programming errors; it would be a shame to lose them in non-debug builds. When we get bug reports that are due to problems with third-party extensions, it will be harder to know whether the issue is with the extension or with us.

Raymond
Raymond Hettinger <python <at> rcn.com> writes:
-1
The API confusion and clutter isn't worth the micro-optimization.
The API wouldn't change; there would just be a fast path for long-indexing of lists, exactly as there is today in 2.x's BINARY_SUBSCR implementation.
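For reference, the 2.x fast path in ceval.c's BINARY_SUBSCR looks roughly like this (paraphrased, not the literal source):

    case BINARY_SUBSCR:
        w = POP();
        v = TOP();
        if (PyList_CheckExact(v) && PyInt_CheckExact(w)) {
            /* INLINE: list[int] */
            Py_ssize_t i = PyInt_AsSsize_t(w);
            if (i < 0)
                i += PyList_GET_SIZE(v);
            if (i >= 0 && i < PyList_GET_SIZE(v)) {
                x = PyList_GET_ITEM(v, i);
                Py_INCREF(x);
            }
            else
                goto slow_get;
        }
        else
          slow_get:
            x = PyObject_GetItem(v, w);  /* the general, checked path */
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x != NULL) continue;
        break;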
On Tue, Mar 24, 2009 at 4:30 AM, Hrvoje Niksic <hrvoje.niksic@avl.com> wrote:
Agreed, and more importantly, I have yet to be convinced that those NULL checks introduce a measurable slowdown. Daniel, have you tried measuring the performance difference with only the NULL checks removed?
I'll play around with different permutations and report back on their impact.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
On Tue, Mar 24, 2009 at 7:30 AM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:
On Tue, Mar 24, 2009 at 4:30 AM, Hrvoje Niksic <hrvoje.niksic@avl.com> wrote:
Agreed, and more importantly, I have yet to be convinced that those NULL checks introduce a measurable slowdown. Daniel, have you tried measuring the performance difference with only the NULL checks removed?
I'll play around with different permutations and report back on their impact.
After playing around with it, I see that I was barking up the wrong tree. That's what I get for tracing through code visually instead of using a debugger. PyList implements tp_as_mapping, so its mp_subscript method is called and the PySequence line is never reached.

I tried many variations of special-casing at various points in the path and removing extra checks. It looks like function calls are the biggest expense. Calls to functions within the same module appear to be cheap (presumably because gcc inlines them).

100 nanoseconds, py3k trunk:
ceval -> PyObject_GetItem (abstract.c) -> list_subscript (listobject.c) -> PyNumber_AsSsize_t (abstract.c) -> PyLong_AsSsize_t (longobject.c)

86 nanoseconds, by special-casing PyLong in list_subscript:
ceval -> PyObject_GetItem (abstract.c) -> list_subscript (listobject.c) -> PyLong_AsSsize_t (longobject.c)
cost: could slow down mylist[some_PyNumber_that_is_not_a_long_yet_still_a_valid_index] (who cares?)

75 nanoseconds, by special-casing PyList and PyLong in PyObject_GetItem and using PyList_GET_ITEM:
ceval -> PyObject_GetItem (abstract.c) -> PyLong_AsSsize_t (longobject.c)
cost: could slow down not_a_list[x]

75 nanoseconds, by special-casing PyList and PyLong in ceval and using PyList_GET_ITEM:
ceval -> PyLong_AsSsize_t (longobject.c)
cost: could slow down not_a_list[x]

I had been hoping to find something general to speed up all uses of __getitem__, but that doesn't seem possible. Oh well. :-(

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
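For concreteness, the 86-nanosecond variant amounts to something like this at the top of list_subscript (a sketch of the idea, not the exact patch that was timed; list_item is the existing bounds-checked fetch in listobject.c):

    static PyObject *
    list_subscript(PyListObject *self, PyObject *item)
    {
        if (PyLong_CheckExact(item)) {
            /* Skip the PyNumber_AsSsize_t indirection entirely. */
            Py_ssize_t i = PyLong_AsSsize_t(item);
            if (i == -1 && PyErr_Occurred())
                return NULL;
            if (i < 0)
                i += PyList_GET_SIZE(self);
            return list_item(self, i);  /* raises IndexError if out of range */
        }
        /* ... existing PyIndex_Check / PyNumber_AsSsize_t and slice
           handling continues here unchanged ... */
    }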
2009/3/24 Daniel Stutzbach <daniel@stutzbachenterprises.com>:
[...]
100 nanoseconds, py3k trunk:
ceval -> PyObject_GetItem (abstract.c) -> list_subscript (listobject.c) -> PyNumber_AsSsize_t (abstract.c) -> PyLong_AsSsize_t (longobject.c)
[more timings snipped]
Does removing the PyLong_Check call in PyLong_AsSsize_t make any noticeable difference to these timings?

Mark
On Tue, Mar 24, 2009 at 10:13 AM, Mark Dickinson <dickinsm@gmail.com> wrote:
2009/3/24 Daniel Stutzbach <daniel@stutzbachenterprises.com>:
[...]
100 nanoseconds, py3k trunk:
ceval -> PyObject_GetItem (abstract.c) -> list_subscript (listobject.c) -> PyNumber_AsSsize_t (abstract.c) -> PyLong_AsSsize_t (longobject.c)
[more timings snipped]
Does removing the PyLong_Check call in PyLong_AsSsize_t make any noticeable difference to these timings?
Making no other changes from the trunk, removing the PyLong_Check and NULL check from PyLong_AsSsize_t shaves off 4 nanoseconds (or around 4% since the trunk is around 100 nanoseconds).

Here's what I'm testing with, by the way:

./python.exe Lib/timeit.py -r 10 -s 'x = list(range(10))' 'x[5]'

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
On Tue, Mar 24, 2009 at 3:50 PM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:
On Tue, Mar 24, 2009 at 10:13 AM, Mark Dickinson <dickinsm@gmail.com> wrote:
Does removing the PyLong_Check call in PyLong_AsSsize_t make any noticeable difference to these timings?
Making no other changes from the trunk, removing the PyLong_Check and NULL check from PyLong_AsSsize_t shaves off 4 nanoseconds (or around 4% since the trunk is around 100 nanoseconds).
Thanks. I'd call that a noticeable difference. I'd be +1 on changing this particular check to an assert, and so disabling it in non-debug builds. I'd like to bet that the majority of calls to PyLong_AsSsize_t are internal.

Mark
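Concretely, the suggestion would look something like this (a sketch; the function's conversion logic itself is unchanged):

    Py_ssize_t
    PyLong_AsSsize_t(PyObject *vv)
    {
        /* The sanity check compiles away in non-debug builds, where
           callers are trusted to pass a genuine long. */
        assert(vv != NULL && PyLong_Check(vv));
        /* ... existing conversion logic follows unchanged ... */
    }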
Mark Dickinson <dickinsm <at> gmail.com> writes:
Making no other changes from the trunk, removing the PyLong_Check and NULL check from PyLong_AsSsize_t shaves off 4 nanoseconds (or around 4% since the trunk is around 100 nanoseconds).
Thanks. I'd call that a noticeable difference.
4% on a micro-micro-benchmark is hardly compelling...

cheers

Antoine.
4% on a micro-micro-benchmark is hardly compelling...
I concur! This is utterly insignificant and certainly does not warrant removing the checks.

-1 on this sort of fake optimization. We should focus on algorithmic improvements, eliminating redundant work, and whatnot. Removing checks that were put there for a reason doesn't seem useful at all.

Raymond
On 24-Mar-09, at 3:15 PM, Raymond Hettinger wrote:
4% on a micro-micro-benchmark is hardly compelling...
I concur! This is utterly insignificant and certainly does not warrant removing the checks.
-1 on this sort of fake optimization. We should focus on algorithmic improvements, eliminating redundant work, and whatnot. Removing checks that were put there for a reason doesn't seem useful at all.
To be fair, the main proposed optimization(s) would speed up the microbenchmark by 15-25% (Daniel already stated that the NULL checks didn't have a significant impact). This seems significant, considering that tight loops whose cost is heavily due to array access are common.

-Mike
4% on a micro-micro-benchmark is hardly compelling...
I concur! This is utterly insignificant and certainly does not warrant removing the checks.
-1 on this sort of fake optimization. We should focus on algorithmic improvements, eliminating redundant work, and whatnot. Removing checks that were put there for a reason doesn't seem useful at all.
To be fair, the main proposed optimization(s) would speed up the microbenchmark by 15-25% (Daniel already stated that the NULL checks didn't have a significant impact). This seems significant, considering that tight loops whose cost is heavily due to array access are common.
I thought people used PyList_GET_ITEM or something similar in those situations.

Raymond
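The pattern Raymond alludes to, for illustration (seq is assumed here to be a PyObject * already known to hold a list):

    /* PyList_GET_SIZE and PyList_GET_ITEM are unchecked macros, so
       the per-item dispatch cost disappears inside the loop. */
    Py_ssize_t i, n = PyList_GET_SIZE(seq);
    for (i = 0; i < n; i++) {
        PyObject *item = PyList_GET_ITEM(seq, i);  /* borrowed reference */
        /* ... work with item ... */
    }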
2009/3/24 Daniel Stutzbach <daniel@stutzbachenterprises.com>:
On Tue, Mar 24, 2009 at 10:13 AM, Mark Dickinson <dickinsm@gmail.com> wrote:
2009/3/24 Daniel Stutzbach <daniel@stutzbachenterprises.com>:
[...]
100 nanoseconds, py3k trunk:
ceval -> PyObject_GetItem (abstract.c) -> list_subscript (listobject.c) -> PyNumber_AsSsize_t (abstract.c) -> PyLong_AsSsize_t (longobject.c)
[more timings snipped]
Does removing the PyLong_Check call in PyLong_AsSsize_t make any noticeable difference to these timings?
Making no other changes from the trunk, removing the PyLong_Check and NULL check from PyLong_AsSsize_t shaves off 4 nanoseconds (or around 4% since the trunk is around 100 nanoseconds).
Here's what I'm testing with, by the way:
./python.exe Lib/timeit.py -r 10 -s 'x = list(range(10))' 'x[5]'
What difference does it make on real applications? Are you running any macro-benchmarks against this?

Collin
participants (7)
- Antoine Pitrou
- Collin Winter
- Daniel Stutzbach
- Hrvoje Niksic
- Mark Dickinson
- Mike Klaas
- Raymond Hettinger