
Hi everyone, While looking over the PyLong source code in Objects/longobject.c I came across the fact that the PyLong object doesnt't include implementation for basic inplace operations such as adding or multiplication: [...] long_long, /*nb_int*/ 0, /*nb_reserved*/ long_float, /*nb_float*/ 0, /* nb_inplace_add */ 0, /* nb_inplace_subtract */ 0, /* nb_inplace_multiply */ 0, /* nb_inplace_remainder */ [...] While I understand that the immutable nature of this type of object justifies this approach, I wanted to experiment and see how much performance an inplace add would bring. My inplace add will revert to calling the default long_add function when: - the refcount of the first operand indicates that it's being shared or - that operand is one of the preallocated 'small ints' which should mitigate the effects of not conforming to the PyLong immutability specification. It also allocates a new PyLong _only_ in case of a potential overflow. The workload I used to evaluate this is a simple script that does a lot of inplace adding: import time import sys def write_progress(prev_percentage, value, limit): percentage = (100 * value) // limit if percentage != prev_percentage: sys.stdout.write("%d%%\r" % (percentage)) sys.stdout.flush() return percentage progress = -1 the_value = 0 the_increment = ((1 << 30) - 1) crt_iter = 0 total_iters = 10 ** 9 start = time.time() while crt_iter < total_iters: the_value += the_increment crt_iter += 1 progress = write_progress(progress, crt_iter, total_iters) end = time.time() print ("\n%.3fs" % (end - start)) print ("the_value: %d" % (the_value)) Running the baseline version outputs: ./python inplace.py 100% 356.633s the_value: 1073741823000000000 Running the modified version outputs: ./python inplace.py 100% 308.606s the_value: 1073741823000000000 In summary, I got a +13.47% improvement for the modified version. The CPython revision I'm using is 7f066844a79ea201a28b9555baf4bceded90484f from the master branch and I'm running on a I7 6700K CPU with Turbo-Boost disabled (frequency is pinned at 4GHz). Do you think that such an optimization would be a good approach ? Thank you, Catalin

On 8/31/2017 2:40 PM, Manciu, Catalin Gabriel wrote:
On my machine, the more realistic code, with an implicit C loop, the_value = sum(the_increment for i in range(total_iters)) gives the same value twice as fast as your explicit Python loop. (I cut total_iters down to 10**7). You might check whether sum uses an in-place accumulator for ints. -- Terry Jan Reedy

Your code is faster due to a number of reasons: - range in Python 3 is implemented in C so it's quite faster and, because your range only goes up to 10 ** 7, the fastest iterator is used: rangeiterobject for which the 'next' function is implemented using native longs instead of CPython PyLongs: rangeiter_next(rangeiterobject *r) from rangeobject.c - my code also does some extra work to output a progress indicator
The focus of this experiment was inplace adds in general. While, as you've shown, there are ways to write the loop optimally, the benchmark was written as a huge loop just to showcase that there is an improvement using this approach. The performance improvement is a result of not having to allocate/deallocate a PyLong per iteration. A huge Python program with lots of PyLong inplace operations (not just adds, this can be applied to all PyLong inplace operations), regardless of them being in a loop or not, might benefit from such an optimization. Thank you, Catalin

On Fri, Sep 1, 2017 at 9:35 AM, Manciu, Catalin Gabriel <catalin.gabriel.manciu@intel.com> wrote:
If you're writing a lot of numerical work in Python, have you tried running your code in PyPy? At very least, it's worth adding as another data point in your performance comparisons. ChrisA

On Thu, 31 Aug 2017 23:35:34 +0000 "Manciu, Catalin Gabriel" <catalin.gabriel.manciu@intel.com> wrote:
I'm skeptical there are some programs out there that are limited by the speed of PyLong inplace additions. Even if you have a bigint-intensive workload (say public-key cryptography, not that it's specifically a good idea to do so in pure Python), chances are it's spending much of its time in more sophisticated operations such as multiplication, power or division. In other words, while your experiment has intellectual and educational interest, I don't think it shows the path to a useful optimization. Regards Antoine.

On Thu, Aug 31, 2017 at 5:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm skeptical there are some programs out there that are limited by the speed of PyLong inplace additions.
indeed, but that could be said about any number of operations. My question is -- how can the interpreter know if it can alter what is supposed to be an immutable in-place? If it's used only internally to a function, the it would be safe, but how to know that? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

Is it true that checking for refcount == 1 is enough? What if a user wrote: args = (compute_integer(), 5) # give away args to someone int.__iadd__(*args) here `args[0]` still has refcount=1 because only `args` owns this integer. On Fri, Sep 1, 2017 at 4:19 PM, Jelle Zijlstra <jelle.zijlstra@gmail.com> wrote:

On Sat, Sep 2, 2017 at 6:35 AM, Joe Jevnik via Python-Dev <python-dev@python.org> wrote:
This particular example is safe, because the arguments get passed individually - so 'args' has one reference, plus there's one more for the actual function call (what would be 'self' if it were implemented in Python). There may be other examples that are more dangerous, but my suspicion is that the nature of += with anything other than a simple name will defeat the optimization, since the owning collection will retain a reference until __iadd__ returns. ChrisA

Chris Angelico wrote:
However, that's also true when you use the += operator, so if the optimisation is to trigger at all in any useful case, the refcount threshold needs to be set higher than 1. Some experiments I did suggest that if you set it high enough for x += y to trigger it, then it will also be triggered in Joe's case. BTW, isn't there already a similar optimisation somewhere for concatenating strings? Does it still exist? How does it avoid this issue? -- Greg

On Fri, Sep 1, 2017 at 6:16 PM, Joe Jevnik via Python-Dev < python-dev@python.org> wrote:
The string concat optimization happens in the interpreter dispatch for INPLACE_ADD
In that case, isn't there a new string being build, all in one line of python? that is, the string the optimization is being created at that point, so the interpreter can know there isn't anything referencing it anywhere else. -CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Sat, 2 Sep 2017 10:38:05 -0700 Chris Barker <chris.barker@noaa.gov> wrote:
You can go to the source and check yourself: https://github.com/python/cpython/blob/master/Python/ceval.c#L5226-L5277 Regards Antoine.

-CHB
You can just check the reference count of your object, it's a member of the PyObject structure which every CPython object contains: ob_refcnt. This will indicate if your object is referenced by other Python variables or by Python containers such as lists, tuples, dicts, sets etc.

On 8/31/2017 2:40 PM, Manciu, Catalin Gabriel wrote:
On my machine, the more realistic code, with an implicit C loop, the_value = sum(the_increment for i in range(total_iters)) gives the same value twice as fast as your explicit Python loop. (I cut total_iters down to 10**7). You might check whether sum uses an in-place accumulator for ints. -- Terry Jan Reedy

Your code is faster due to a number of reasons: - range in Python 3 is implemented in C so it's quite faster and, because your range only goes up to 10 ** 7, the fastest iterator is used: rangeiterobject for which the 'next' function is implemented using native longs instead of CPython PyLongs: rangeiter_next(rangeiterobject *r) from rangeobject.c - my code also does some extra work to output a progress indicator
The focus of this experiment was inplace adds in general. While, as you've shown, there are ways to write the loop optimally, the benchmark was written as a huge loop just to showcase that there is an improvement using this approach. The performance improvement is a result of not having to allocate/deallocate a PyLong per iteration. A huge Python program with lots of PyLong inplace operations (not just adds, this can be applied to all PyLong inplace operations), regardless of them being in a loop or not, might benefit from such an optimization. Thank you, Catalin

On Fri, Sep 1, 2017 at 9:35 AM, Manciu, Catalin Gabriel <catalin.gabriel.manciu@intel.com> wrote:
If you're writing a lot of numerical work in Python, have you tried running your code in PyPy? At very least, it's worth adding as another data point in your performance comparisons. ChrisA

On Thu, 31 Aug 2017 23:35:34 +0000 "Manciu, Catalin Gabriel" <catalin.gabriel.manciu@intel.com> wrote:
I'm skeptical there are some programs out there that are limited by the speed of PyLong inplace additions. Even if you have a bigint-intensive workload (say public-key cryptography, not that it's specifically a good idea to do so in pure Python), chances are it's spending much of its time in more sophisticated operations such as multiplication, power or division. In other words, while your experiment has intellectual and educational interest, I don't think it shows the path to a useful optimization. Regards Antoine.

On Thu, Aug 31, 2017 at 5:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm skeptical there are some programs out there that are limited by the speed of PyLong inplace additions.
indeed, but that could be said about any number of operations. My question is -- how can the interpreter know if it can alter what is supposed to be an immutable in-place? If it's used only internally to a function, the it would be safe, but how to know that? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

Is it true that checking for refcount == 1 is enough? What if a user wrote: args = (compute_integer(), 5) # give away args to someone int.__iadd__(*args) here `args[0]` still has refcount=1 because only `args` owns this integer. On Fri, Sep 1, 2017 at 4:19 PM, Jelle Zijlstra <jelle.zijlstra@gmail.com> wrote:

On Sat, Sep 2, 2017 at 6:35 AM, Joe Jevnik via Python-Dev <python-dev@python.org> wrote:
This particular example is safe, because the arguments get passed individually - so 'args' has one reference, plus there's one more for the actual function call (what would be 'self' if it were implemented in Python). There may be other examples that are more dangerous, but my suspicion is that the nature of += with anything other than a simple name will defeat the optimization, since the owning collection will retain a reference until __iadd__ returns. ChrisA

Chris Angelico wrote:
However, that's also true when you use the += operator, so if the optimisation is to trigger at all in any useful case, the refcount threshold needs to be set higher than 1. Some experiments I did suggest that if you set it high enough for x += y to trigger it, then it will also be triggered in Joe's case. BTW, isn't there already a similar optimisation somewhere for concatenating strings? Does it still exist? How does it avoid this issue? -- Greg

On Fri, Sep 1, 2017 at 6:16 PM, Joe Jevnik via Python-Dev < python-dev@python.org> wrote:
The string concat optimization happens in the interpreter dispatch for INPLACE_ADD
In that case, isn't there a new string being build, all in one line of python? that is, the string the optimization is being created at that point, so the interpreter can know there isn't anything referencing it anywhere else. -CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Sat, 2 Sep 2017 10:38:05 -0700 Chris Barker <chris.barker@noaa.gov> wrote:
You can go to the source and check yourself: https://github.com/python/cpython/blob/master/Python/ceval.c#L5226-L5277 Regards Antoine.

-CHB
You can just check the reference count of your object, it's a member of the PyObject structure which every CPython object contains: ob_refcnt. This will indicate if your object is referenced by other Python variables or by Python containers such as lists, tuples, dicts, sets etc.
participants (8)
-
Antoine Pitrou
-
Chris Angelico
-
Chris Barker
-
Greg Ewing
-
Jelle Zijlstra
-
Joe Jevnik
-
Manciu, Catalin Gabriel
-
Terry Reedy