Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
tl;dr: I'd like to deprecate and eventually remove the option to use 15-bit digits in the PyLong implementation. Before doing so, I'd like to find out whether there's anyone still using 15-bit PyLong digits, and if so, why they're doing so.

History: the use of 30-bit digits in PyLong was introduced for Python 3.1 and Python 2.7, to improve performance of int (Python 3) / long (Python 2) arithmetic. At that time, we retained the option to use 15-bit digits, for two reasons:

- (1) use of 30-bit digits required C99 features (uint64_t and friends) at a time when we hadn't yet committed to requiring C99
- (2) it wasn't clear whether 30-bit digits would be a performance win on 32-bit operating systems

Twelve years later, reason (1) no longer applies, and I suspect that:

- No-one is deliberately using the 15-bit digit option.
- There are few machines where using 15-bit digits is faster than using 30-bit digits.

But I don't have solid data on either of these suspicions, hence this post.

Removing the 15-bit digit option would simplify the code (there's significant mental effort required to ensure we don't break things for 15-bit builds when modifying Objects/longobject.c, and 15-bit builds don't appear to be exercised by the buildbots), remove a hidden compatibility trap (see b.p.o. issue 35037), widen the applicability of the various fast paths for arithmetic operations, and allow for some minor fast-path small-integer optimisations based on the fact that we'd be able to assume the presence of *two* extra bits in the C integer type rather than just one.

As an example of the latter: if `a` and `b` are PyLongs that fit in a single digit, then with 15-bit digits and a 16-bit `digit` and `sdigit` type, `a + b` can't currently safely (i.e., without undefined behaviour from overflow) be computed with the C type `sdigit`. With 30-bit digits and a 32-bit `digit` and `sdigit` type, `a + b` is safe.

Mark

*References*

Related b.p.o. issue: https://bugs.python.org/issue45569
MinGW compatibility issue: https://bugs.python.org/issue35037
Introduction of 30-bit digits: https://bugs.python.org/issue4258
On Thu, Dec 30, 2021 at 4:47 AM Mark Dickinson <dickinsm@gmail.com> wrote:
[...]
Tying the thread together: this is https://bugs.python.org/issue45569

Check 32-bit builds. When I pushed for the 30-bit digit implementation, I wanted it for all builds, but if I recall correctly it *might* have changed the minimum structure size for PyLong, which could've been an ABI issue? Double-check that.

32-bit is still important: Raspbian. The rpi, rpi zero, and first-rev rpi2 are 32-bit ARM architectures, so even with 64-bit Raspbian on the horizon, that won't be the norm. And for those, memory matters, so a 32-bit userspace on 64-bit-capable hardware is still preferred for small pointer sizes on the majority, which have <= 4 GiB RAM.

I believe performance was the other concern. 30-bit happens to perform great on 32-bit x86, as it has 32*32->64 multiply hardware. Most 32-bit architectures do not, AFAIK, making 30-bit digit multiplies less efficient. And 32-bit x86 was clearly on its way out by the time we adopted 30-bit, so it was simpler to just not do it on that dying snowflake of a platform. (Test it on Raspbian - it's the one that matters.)

Regardless of possible issues to work out, I'd love us to have a simpler 30-bit-only implementation.

Granted, modern 64-bit hardware often has 64*64->128-bit multiply hardware, so you can imagine going beyond 30 and winding up in complexity land again. At least the extra bits would be >= 2 at that point. The reason for digits being a multiple of 5 bits should be revisited vs its original intent and the current state of the art in "bignum optimized for mostly small numbers" at some point as well.

-gps
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZICIMX5V...
Code of Conduct: http://python.org/psf/codeofconduct/
On Thu, Dec 30, 2021 at 12:42 PM Gregory P. Smith <greg@krypto.org> wrote:
[...]
Historical context of adding the 30-bit support (also driven primarily by Mark, no surprise!) in late 2008 / early 2009: https://bugs.python.org/issue4258 (and https://codereview.appspot.com/14105)

-gps
[Gregory P. Smith <greg@krypto.org>]
The reason for digits being a multiple of 5 bits should be revisited vs its original intent
I added that. The only intent was to make it easier to implement bigint exponentiation, by viewing the exponent as being in base 32 (so as to chew up 5 bits at a time). Since the number of digit bits _was_ 15 at the time, and it was "obvious enough" that 30 bits would work fine when 64-bit boxes became important enough to care about, it was easy to settle on "multiple of 5" as a quick, pragmatic tradeoff. The number controls how much space and time is needed to precompute a lookup table for a "large" (many bits) power - and powers can be very large indeed in the modular exponentiation case.

I doubt Python will move beyond 30-bit digits. While 64-bit support became increasingly and steadily more widespread (both SW and HW) in the 32-bit world, I see very much less enthusiasm to move beyond 64 bits. Yes, 64x64 -> 128 multiply is spreading, but that's about it. Some applications can make very good use of that, but, in the grand scheme of things, they remain niche. Whole universes of apps are getting real benefit by moving from 32 to 64 bits - that's needed just to get integers big enough to represent physical memory addresses in many consumer desktops now. But we'll likely never need to go beyond 64 bits for that.

In any case, "multiple of 5" is a shallow constraint, and getting rid of it shouldn't require more than rewriting longobject.c's long_pow()'s

    for (j = PyLong_SHIFT - 5; j >= 0; j -= 5) {

loop in a more convoluted way to cross digit boundaries when trying to suck up "the next" 5 bits. In fact, there are good reasons to make that loop more convoluted anyway, but I never got around to it (it "should" instead be aiming to find "the next" 5-bit slice that ends with a 1-bit, which would allow cutting the aforementioned precomputed table in half).
The reason for digits being a multiple of 5 bits should be revisited vs its original intent
I added that. The only intent was to make it easier to implement bigint exponentiation easily ...
That said, I see the comments in longintrepr.h note a stronger constraint: """ the marshal code currently expects that PyLong_SHIFT is a multiple of 15 """ But that's doubtless also shallow.
Thanks all! So to summarize:

- 15-bit digits are still very much in use, and deprecating the option would likely be premature at this point
- the main users are old 32-bit (x86), which it's difficult to care about too much, and new 32-bit (principally ARM microarchitectures), which we *do* care about

So my first suspicion is just downright wrong. In particular, the decade-old logic that chooses 15-bit digits whenever SIZEOF_VOID_P < 8 is still in place (albeit with a recent modification for WebAssembly).

For the second suspicion, that "there are few machines where using 15-bit digits is faster than using 30-bit digits", we need more data. It looks as though the next step would be to run some integer-intensive benchmarks on 32-bit ARM, with both --enable-big-digits=15 and --enable-big-digits=30. If those show a win (or at least, not a significant loss) for 30-bit digits, then there's a case for at least making 30-bit digits the default, which would be a first step towards eventually dropping that support.

GPS: I'm not immediately seeing the ABI issue. If you're able to dig up more information on that, I'd be interested to see it.

Mark

On Fri, Dec 31, 2021 at 3:33 AM Tim Peters <tim.peters@gmail.com> wrote:
[...]
Perhaps I missed it, but maybe an action item would be to add a buildbot which configures for 15-bit PyLong digits.

Skip
On Fri, Dec 31, 2021 at 12:40 PM Skip Montanaro <skip.montanaro@gmail.com> wrote:
Perhaps I missed it, but maybe an action item would be to add a buildbot which configures for 15-bit PyLong digits.
Yep, good point. I was wrong to say that "15-bit builds don't appear to be exercised by the buildbots": there's a 32-bit Gentoo buildbot that's (implicitly) using 15-bit digits, and the GitHub Actions Windows/x86 build also uses 15-bit digits. I don't think we have anything that's explicitly using the `--enable-big-digits` option, though.

-- Mark
Regarding ABI issues, I don't see anything obvious either. I was probably misremembering the potential marshal issue, which was addressed.

struct _longobject (the implementation details behind the public PyLongObject typedef name) and the digit definition are excluded from Py_LIMITED_API. So per https://docs.python.org/3.10/c-api/stable.html we are free to change the struct layout. Yay.

Regardless, I have confirmed that sys.getsizeof(0) returns the same value (12) on a 32-bit build with both 15-bit and 30-bit (--enable-big-digits) configurations on 32-bit architectures (I checked ARM and x86). So it'd only "break" something depending on non-limited, minor-version-specific ob_digit definitions and using it on the wrong Python version. Not a big deal. People wanting that need to use Py_LIMITED_API in their extension code as per our existing policy.

The getsizeof increments go from 12, 14, 16, 18, 20 with 15-bit digits to 0 digits = 12, 1 digit = 16, 2 digits = 20, as expected when doubling the digit size, but this isn't a problem. Memory-allocator-wise, the same amount of RAM is going to be consumed by the same magnitude int regardless of how it gets built; nothing allocates and tracks at a 2-byte granularity.

Perhaps I missed it, but maybe an action item would be to add a buildbot which configures for 15-bit PyLong digits.
Yep, good point. I was wrong to say that "15-bit builds don't appear to be exercised by the buildbots": there's a 32-bit Gentoo buildbot that's (implicitly) using 15-bit digits, and the GitHub Actions Windows/x86 build also uses 15-bit digits. I don't think we have anything that's explicitly using the `--enable-big-digits` option, though.
My raspbian bot covers the 32-bit use case we primarily care about. (I should promote that to one to stable) I suggest just going for it. Remove 15-bit digit support and clean up the code. My guess is that there will not be a meaningful performance impact on 32-bit hosts. I'm happy to run some tests on a rpi once you've got a PR up if you don't already have a dozen of those laying around. :) -gps
On Fri, 31 Dec 2021 11:18:34 +0000 Mark Dickinson <dickinsm@gmail.com> wrote:
It looks as though the next step would be to run some integer-intensive benchmarks on 32-bit ARM, with both --enable-big-digits=15 and --enable-big-digits=30. If those show a win (or at least, not a significant loss) for 30-bit digits, then there's a case for at least making 30-bit digits the default, which would be a first step towards eventually dropping that support.
Note that ARM is merely an architecture with very diverse implementations having quite differing performance characteristics. If the concern is truly to have no performance regression, then tests on different CPU models would probably be required. Of course, we may also not be that concerned (do embedded platforms often rely on Python's bigint performance, for example?).
GPS: I'm not immediately seeing the ABI issue. If you're able to dig up more information on that, I'd be interested to see it.
We don't have an ABI (except the stable ABI which doesn't expose object internals), so this should be fine. Regards Antoine.
On Sat, Jan 1, 2022 at 9:05 PM Antoine Pitrou <antoine@python.org> wrote:
Note that ARM is merely an architecture with very diverse implementations having quite differing performance characteristics. [...]
Understood. I'd be happy to see timings on a Raspberry Pi 3, say. I'm not too worried about things like the RPi Pico - that seems like it would be more of a target for MicroPython than CPython.

Wikipedia thinks, and the ARM architecture manuals seem to confirm, that most 32-bit ARM instruction sets _do_ support the UMULL 32-bit-by-32-bit-to-64-bit multiply instruction. (From https://en.wikipedia.org/wiki/ARM_architecture#Arithmetic_instructions: "ARM supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores don't support 64-bit results.") Division may still be problematic.

-- Mark
On Sun, Jan 2, 2022 at 2:37 AM Mark Dickinson <dickinsm@gmail.com> wrote:
[...]
It's rather irrelevant anyways; the pi zero/one is the lowest-spec ARM that matters at all. Nobody is ever going to ship something worse than that capable of running CPython.

Anyways, I ran actual benchmarks on a pi3. On 32-bit raspbian I built CPython 3.10 with no configure flags and with --enable-big-digits (or however that's spelled) for 30-bit digits, and ran pyperformance 1.0.2 on them.

Caveat: This is not a good system to run benchmarks on: widely variable performance (it has a tiny heatsink which never meaningfully got hot), and the storage is a random microsd card. Each full pyperformance run took 6 hours. :P

Results basically say: no notable difference. Most benchmarks do not change, and the variability (look at those stddevs and how they overlap on the few things that produced a "significant" result at all) is quite high. Things wholly unrelated to integers, such as the various regex benchmarks showing up as faster, demonstrate the unreliability of the numbers - and also how pointless caring about this fine level of detail for performance is on this platform.
```
pi@pi3$ pyperf compare_to 15bit.json 30bit.json
2to3: Mean +- std dev: [15bit] 7.88 sec +- 0.39 sec -> [30bit] 8.02 sec +- 0.36 sec: 1.02x slower
crypto_pyaes: Mean +- std dev: [15bit] 3.22 sec +- 0.34 sec -> [30bit] 3.40 sec +- 0.22 sec: 1.06x slower
fannkuch: Mean +- std dev: [15bit] 13.4 sec +- 0.5 sec -> [30bit] 13.8 sec +- 0.5 sec: 1.03x slower
pickle_list: Mean +- std dev: [15bit] 74.7 us +- 22.1 us -> [30bit] 85.7 us +- 15.5 us: 1.15x slower
pyflate: Mean +- std dev: [15bit] 19.6 sec +- 0.6 sec -> [30bit] 19.9 sec +- 0.6 sec: 1.01x slower
regex_dna: Mean +- std dev: [15bit] 2.99 sec +- 0.24 sec -> [30bit] 2.81 sec +- 0.22 sec: 1.06x faster
regex_v8: Mean +- std dev: [15bit] 520 ms +- 71 ms -> [30bit] 442 ms +- 115 ms: 1.18x faster
scimark_monte_carlo: Mean +- std dev: [15bit] 3.31 sec +- 0.24 sec -> [30bit] 3.22 sec +- 0.24 sec: 1.03x faster
scimark_sor: Mean +- std dev: [15bit] 6.42 sec +- 0.34 sec -> [30bit] 6.27 sec +- 0.33 sec: 1.03x faster
spectral_norm: Mean +- std dev: [15bit] 4.85 sec +- 0.31 sec -> [30bit] 4.74 sec +- 0.20 sec: 1.02x faster
unpack_sequence: Mean +- std dev: [15bit] 1.42 us +- 0.42 us -> [30bit] 1.60 us +- 0.33 us: 1.13x slower

Benchmark hidden because not significant (47): chameleon, chaos, deltablue, django_template, dulwich_log, float, go, hexiom, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process
```

Rerunning a mere few of those in --rigorous mode for more runs does not significantly improve the stddev, so I'm not going to let that finish.

My recommendation: proceed with removing 15-bit bignum digit support. 30-bit-only future with simpler, better code, here we come.

-gps
On Mon, 3 Jan 2022 22:40:25 -0800 "Gregory P. Smith" <greg@krypto.org> wrote:
rerunning a mere few of those in --rigorous mode for more runs does not significantly improve the stddev so I'm not going to let that finish.
The one benchmark that is bigint-heavy is pidigits AFAIK, so you might want to re-run that one if you want a more rigorous confirmation that there is no regression.
my recommendation: proceed with removing 15-bit bignum digit support. 30-bit only future with simpler better code here we come.
Sounds reasonable to me as well. Regards Antoine.
I have a spare RPi zero that I could try to set up as a buildbot. Would that be useful?

On Tue, Jan 4, 2022 at 10:59 AM Antoine Pitrou <antoine@python.org> wrote:
[...]
On Tue, Jan 4, 2022 at 3:24 AM Tal Einat <taleinat@gmail.com> wrote:
I have a spare RPi zero that I could try to set up as a buildbot. Would that be useful?
No need. We've already got a 32-bit raspbian bot; adding another wouldn't add value. The rpi0/1/2 are too slow to compile on anyways.

-gps
Perhaps relevant for perspective: We did some review of the pyperformance benchmarks based on how noisy they are: https://github.com/faster-cpython/ideas/discussions/142

Note that pidigits is the noisiest -- its performance changes up to 11% for no good reason. The regex bms are also very noisy.

On Mon, Jan 3, 2022 at 10:44 PM Gregory P. Smith <greg@krypto.org> wrote:
[...]
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dickinsm@gmail.com> wrote:
Division may still be problematic.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.

I was curious to find out what the potential impact of the failure to use DIVL was, so I ran some timings. A worst-case target is division of a large (multi-digit) integer by a single-digit integer (where "digit" means digit in the sense of PyLong digit, not decimal digit), since that involves multiple CPU division instructions in a fairly tight loop.

Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6, non-optimised non-debug Python build), a single division of 10**1000 by 10 takes ~1018ns on the current main branch and ~722ns when forced to use the DIVL instruction (by inserting inline assembly into the inplace_divrem1 function). IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
For anyone who wants to play along, here's the implementation of inplace_divrem1 (in longobject.c) that I was using:

```
static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
    digit remainder = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend =
            ((twodigits)remainder << PyLong_SHIFT) | pin[size];
        digit quotient, high, low;
        high = (digit)(dividend >> 32);
        low = (digit)dividend;
        __asm__("divl %2\n"
                : "=a" (quotient), "=d" (remainder)
                : "r" (n), "a" (low), "d" (high));
        pout[size] = quotient;
    }
    return remainder;
}
```

I don't know whether we *really* want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.

-- Mark
[Mark Dickinson <dickinsm@gmail.com>]
Division may still be problematic.
Heh. I'm half convinced that heavy-duty bigint packages are so often written in assembler because their authors are driven insane by trying to trick C compilers into generating "the obvious" machine instructions needed.

An alternative to HW division is to multiply by a scaled integer reciprocal instead, but it's very delicate to get all cases right. GMP heavyweights Niels Moller and Torbjorn Granlund wrote up the most efficient scheme I've seen of this ilk in their paper "Improved division by invariant integers". It requires a single x single -> double multiply, and another for which only the low bits are needed, to get the right quotient and remainder.

Ironically, it would run faster if CPython used all 32 bits of its internal digits - some operations in their algorithm are expected to overflow at times (including + and -!), and it's crucial that they be done modulo the base in use. That happens "by magic" if the base matches the unsigned type's width.

They only need a single-width scaled reciprocal, which can be computed with a HW double / single -> single divide if available, or via excruciating division-free scaled integer Newton refinement. Of course the cost of computing the reciprocal once pays off each time the same denominator is used.

Curiously, from the paper:

"""
It is curious that on these processors, the combination of our reciprocal algorithm (Alg. 2) and division algorithm (Alg. 4) is significantly faster than the built in assembler instruction for 2/1 division.
"""

HW division may have gotten faster since then, though.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder).
Which is an instance of what they mean by "2/1 division" above.
... IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
... I don't know whether we really want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.
Indeed it is! But changing the problem to use multiplication instead may be both faster and more portable, albeit at the cost of subtler code.
On Fri, Jan 14, 2022 at 9:50 AM Mark Dickinson <dickinsm@gmail.com> wrote:
On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dickinsm@gmail.com> wrote:
Division may still be problematic.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.
I was curious to find out what the potential impact of the failure to use DIVL was, so I ran some timings. A worst-case target is division of a large (multi-digit) integer by a single-digit integer (where "digit" means digit in the sense of PyLong digit, not decimal digit), since that involves multiple CPU division instructions in a fairly tight loop.
Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6, non-optimised non-debug Python build), a single division of 10**1000 by 10 takes ~1018ns on the current main branch and ~722ns when forced to use the DIVL instruction (by inserting inline assembly into the inplace_divrem1 function). IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
For anyone who wants to play along, here's the implementation of inplace_divrem1 (in longobject.c) that I was using:

```
static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
    digit remainder = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend =
            ((twodigits)remainder << PyLong_SHIFT) | pin[size];
        digit quotient, high, low;
        high = (digit)(dividend >> 32);
        low = (digit)dividend;
        __asm__("divl %2\n"
                : "=a" (quotient), "=d" (remainder)
                : "r" (n), "a" (low), "d" (high));
        pout[size] = quotient;
    }
    return remainder;
}
```
I don't know whether we *really* want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.
That only appears true in default boring -O2 builds. Use `./configure --enable-optimizations` and the C version is *much* faster than your asm one: 250ns for C vs 370ns for your asm divl variant, using old gcc 9.3 on my zen3 when compiled with --enable-optimizations.

Tested using ` -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//10' `

Use of __asm__ appears to have prevented the compiler from fully optimizing that code in PGO/FDO mode. I trust the compiler toolchain to know what's close to best here. Optimizing use of a div instruction isn't entirely straightforward, as on many microarchitectures the time required varies based on the inputs: they'll internally implement looping when values exceed the bits with which their HW operates.

There are probably interesting ways to optimize bignum division using OpenCL and vector hardware as well - for the case when you know you've got over a dozen digits; but that's what numpy et al. are for. Bignums in Python are a convenience. Most values normal code deals with are less than 2**100.

-Greg
[Gregory P. Smith <greg@krypto.org>]
... That only appears true in default boring -O2 builds. Use `./configure --enable-optimizations` and the C version is *much* faster than your asm one...
250ns for C vs 370ns for your asm divl one using old gcc 9.3 on my zen3 when compiled using --enable-optimizations.
Something is missing here, but I can't guess what without seeing the generated machine code. But I trust Mark will do that.
tested using ` -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//10' `
Use of __asm__ appears to have prevented the compiler from being able to fully optimize that code in PGO/FDO mode.
I trust the compiler toolchain to know what's close to best here.
Mark is extremely experienced in numeric programming, and is the most capable numerical analyst contributing to CPython. That's why he trusts nothing ;-)
... there's probably interesting ways to optimize bignum division using opencl and vector hardware as well - for the case when you know you've got over a dozen digits; but that's what numpy et. al. are for.
numpy has no support for bigints, beyond allowing the use of PyObject* in numpy arrays (dtype='object'). In that case _all_ the bigint math is performed by the hosting Python implementation. People dead serious about bigint speed use the GMP library, which is a massive amount of code and 100% devoted to peak speed. In Python, gmpy2 supplies Python bindings for the GMP/MPIR, MPFR, and MPC libraries.
Bignums in python are a convenience. Most values normal code deals with are less than 2**100.
There's very little that "most normal code" uses. Not bigints, not HTML, not regexps, not sockets, on & on & on. Whichever bubbles you live in are nevertheless bubbles ;-)

The speed of + - * // imposes fundamental limits on the speed of almost all bigint algorithms. A reasonably quick win for bigint // would be most welcome. Anyone dead serious about working with bigints in Python "should" install gmpy2. But it's still fun to prototype work in plain old Python.

Note: I added an implementation of Karatsuba multiplication to CPython two decades ago (which gives a major O() improvement over "schoolbook" multiplication). I've often said that was a mistake, which I wouldn't make again, because it added relative mountains of new code to longobject.c, and people who need fast bigint * should _still_ look to GMP anyway (which implements several additional, even more gonzo, * algorithms). But neither do I want to remove it. It works fine, and the speed boost is welcome (in my bubbles, multi-thousand-bit ints are common).

What Mark is looking at here has a _very_ much more limited scope: the two (I think) places in longobject.c that are dividing two native-size machine ints in time-critical loops.

"But nobody divides big ints anyway - why would they?" For example, if you're investigating crypto schemes, modular exponentiation with multi-thousand-bit ints can be common, and under the covers, CPython uses the same code for both // and %. One modular exponentiation has to do a * and % for each bit position in the exponent, plus more proportional to the number of bits set in the exponent.

About trusting "the toolchain": here under Win64 on a capable-enough desktop box (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 4 cores, 8 logical processors), using the released 3.10.1:

```
$ python --version
Python 3.10.1

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"
1500000 loops, best of 5: 376 nsec per loop
```

Which actually makes little sense to me.
10**1000 requires 111 CPython "digits", which is the number of times the loop in `inplace_divrem1()` has to go around. Under 4 nsec per iteration seems close to impossibly fast on a 3.8GHz box, given the presence of any division instruction.

However, dividing by 10 is not a worst case on this box. Dividing by 100 is over 3x slower:

```
$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
1500000 loops, best of 5: 1.25 usec per loop
```

Dividing by the largest single Python "digit" is apparently the same:

```
$ python -m timeit -n 1500000 -s "x = 10**1000; d=2**30-1" "x//d"
1500000 loops, best of 5: 1.29 usec per loop
```

In fact, it's much the same story for dividing by 1, 2, 3, 4, 5, 6, 7, 8, 9, and 11. Got bored then ;-) They're _all_ much slower than dividing by 10 in this case. Trust nothing :-)
On Sat, Jan 15, 2022 at 8:12 PM Tim Peters <tim.peters@gmail.com> wrote:
Something is missing here, but can't guess what without seeing the generated machine code.But I trust Mark will do that.
Welp, there goes my weekend. :-)

```
$ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"
1500000 loops, best of 5: 376 nsec per loop
```
Which actually makes little sense to me. [...] Under 4 nsec per iteration seems
close to impossibly fast on a 3.8GHz box, given the presence of any
division instruction.
<snip> However, dividing by 10 is not a worst case on this box. Dividing by
100 is over 3x slower:
```
$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
1500000 loops, best of 5: 1.25 usec per loop
```
Now *that* I certainly wasn't expecting. I don't see the same effect on macOS / Clang, whether compiling with --enable-optimizations or not; this appears to be a GCC innovation. And indeed, as Tim suggested, it turns out that there's no division instruction present in the loop for the division-by-10 case - we're doing division via multiplication by the reciprocal. In Python terms, we're computing `x // 10` as `(x * 0xcccccccccccccccd) >> 67`.

Here's the tell-tale snippet of the assembly output from the second compilation (the one that makes use of the generated profile information) of longobject.c at commit 09087b8519316608b85131ee7455b664c00c38d2 <https://github.com/python/cpython/blob/09087b8519316608b85131ee7455b664c00c38d2/Objects/longobject.c> on a Linux box, with GCC 11.2.0. I added a couple of comments, but it's otherwise unaltered:

```
        .loc 1 1632 36 view .LVU12309
        movl    %r13d, %r11d
        salq    $2, %rbp
        cmpl    $10, %r13d      # compare divisor 'n' with 10, and
        jne     .L2797          # go to the slow version if n != 10
        leaq    1(%r10), %r9    # from here on, the divisor is 10
        addq    %rbp, %r8
.LVL3442:
        .loc 1 1632 36 view .LVU12310
        addq    %rbp, %rdi
.LVL3443:
        .loc 1 1632 36 view .LVU12311
.LBE8049:
        .loc 1 1624 15 view .LVU12312
        xorl    %r13d, %r13d
.LVL3444:
        .loc 1 1624 15 view .LVU12313
        movabsq $-3689348814741910323, %r11  # magic constant 0xcccccccccccccccd
                                             # for division by 10
```

and then a few lines later:

```
        .loc 1 1630 9 is_stmt 1 view .LVU12316
        .loc 1 1631 9 view .LVU12317
        .loc 1 1631 39 is_stmt 0 view .LVU12318
        movl    (%r8,%r10,4), %r14d  # move top digit of the dividend into
                                     # the low word of r14
.LVL3446:
        .loc 1 1632 9 is_stmt 1 view .LVU12319
        movq    %r14, %rax           # set up for division: top digit is now in rax
        .loc 1 1633 13 is_stmt 0 view .LVU12320
        movq    %r14, %r13
        mulq    %r11                 # here's the division by 10: multiply by
                                     # the magic constant
        shrq    $3, %rdx             # and divide by 8 (via a shift)
```

and then it all gets a bit repetitive and boring - there's a lot of loop unrolling going on.
So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation. TIL ... -- Mark
On 1/16/2022 7:08 AM, Mark Dickinson wrote:
Now /that/ I certainly wasn't expecting. I don't see the same effect on macOS / Clang, whether compiling with --enable-optimizations or not; this appears to be a GCC innovation. And indeed, as Tim suggested, it turns out that there's no division instruction present in the loop for the division-by-10 case - we're doing division via multiplication by the reciprocal. In Python terms, we're computing `x // 10` as `(x * 0xcccccccccccccccd) >> 67`.
https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication... and https://stackoverflow.com/questions/30790184/perform-integer-division-using-... have multiple discussions of the technique of machine division by invariant (small) ints and GCC's use thereof (only suppressed with -Os?).

-- Terry Jan Reedy
On Sun, Jan 16, 2022 at 4:11 PM Terry Reedy <tjreedy@udel.edu> wrote:
https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication...
and
https://stackoverflow.com/questions/30790184/perform-integer-division-using-...
have multiple discussions of the technique of machine division by invariant (small) ints and GCC's use thereof (only suppressed with -Os?).
Yes, it's an old and well-known technique, and compilers have been using it for division by a known-at-compile-time constant for many decades. What's surprising here is the use by GCC in a situation where the divisor is *not* known at compile time - that GCC essentially guesses that a divisor of 10 is common enough to justify special-casing. There's also the libdivide library[1], which caters to situations where you have a divisor not known at compile time but you know you're going to be using it often enough to compensate for the cost of computing the magic multiplier dynamically at run time. [1] https://libdivide.com -- Mark
On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson <dickinsm@gmail.com> wrote:
So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.
Nope, that's not what's happening. This analysis is backwards, and unfairly attributes to GCC the apparently arbitrary choice to optimise division by 10. But it's not GCC's fault; it's ours.

What's *actually* happening is that GCC is simply recording values for n used in calls to divrem1 (via the -fprofile-values option, which is implied by -fprofile-generate, which is used as a result of the --enable-optimizations configure script option). It's then noticing that in our profile task (which consists of a selection of Lib/test/test_*.py test files) we most often do divisions by 10, and so it optimizes that case.

To test this hypothesis I added a large number of tests for division by 17 in test_long.py, and then recompiled from scratch (again with --enable-optimizations). Here are the results:

```
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=10" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=17" "x//y"
1000000 loops, best of 5: 306 nsec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=1" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=2" "x//y"
1000000 loops, best of 5: 1.15 usec per loop
```

As expected, division by 17 is now optimised; division by 10 is as slow as division by other small scalars.

-- Mark
As a bystander, this is all fascinating (I had actually anticipated that the //10 optimization came from PGO).

Does the optimization for //10 actually help in the real world? It would if people did a lot of manual conversion to decimal, which is easiest expressed using //10. But presumably for that people mostly end up using str() or repr(), which has its own custom code, long_to_decimal_string_internal().

Then again I'm not sure what's *lost* even if this optimization is pointless -- surely it doesn't slow other divisions down enough to be measurable.

On Sun, Jan 16, 2022 at 12:35 PM Mark Dickinson <dickinsm@gmail.com> wrote:
On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson <dickinsm@gmail.com> wrote:
So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.
Nope, that's not what's happening. This analysis is backwards, and unfairly attributes to GCC the apparently arbitrary choice to optimise division by 10. But it's not GCC's fault; it's ours. What's *actually* happening is that GCC is simply recording values for n used in calls to divrem1 (via the -fprofile-values option, which is implied by -fprofile-generate, which is used as a result of the --enable-optimizations configure script option). It's then noticing that in our profile task (which consists of a selection of Lib/test/test_*.py test files) we most often do divisions by 10, and so it optimizes that case.
To test this hypothesis I added a large number of tests for division by 17 in test_long.py, and then recompiled from scratch (again with --enable-optimizations). Here are the results:
```
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=10" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=17" "x//y"
1000000 loops, best of 5: 306 nsec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=1" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=2" "x//y"
1000000 loops, best of 5: 1.15 usec per loop
```
As expected, division by 17 is now optimised; division by 10 is as slow as division by other small scalars.
-- Mark
-- --Guido van Rossum (python.org/~guido)
On Sun, Jan 16, 2022 at 9:28 PM Guido van Rossum <guido@python.org> wrote:
Does the optimization for //10 actually help in the real world? [...]
Yep, I don't know. If 10 is *not* the most common small divisor in real-world code, it must at least rank in the top five. I might hazard a guess that division by 2 would be more common, but I've no idea how one would go about establishing that.

The reason that the divisor of 10 is turning up from the PGO isn't a particularly convincing one - it looks as though it's a result of our testing the builtin int-to-decimal-string conversion by comparing with an obviously-correct repeated-division-by-10 algorithm.

Then again I'm not sure what's *lost* even if this optimization is pointless -- surely it doesn't slow other divisions down enough to be measurable.
Agreed. That at least is testable. I can run some timings (but not tonight). -- Mark
On Sun, Jan 16, 2022 at 1:51 PM Mark Dickinson <dickinsm@gmail.com> wrote:
On Sun, Jan 16, 2022 at 9:28 PM Guido van Rossum <guido@python.org> wrote:
Does the optimization for //10 actually help in the real world? [...]
Yep, I don't know. If 10 is *not* the most common small divisor in real world code, it must at least rank in the top five. I might hazard a guess that division by 2 would be more common, but I've no idea how one would go about establishing that.
All of the int constants relating to time and date calculations show up frequently as well. But I'd assume -fprofile-values isn't likely to pick many values to specialize on (to avoid adding branches), so maybe 10 ironically is it. --enable-optimizations with clang doesn't trigger value specialization (I'm pretty sure clang supports the concept, but I've never looked at how).
The reason that the divisor of 10 is turning up from the PGO isn't a particularly convincing one - it looks as though it's a result of our testing the builtin int-to-decimal-string conversion by comparing with an obviously-correct repeated-division-by-10 algorithm.
Then again I'm not sure what's *lost* even if this optimization is
pointless -- surely it doesn't slow other divisions down enough to be measurable.
Agreed. That at least is testable. I can run some timings (but not tonight).
BTW, I am able to convince clang 11 and higher to produce a 64:32 divide instruction with a modified version of the code. Basically just taking your assembly divl variant as an example and writing that explicitly as the operations in C: https://godbolt.org/z/63eWPczjx

Taking that code and turning it into an actual test within CPython itself, it appears to deliver the desired speedup in gcc9 as well: https://github.com/python/cpython/pull/30626 for https://bugs.python.org/issue46406. 20% faster in microbenchmarks with x//1 or x//17 or other non-specialized divide values; a similar speedup even in --enable-optimizations builds, with both gcc9 and clang13. The compilers seem happier optimizing that code.

-gps
[Tim, incidentally notes that passing 10 as the divisor to inplace_divrem1() is "impossibly fast" on Windows, consuming less than a third the time as when passing seemingly any other divisor]

[Mark Dickinson, discovers much the same is true under other, but not all, Linux-y builds, due to the generated code doing a runtime(!) check for 10 and using a scaled integer reciprocal multiplication trick, instead of HW division, in that specific case]

And there's a picture of this in the dictionary, next to the "dubious optimization" entry ;-)

I have to believe the same is true under Visual Studio 2019, but offhand don't know how to prove that. I understand Steve uses PGO to build the python.org Windows release, but I've never done that - the "Release build" configuration I get from Github does not use PGO, and the code I get from my own "release builds" is equally slow for all divisors (and the generated assembler is obviously not trying to special-case anything).

You later gave PGO the credit, which made sense from the start. But I _assumed_ PGO was latching on to // 10 because all sorts of programs use that all the time simply as a consequence of converting Python integers to strings. But that's not it: the base-conversion code doesn't call inplace_divrem1() - it does its own division loops with compile-time constants. So, as you already said ;-), it appears to be mostly an accident due to the specific tests we feed to PGO builds.

More interesting to me now is the "over 3x faster" empirical observation. The paper I referenced earlier has a cheaper way to use scaled integer reciprocals, requiring nothing worse than a 32x32->64 int multiply, and, to compute the fudged reciprocal, nothing worse than a 64x32->32 int divide. OTOH, libdivide already has working code ready to use, and - unlike the paper I referenced - does not require "normalizing" the divisor (shifting left to ensure it's >= base/2).
libdivide needs more cycles to compute its 64-bit reciprocal, but in the context of bigints that's amortized over the number of digits in the numerator. A win is a win ;-)
On Sun, Jan 16, 2022 at 2:21 PM Tim Peters <tim.peters@gmail.com> wrote:
I have to believe the same is true under Visual Studio 2019, but offhand don't know how to prove that. I understand Steve uses PGO to build the python.org Windows release, but I've never done that - the "Release build" configuration I get from Github does not use PGO, and the code I get from my own "release builds" is equally slow for all divisors (and the generated assembler is obviously not trying to special-case anything).
I don't think there's a way to do a PGO build from Visual Studio; but a command prompt in the repo can do it using `PCbuild\build.bat --pgo`. Just be patient with it.

-- --Guido van Rossum (python.org/~guido)
[Guido]
I don't think there's a way to do a PGO build from Visual Studio; but a command prompt in the repo can do it using `PCbuild\build.bat --pgo`. Just be patient with it.
Thanks! That worked, and was easy, and gave me an executable that runs "// 10" at supernatural speed.

Alas, Visual Studio will not show a disassembly window unless the debugger is running, and there appears to be no way to convince it to run its debugger without it first recompiling the source file you're staring at. Which it recompiles without benefit of PGO.

So, in the end, it was an elaborate way to reproduce the non-PGO optimized machine code I already saw :-)
On 17 Jan 2022, at 06:35, Tim Peters <tim.peters@gmail.com> wrote:
[Guido]
I don't think there's a way to do a PGO build from Visual Studio; but a command prompt in the repo can do it using `PCbuild\build.bat --pgo`. Just be patient with it.
Thanks! That worked, and was easy, and gave me an executable that runs "// 10" at supernatural speed.
Alas, Visual Studio will not show a disassembly window unless the debugger is running, and there appears to be no way to convince it to run its debugger without it first recompiling the source file you're staring at. Which it recompiles without benefit of PGO.
The trick you need is to close the project you use to build python from source; then you can open the python.exe and run that under the debugger. Because it can find the python.pdb, it will show source/disasm as you want.

Barry
So, in the end, it was an elaborate way to reproduce the non-PGO optimized machine code I already saw :-)
On 1/17/2022 8:47 PM, Barry Scott wrote:
On 17 Jan 2022, at 06:35, Tim Peters <tim.peters@gmail.com> wrote:
[Guido]
I don't think there's a way to do a PGO build from Visual Studio; but a command prompt in the repo can do it using `PCbuild\build.bat --pgo`. Just be patient with it.
Thanks! That worked, and was easy, and gave me an executable that runs "// 10" at supernatural speed.
Alas, Visual Studio will not show a disassembly window unless the debugger is running, and there appears to be no way to convince it to run its debugger without it first recompiling the source file you're staring at. Which it recompiles without benefit of PGO.
The trick you need is to close the project you use to build python from source; then you can open the python.exe and run that under the debugger. Because it can find the python.pdb, it will show source/disasm as you want.
Or you can Debug/Attach to the process if you start it running outside of Visual Studio.

You should also be able to just select the "PGUpdate" configuration (either from the dropdown in the toolbar, or Build/Configuration Manager). Provided you've done a PGO build from the command line, it will have the profile, and should be able to fairly quickly rebuild and reapply it - even with code changes - and start debugging.

Or you can try to enable the preprocessed assembler output from the compiler. I'm not sure how that handles PGO though, since the assembly comes from the compiler but it's the linker that does most of the optimisation.

Attaching to a running process is what I normally do.

Cheers, Steve
[Barry Scott and Steve Dower share tips for convincing Visual Studio to show assembler without recompiling the file]

Thanks, fellows! That mostly ;-) worked. Problem remaining is that breakpoints just didn't work. They showed up "visually", and in the table of set breakpoints, but code went whizzing right by them every time. I didn't investigate. It's possible, e.g., that the connection between C source and the generated PGO code was so obscure that VS just gave up - or just blew it.

Instead I wrote a Python loop to run a division of interest "forever". That way I hoped I'd be likely to land in the loop of interest by luck when I broke into the process. Which worked! So here's the body of the main loop:

    00007FFE451D2760  mov   eax,dword ptr [rcx-4]
    00007FFE451D2763  lea   rcx,[rcx-4]
    00007FFE451D2767  shl   r9,1Eh
    00007FFE451D276B  or    r9,rax
    00007FFE451D276E  cmp   r8,0Ah
    00007FFE451D2772  jne   long_div+25Bh (07FFE451D27CBh)
    00007FFE451D2774  mov   rax,rdi
    00007FFE451D2777  mul   rax,r9
    00007FFE451D277A  mov   rax,rdx
    00007FFE451D277D  shr   rax,3
    00007FFE451D2781  mov   dword ptr [r10+rcx],eax
    00007FFE451D2785  mov   eax,eax
    00007FFE451D2787  imul  rax,r8
    00007FFE451D278B  sub   r9,rax
    00007FFE451D278E  sub   rbx,1
    00007FFE451D2792  jns   long_div+1F0h (07FFE451D2760h)

And above the loop is this line, which you'll recognize as loading the same scaled reciprocal of 10 as the gcc code Mark posted earlier (the code above moves %rdi into %rax before the mul instruction):

    00007FFE451D2747  mov   rdi,0CCCCCCCCCCCCCCCDh

Note an odd decision here: the MS code compares the divisor to 10 on _every_ iteration. There are not two ("10 or not 10?") loop bodies. Instead, if the divisor isn't 10, "jne long_div+25Bh" jumps to code not shown here, a few instructions that use hardware division, and then jump back into the tail end of the loop above to finish computing the remainder (etc).
So they not only optimized division by 10, they added a useless test and two branches to every iteration of the loop when we're not dividing by 10 ;-)
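[For readers following along: the constant 0CCCCCCCCCCCCCCCDh in the disassembly above is ceil(2**67 / 10), and the mul/shr pair computes floor(x / 10) without a division instruction. A quick sketch in Python - my own illustration, not code from the thread - checks the identity, using Python's arbitrary-precision ints to stand in for the 64x64 -> 128-bit multiply:]

```python
# Division by 10 via multiplication by a scaled reciprocal, mirroring
# the 0CCCCCCCCCCCCCCCDh constant in the MSVC (and gcc) output.
MAGIC = 0xCCCCCCCCCCCCCCCD
assert MAGIC == (2**67 + 9) // 10  # ceil(2**67 / 10)

def div10(x):
    # 64x64 -> 128-bit multiply; the high 64 bits land in rdx on x86-64.
    hi = (x * MAGIC) >> 64
    # The "shr rax,3" step: divide the high half by 8, giving x // 10.
    return hi >> 3

# Verify over edge cases spanning the full unsigned 64-bit range.
for x in (0, 1, 9, 10, 11, 123456789, 2**63, 2**64 - 1):
    assert div10(x) == x // 10
```

The error introduced by rounding the reciprocal up is small enough (10 * MAGIC exceeds 2**67 by only 2) that the result is exact for every 64-bit input, which is why compilers can emit this unconditionally for unsigned division by 10.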
participants (10)
- Antoine Pitrou
- Barry Scott
- Gregory P. Smith
- Guido van Rossum
- Mark Dickinson
- Skip Montanaro
- Steve Dower
- Tal Einat
- Terry Reedy
- Tim Peters