On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dickinsm@gmail.com> wrote:
Division may still be problematic.

On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.

I was curious to find out what the potential impact of the failure to use DIVL was, so I ran some timings. A worst-case target is division of a large (multi-digit) integer by a single-digit integer (where "digit" means digit in the sense of PyLong digit, not decimal digit), since that involves multiple CPU division instructions in a fairly tight loop.

Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6, non-optimised non-debug Python build), a single division of 10**1000 by 10 takes ~1018ns on the current main branch and ~722ns when forced to use the DIVL instruction (by inserting inline assembly into the inplace_divrem1 function). IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.

For anyone who wants to play along, here's the implementation of the inplace_divrem1 (in longobject.c) that I was using:

static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
digit remainder = 0;

assert(n > 0 && n <= PyLong_MASK);
while (--size >= 0) {
twodigits dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size];
digit quotient, high, low;
high = (digit)(dividend >> 32);
low = (digit)dividend;
__asm__("divl %2\n"
: "=a" (quotient), "=d" (remainder)
: "r" (n), "a" (low), "d" (high)
pout[size] = quotient;
return remainder;

I don't know whether we really want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.