On Fri, Jan 14, 2022 at 9:50 AM Mark Dickinson <dickinsm@gmail.com> wrote:
On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dickinsm@gmail.com> wrote:
Division may still be problematic.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division producing a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and a call to the __udivdi3 libgcc helper on 32-bit x86.
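To make the building block concrete, here's a minimal sketch of the C idiom in question (the helper name is mine, not CPython's). Written in portable C, both operands are widened before the division, which is why the compilers reach for the wider DIVQ instruction on x64 instead of DIVL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit-by-32-bit unsigned division, 32-bit
   quotient and remainder.  In portable C the division is done at
   64-bit width, so on x64 GCC and Clang emit the slower 128-by-64
   DIVQ rather than the 64-by-32 DIVL. */
static uint32_t
divrem_64_by_32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    /* The multi-precision loop guarantees the quotient fits in 32
       bits (DIVL would fault otherwise). */
    assert(divisor != 0 && (dividend >> 32) < divisor);
    uint32_t quotient = (uint32_t)(dividend / divisor);
    *remainder = (uint32_t)(dividend % divisor);
    return quotient;
}
```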
I was curious to find out what the potential impact of the failure to use DIVL was, so I ran some timings. A worst-case target is division of a large (multi-digit) integer by a single-digit integer (where "digit" means digit in the sense of PyLong digit, not decimal digit), since that involves multiple CPU division instructions in a fairly tight loop.
Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6, non-optimised non-debug Python build), a single division of 10**1000 by 10 takes ~1018ns on the current main branch and ~722ns when forced to use the DIVL instruction (by inserting inline assembly into the inplace_divrem1 function). IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
For anyone who wants to play along, here's the implementation of the inplace_divrem1 (in longobject.c) that I was using:
static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
    digit remainder = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend =
            ((twodigits)remainder << PyLong_SHIFT) | pin[size];
        digit quotient, high, low;
        high = (digit)(dividend >> 32);
        low = (digit)dividend;
        __asm__("divl %2\n"
                : "=a" (quotient), "=d" (remainder)
                : "r" (n), "a" (low), "d" (high));
        pout[size] = quotient;
    }
    return remainder;
}
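For comparison, here's a self-contained pure-C sketch of the same loop, modeled on (but not identical to) what longobject.c does today - the typedefs are stand-ins for CPython's internal 30-bit-digit types. This is the version where the compiler sees a 64-by-32 division but emits DIVQ, and the remainder is recovered with a separate multiply rather than read straight out of EDX:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for CPython's internal types (30-bit digits). */
typedef uint32_t digit;
typedef uint64_t twodigits;
#define PyLong_SHIFT 30
#define PyLong_MASK  ((digit)((1U << PyLong_SHIFT) - 1))

/* Divide the multi-digit number pin[0..size-1] (least significant
   digit first) by the single digit n, writing the quotient digits to
   pout and returning the remainder. */
static digit
divrem1_c(digit *pout, const digit *pin, ptrdiff_t size, digit n)
{
    twodigits rem = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend = (rem << PyLong_SHIFT) | pin[size];
        digit quotient = (digit)(dividend / n);
        rem = dividend - (twodigits)quotient * n;
        pout[size] = quotient;
    }
    return (digit)rem;
}
```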
I don't know whether we *really* want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.
That only appears true in default boring -O2 builds. Use `./configure --enable-optimizations` and the C version is *much* faster than your asm one: 250ns for C vs. 370ns for your asm divl one, using old gcc 9.3 on my zen3 when compiled with --enable-optimizations. Tested using:

    -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//10'

Use of __asm__ appears to have prevented the compiler from fully optimizing that code in PGO/FDO mode. I trust the compiler toolchain to know what's closest to best here. Optimizing use of a div instruction isn't entirely straightforward: on many microarchitectures the time required varies with the inputs, as they'll internally loop when values exceed the width their hardware operates on.

There are probably interesting ways to optimize bignum division using OpenCL and vector hardware as well - for the case when you know you've got over a dozen digits - but that's what numpy et al. are for. Bignums in Python are a convenience; most values normal code deals with are less than 2**100.

-Greg
-- Mark
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZWGPO3TM...
Code of Conduct: http://python.org/psf/codeofconduct/