On Fri, Jan 14, 2022 at 9:50 AM Mark Dickinson <dickinsm@gmail.com> wrote:
On Sun, Jan 2, 2022 at 10:35 AM Mark Dickinson <dickinsm@gmail.com> wrote:
Division may still be problematic.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division producing a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and a call to the __udivdi3 libgcc helper on 32-bit x86.
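To make the building block concrete, here's a minimal sketch of the C idiom in question (the helper name is mine, not CPython's). Written in portable C, both operands are widened before the division, which is why the compilers reach for the wider DIVQ instruction on x64 instead of DIVL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit-by-32-bit unsigned division, 32-bit
   quotient and remainder.  In portable C the division is done at
   64-bit width, so on x64 GCC and Clang emit the slower 128-by-64
   DIVQ rather than the 64-by-32 DIVL. */
static uint32_t
divrem_64_by_32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    /* The multi-precision loop guarantees the quotient fits in 32
       bits (DIVL would fault otherwise). */
    assert(divisor != 0 && (dividend >> 32) < divisor);
    uint32_t quotient = (uint32_t)(dividend / divisor);
    *remainder = (uint32_t)(dividend % divisor);
    return quotient;
}
```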
I was curious to find out what the potential impact of the failure to use DIVL was, so I ran some timings. A worst-case target is division of a large (multi-digit) integer by a single-digit integer (where "digit" means digit in the sense of PyLong digit, not decimal digit), since that involves multiple CPU division instructions in a fairly tight loop.
Results: on my laptop (2.7 GHz Intel Core i7-8559U, macOS 10.14.6, non-optimised non-debug Python build), a single division of 10**1000 by 10 takes ~1018ns on the current main branch and ~722ns when forced to use the DIVL instruction (by inserting inline assembly into the inplace_divrem1 function). IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
For anyone who wants to play along, here's the implementation of the inplace_divrem1 (in longobject.c) that I was using:
static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
    digit remainder = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend =
            ((twodigits)remainder << PyLong_SHIFT) | pin[size];
        digit quotient, high, low;
        high = (digit)(dividend >> 32);
        low = (digit)dividend;
        __asm__("divl %2\n"
                : "=a" (quotient), "=d" (remainder)
                : "r" (n), "a" (low), "d" (high));
        pout[size] = quotient;
    }
    return remainder;
}
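For comparison, here's a self-contained pure-C sketch of the same loop, modeled on (but not identical to) what longobject.c does today - the typedefs are stand-ins for CPython's internal 30-bit-digit types. This is the version where the compiler sees a 64-by-32 division but emits DIVQ, and the remainder is recovered with a separate multiply rather than read straight out of EDX:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for CPython's internal types (30-bit digits). */
typedef uint32_t digit;
typedef uint64_t twodigits;
#define PyLong_SHIFT 30
#define PyLong_MASK  ((digit)((1U << PyLong_SHIFT) - 1))

/* Divide the multi-digit number pin[0..size-1] (least significant
   digit first) by the single digit n, writing the quotient digits to
   pout and returning the remainder. */
static digit
divrem1_c(digit *pout, const digit *pin, ptrdiff_t size, digit n)
{
    twodigits rem = 0;

    assert(n > 0 && n <= PyLong_MASK);
    while (--size >= 0) {
        twodigits dividend = (rem << PyLong_SHIFT) | pin[size];
        digit quotient = (digit)(dividend / n);
        rem = dividend - (twodigits)quotient * n;
        pout[size] = quotient;
    }
    return (digit)rem;
}
```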
I don't know whether we *really* want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.
That only appears true in default boring -O2 builds. Use `./configure --enable-optimizations` and the C version is *much* faster than your asm one: 250ns for C vs. 370ns for your asm divl one, using old gcc 9.3 on my zen3 when compiled with --enable-optimizations. Tested using:

    -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//10'

Use of __asm__ appears to have prevented the compiler from fully optimizing that code in PGO/FDO mode. I trust the compiler toolchain to know what's closest to best here. Optimizing use of a div instruction isn't entirely straightforward: on many microarchitectures the time required varies with the inputs, as they'll internally loop when values exceed the width their hardware operates on.

There are probably interesting ways to optimize bignum division using OpenCL and vector hardware as well - for the case when you know you've got over a dozen digits - but that's what numpy et al. are for. Bignums in Python are a convenience; most values normal code deals with are less than 2**100.

-Greg
-- Mark
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZWGPO3TM...
Code of Conduct: http://python.org/psf/codeofconduct/