[Gregory P. Smith <greg@krypto.org>]
... That only appears true in default boring -O2 builds. Use `./configure --enable-optimizations` and the C version is *much* faster than your asm one...
250ns for C vs 370ns for your asm divl one using old gcc 9.3 on my zen3 when compiled using --enable-optimizations.
Something is missing here, but I can't guess what without seeing the generated machine code. But I trust Mark will do that.
tested using ` -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//10' `
Use of __asm__ appears to have prevented the compiler from being able to fully optimize that code in PGO/FDO mode.
I trust the compiler toolchain to know what's close to best here.
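For concreteness, here's a rough sketch of the two flavors being compared. This is not the actual code or patch (the real plain-C loop is `inplace_divrem1()` in longobject.c, and the asm experiment uses x86's `divl`); the names and layout below are made up for illustration:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t digit;        /* CPython packs 30 bits per digit */
typedef uint64_t twodigits;
#define PYLONG_SHIFT 30

/* Plain-C flavor: divide a bignum (digits stored least-significant
   first, processed from the top down) by a single digit n, storing the
   quotient digits and returning the remainder.  This mirrors the shape
   of CPython's inplace_divrem1(). */
static digit
div_single_digit(digit *out, const digit *in, ptrdiff_t size, digit n)
{
    twodigits rem = 0;
    for (ptrdiff_t i = size - 1; i >= 0; --i) {
        rem = (rem << PYLONG_SHIFT) | in[i];
        digit q = (digit)(rem / n);   /* the compiler is free to lower  */
        rem -= (twodigits)q * n;      /* this divide+mulsub as it likes */
        out[i] = q;
    }
    return (digit)rem;
}

/* Inline-asm flavor (x86 only; untested sketch): one divl yields the
   quotient and remainder together.  Safe only when hi < n, which holds
   in the loop above because the running remainder is reduced mod n. */
static inline uint32_t
divl_hi_lo(uint32_t hi, uint32_t lo, uint32_t n, uint32_t *rem)
{
    uint32_t q, r;
    __asm__ ("divl %4"
             : "=a" (q), "=d" (r)
             : "a" (lo), "d" (hi), "rm" (n)
             : "cc");
    *rem = r;
    return q;
}
```

The point above is that the plain-C divide-then-multiply-subtract pair is something PGO/FDO can reshape freely, while an `__asm__` block is opaque to the optimizer.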
Mark is extremely experienced in numeric programming, and is the most capable numerical analyst contributing to CPython. That's why he trusts nothing ;-)
... there are probably interesting ways to optimize bignum division using OpenCL and vector hardware as well - for the case when you know you've got over a dozen digits; but that's what numpy et al. are for.
numpy has no support for bigints, beyond allowing the use of PyObject* in numpy arrays (dtype='object'), in which case _all_ the bigint math is performed by the hosting Python implementation. People dead serious about bigint speed use the GMP library, which is a massive amount of code and 100% devoted to peak speed. In Python, gmpy2 supplies Python bindings for the GMP/MPIR, MPFR, and MPC libraries.
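For reference, "dead serious" here means calling GMP itself. A minimal C sketch of what that looks like (assumes libgmp is installed; compile with `-lgmp`; gmpy2 wraps this same library for Python):

```c
#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t x, q;

    mpz_init(x);
    mpz_init(q);
    mpz_ui_pow_ui(x, 10, 1000);   /* x = 10**1000 */
    mpz_fdiv_q_ui(q, x, 10);      /* q = x // 10 (floor division by a word) */
    gmp_printf("%Zd\n", q);       /* prints 10**999 */

    mpz_clear(x);
    mpz_clear(q);
    return 0;
}
```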
Bignums in Python are a convenience. Most values normal code deals with are less than 2**100.
There's very little that "most normal code" uses. Not bigints, not HTML, not regexps, not sockets, on & on & on. Whichever bubbles you live in are nevertheless bubbles ;-)

The speed of + - * // imposes fundamental limits on the speed of almost all bigint algorithms. A reasonably quick win for bigint // would be most welcome. Anyone dead serious about working with bigints in Python "should" install gmpy2. But it's still fun to prototype work in plain old Python.

Note: I added an implementation of Karatsuba multiplication to CPython two decades ago (which gives a major O() improvement over "schoolbook" multiplication). I've often said that was a mistake, which I wouldn't make again, because it added relative mountains of new code to longobject.c, and people who need fast bigint * should _still_ look to GMP anyway (which implements several additional, even more gonzo, * algorithms). But neither do I want to remove it. It works fine, and the speed boost is welcome (in my bubbles, multi-thousand bit ints are common).

What Mark is looking at here has a _very_ much more limited scope: the two (I think) places in longobject.c that are dividing two native-size machine ints in time-critical loops.

"But nobody divides big ints anyway - why would they?" For example, if you're investigating crypto schemes, modular exponentiation with multi-thousand bit ints can be common, and under the covers, CPython uses the same code for both // and %. One modular exponentiation has to do a * and % for each bit position in the exponent, plus more proportional to the number of bits set in the exponent (a sketch follows at the end of this message).

About trusting "the toolchain", here under Windows on a capable-enough desktop box (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 MHz, 4 Core(s), 8 Logical Processor(s)), using the released 3.10.1:

    $ python --version
    Python 3.10.1
    $ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"
    1500000 loops, best of 5: 376 nsec per loop

Which actually makes little sense to me. 10**1000 requires 111 CPython "digits" (it's about 3322 bits, and CPython packs 30 bits per digit), which is the number of times the loop in `inplace_divrem1()` has to go around. Under 4 nsec per iteration seems close to impossibly fast on a 3.8GHz box, given the presence of any division instruction.

However, dividing by 10 is not a worst case on this box. Dividing by 100 is over 3x slower:

    $ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
    1500000 loops, best of 5: 1.25 usec per loop

Dividing by the largest single Python "digit" is apparently much the same:

    $ python -m timeit -n 1500000 -s "x = 10**1000; d=2**30-1" "x//d"
    1500000 loops, best of 5: 1.29 usec per loop

In fact, much the same story for dividing by 1, 2, 3, 4, 5, 6, 7, 8, 9, and 11. Got bored then ;-) They're _all_ much slower than dividing by 10 in this case.

Trust nothing :-)
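To make the "* and % per exponent bit" cost model above concrete, here's a toy square-and-multiply loop in plain C with fixed-width ints (an illustrative sketch only; CPython's three-argument pow() runs the bigint equivalent, where every * and % has to walk all the digits):

```c
#include <stdint.h>

/* Toy modular exponentiation: returns base**exp % mod.
   Keep base and mod below 2**32 so the products fit in 64 bits. */
static uint64_t
powmod(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;          /* handles mod == 1 */
    base %= mod;
    while (exp != 0) {
        if (exp & 1)                    /* extra * and % per set bit */
            result = (result * base) % mod;
        base = (base * base) % mod;     /* one * and % per bit position */
        exp >>= 1;
    }
    return result;
}
```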