]Mark Dickinson <dickinsm@gmail.com>]
Division may still be problematic.
Heh. I'm half convinced that heavy duty bigint packages are so often written in assembler because their authors are driven insane by trying to trick C compilers into generating "the obvious" machine instructions needed. An alternative to HW division is to multiply by a scaled integer reciprocal instead, but it's very delicate to get all cases right. GMP heavyweights Niels Moller and Torbjorn Granlund wrote up the most efficient I've seen of this ilk in their paper "Improved division by invariant integers". It requires a single x single -> double multiply, and another but only the low bits from that one, to get the right quotient and remainder. Ironically, it would run faster if CPython used all 32 bits of its internal digits - some operations in their algorithm are expected to overflow at times (including + and -!), andi it's crucial that they be done modulo the base in use. That happens "by magic" if the base matches the unsigned type's width. They only need a single-width scaled reciprocal, which can be computed with a HW double / single -> single divide if available, or via excruciating division-free scaled integer Newton refinement. Of course the cost of computing the reciprocal once pays off each time the same denominator is used. Curiously, """ It is curious that on these processors, the combination of our reciprocal algorithm (Alg. 2) and division algorithm (Alg. 4) is significantly faster than the built in assembler instruction for 2/1 division. """ HW division may have gotten faster since then, though.
On that note: Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder).
Which is an instance of what they mean by "2/1 division" above.
... IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder directly from the DIV instruction instead of computing it separately, gives a 41% speedup in this particular worst case. I'd expect the effect to be even more marked on x86, but haven't yet done those timings.
... I don't know whether we really want to open the door to using inline assembly for performance reasons in longobject.c, but it's interesting to see the effect.
Indeed it is! But changing the problem to use multiplication instead may be both faster and more portable, albeit at the cost of subtler code.