[Python-Dev] Optionally using GMP to implement long if available
tim.peters at gmail.com
Mon Nov 10 05:11:23 CET 2008
>> Whenever two digits are multiplied, the code intends to
>> cast (at least) one of them to "twodigits" first (if you
>> ever see a spot where this doesn't happen, it's a bug).
> There are indeed one or two spots that seem to be missing a
> cast, for example the line "rem -= hi*n" in inplace_divrem1.
Definitely a bug! Alas, it's not surprising -- there are no platforms
on which this bug has a visible consequence (because `digit` is
currently type `unsigned short`, C coerces `hi` and `n` to `int`
before multiplying, and on all platforms to date a C int is at least
32 bits, so that the multiply is at least 32x32->32 despite the lack
of a `twodigts` cast).
>> 3. There's no way to know exactly what compilers will do with
>> this short of staring at generated code.
> Yes---I've done the staring for gcc, so I now have precisely *one*
> data point, which is that various flavours of gcc on x86/x86_64
> *are* clever enough to turn
> (uint64_t)my_uint32 * my_other_uint32
> into the appropriate HW instruction.
Nice! Is this a documented feature? "The problem" is that it
probably depends on a combination of gcc version and compilation
flags, and because /knowing/ whether it works requires staring at
generated code, there's probably no sane way to automate detection of
when, if, and under what conditions it breaks. "Serious" packages use
assembler to avoid all this uncertainty.
> Unfortunately I don't have easy access to other compilers or
> platforms right now. :-(
Sorry, neither do I. If you can dream up a way to automate testing
for generation of the desired HW instructions, you could post the test
program here and I bet a lot of people would run it. Maybe even if
you just described what to look for "by eyeball" in the generated
> Am still working on the benchmarking, but I'm definitely seeing
> speedup on gcc/x86---between 10% and 100% depending
> on the operations.
Sure -- it's not surprising that HW crunching more bits at a time is
>> FYI, in a previous life working in speech recognition, under
>> Microsoft's Visual Studio 6 the only way we found to get at the
>> Pentium's 32x32->64 HW ability efficiently was via using inline
> Urk. We definitely don't want to go there. Though I guess this
> is how packages like gmp and GP/Pari manage.
1. I have no idea what versions of Microsoft's compiler
after MSVC 6 do here; perhaps it's no longer an issue
(the Windows Python distro no longer uses MSVC 6).
2. If this is thought to be worth having, then on very
widely used platforms I think a good case /can/ be
made for incorporating some assembler.
3. GMP is "speed at any cost" -- they use assembler
even when it's a stupid idea ;-)
> But maybe it's possible to write portable code (by providing fallbacks)
> that turns out to be efficient on the majority of mainstream systems?
If "it works" under the gcc and Windows compilers du jour on x86
systems, that probably covers over 90% of Python installations. Good
enough -- stop before it gets pointlessly hard ;-)
> The extent of the ifdef'ery in the patch is really rather small: one
> (compound) #ifdef in longintrepr.h for defining digit, twodigits, stwodigits
> etc, and a couple more for the places where digits are read and written
> in marshal.c.
But so far it only works with an unknown subset of gcc versions,
right? These things don't get simpler, alas :-(
>>> I agree that very-long-integer optimizations probably don't really belong in
>> Depends in part on whether Python can attract as many obsessed
>> maintainers and porters for such gonzo algorithms as GMP attracts ;-)
> Well, you can count me in. :)
Last I looked (which was at least 3 years ago), GMP's source code was
bigger than all of Python's combined. For a start, I'll have the PSF
draw up a contract obligating you to lifetime servitude :-)
More information about the Python-Dev