[pypy-dev] Optimize constant float division by multiplication?

Wed Nov 5 19:07:54 CET 2014

Hello,

I discovered that PyPy's JIT generates "DIVSD" instructions on xmm
registers when dividing a float by a constant C. This consumes an order
of magnitude more CPU cycles than the corresponding "MULSD" instruction
with a precomputed 1/C.

I know that in binary floating point only powers of two have an exactly
representable reciprocal, but there might be a benefit in trading the
least significant bit of the result for a noticeable speedup.
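
To make the accuracy trade-off concrete, here is a minimal sketch that
counts how often x / C and x * (1/C) produce different doubles. The
helper name count_mismatches, the sample size and the chosen constants
are assumptions for illustration only; this is plain Python, not
anything PyPy's JIT does today.

```python
import random


def count_mismatches(values, c):
    """Count inputs where x / c and x * (1.0 / c) give different doubles."""
    reciprocal = 1.0 / c  # computed once, as a JIT could for a constant divisor
    return sum(1 for x in values if x / c != x * reciprocal)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sample = [rng.random() for _ in range(100000)]
    # 0.5 and 4.0 are powers of two; 3.0 and 10.0 are not.
    for c in (0.5, 4.0, 3.0, 10.0):
        print(c, count_mismatches(sample, c))
```

For power-of-two constants the count should be zero on such data, since
both operations only change the exponent. For other constants some
inputs differ in the last bit, for example 3.0 / 10.0 versus 3.0 * 0.1.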

So, is this a missed optimization (at least for reasonably accurate
cases), an existing or possible future option (like -ffast-math in gcc),
or are there further reasons against it?

Thanks,

Toni

--- PS: Small Example ---

This function takes on average 0.41 seconds to compute on an
array.array('d') with 10**8 elements between 0 and 1:

    def spikes_div(data, threshold=1.99):
        count = 0
        for i in data:
            if i / 0.5 > threshold:
                count += 1
        return count

Rewritten with a multiplication it takes about 0.29 seconds on average,
a speedup of roughly 1.4x:

        ...
            if i * 2.0 > threshold:
        ...

The traces contain the same instructions (except for the MULSD/DIVSD)
and run the same number of times. I'm working with a fresh translation
of the current PyPy default on Ubuntu 14.04 x64 with a 2nd generation
Core i7 CPU.
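
A harness along the following lines could be used to reproduce the
comparison. This is only a sketch: the names spikes_mul, make_data and
average_seconds, the random input data, and the much smaller element
count are assumptions for illustration, not the script used for the
timings above.

```python
import array
import random
import timeit


def spikes_div(data, threshold=1.99):
    count = 0
    for i in data:
        if i / 0.5 > threshold:
            count += 1
    return count


def spikes_mul(data, threshold=1.99):
    # Same loop, with the division by 0.5 replaced by a multiplication by 2.0.
    count = 0
    for i in data:
        if i * 2.0 > threshold:
            count += 1
    return count


def make_data(n, seed=0):
    """Build an array.array('d') of n pseudo-random values in [0, 1)."""
    rng = random.Random(seed)
    return array.array('d', (rng.random() for _ in range(n)))


def average_seconds(func, data, repeats=5):
    """Average wall-clock time of func(data) over the given number of runs."""
    return timeit.timeit(lambda: func(data), number=repeats) / repeats


if __name__ == "__main__":
    data = make_data(10**5)  # the timings quoted above used 10**8 elements
    assert spikes_div(data) == spikes_mul(data)  # 0.5 is a power of two
    for func in (spikes_div, spikes_mul):
        print(func.__name__, average_seconds(func, data))
```

Because 0.5 is a power of two, both variants return the same count, so
any timing difference comes from the instruction used rather than from
different branch behaviour.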