[issue8692] Use divide-and-conquer for faster factorials

Mark Dickinson report at bugs.python.org
Wed May 12 10:50:07 CEST 2010


Mark Dickinson <dickinsm at gmail.com> added the comment:

Some quick comments:

(1) Agree about the extra bound checks:  let's treat those as a separate issue and just concentrate on performance here.

(2) log2 isn't portable:  it's not part of C89 (though it is in C99), and I don't think it exists on Windows.  Which is a shame, since it probably *does* reliably work well for boundary cases on most platforms.  I'm embarrassed by the snippet Alexander found, but it's true that an alternative like log(n)/log(2) has problems in boundary cases, thanks to the usual floating-point issues.  There's a bit-counting method in the int.bit_length() implementation (in Objects/longobject.c) that could possibly be re-used here.  Alternatively, if a simple for-loop to count bits doesn't have any noticeable impact on speed, we could just use that.
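
For illustration, the kind of shift-and-count loop I mean could look something like the sketch below (the name is made up here, not taken from the patch or from longobject.c); for n >= 1, floor(log2(n)) is just the result minus one, with no libm and no rounding worries:

    /* Illustrative sketch only: count the bits of n by shifting.
       For n >= 1, floor(log2(n)) == bit_count(n) - 1.  Avoids both
       C99's log2() and the rounding hazards of log(n)/log(2). */
    static unsigned int
    bit_count(unsigned long n)
    {
        unsigned int result = 0;
        while (n) {
            n >>= 1;
            result++;
        }
        return result;
    }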

(3) Is the 'count set bits' code a bottleneck?  If not, then it looks fine to me as it is.  Doesn't it just get called once per factorial computation?
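
(For reference, the sort of loop I'd expect here is something like the sketch below, which does at most CHAR_BIT * sizeof(unsigned long) iterations and so costs next to nothing if it only runs once per call; the name is illustrative, not necessarily the patch's:)

    /* Illustrative sketch: straightforward popcount loop; the cost is
       trivial when it runs once per factorial computation. */
    static unsigned int
    count_set_bits(unsigned long n)
    {
        unsigned int count = 0;
        while (n) {
            count += n & 1UL;
            n >>= 1;
        }
        return count;
    }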

(4) I wonder whether the recursion in factorial_part_product slows things down;  it might be interesting to compare with an iterative version (but still one that clumps together small pieces rather than doing lots of small*big multiplies).  It seems possible that the cost of the recursive calls is insignificant compared to the cost of the various Py* calls, though.
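
To make the comparison concrete, here's a rough sketch of the kind of iterative version I mean, assuming a helper that multiplies the integers in [start, stop): runs of factors that fit in an unsigned long are accumulated in plain C, and the resulting Python ints are merged pairwise off a small stack so that operands of similar size get multiplied together.  The name and details are my own illustration, not the patch's code:

    #include "Python.h"
    #include <limits.h>        /* ULONG_MAX */

    /* Illustrative sketch (not the patch's code): iterative partial
       product of the integers in [start, stop), with no recursion but
       the same "balanced" shape as the divide-and-conquer version. */
    static PyObject *
    iter_part_product(unsigned long start, unsigned long stop)
    {
        /* popcount(nchunks) bounds the stack depth, so this is plenty */
        PyObject *stack[1 + 8 * sizeof(unsigned long)];
        Py_ssize_t top = 0;
        unsigned long nchunks = 0, i = start, chunk, m;
        PyObject *piece, *prod;

        if (start >= stop)
            return PyLong_FromLong(1L);

        while (i < stop) {
            /* accumulate factors in C while the product still fits */
            chunk = i++;
            while (i < stop && chunk <= ULONG_MAX / i)
                chunk *= i++;

            piece = PyLong_FromUnsignedLong(chunk);
            if (piece == NULL)
                goto error;
            stack[top++] = piece;

            /* binary-counter merging: collapse the top of the stack
               once per trailing zero bit of the chunk count, keeping
               only O(log(nchunks)) partial products alive */
            nchunks++;
            for (m = nchunks; (m & 1UL) == 0 && top >= 2; m >>= 1) {
                prod = PyNumber_Multiply(stack[top - 2], stack[top - 1]);
                if (prod == NULL)
                    goto error;
                Py_DECREF(stack[top - 1]);
                Py_DECREF(stack[top - 2]);
                stack[top - 2] = prod;
                top--;
            }
        }

        /* fold whatever is left on the stack into a single result */
        while (top >= 2) {
            prod = PyNumber_Multiply(stack[top - 2], stack[top - 1]);
            if (prod == NULL)
                goto error;
            Py_DECREF(stack[top - 1]);
            Py_DECREF(stack[top - 2]);
            stack[top - 2] = prod;
            top--;
        }
        return stack[0];

      error:
        while (top > 0)
            Py_DECREF(stack[--top]);
        return NULL;
    }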

(5) Was there a reason for using long rather than unsigned long for the computations?  Using unsigned long would give you an easily computable multiply_cutoff, and save the need for that extra static variable (it could be a macro instead).
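
(For what it's worth, I'm picturing something along these lines, assuming multiply_cutoff is meant to be the largest value whose square still fits; the name and exact meaning are my guess at the patch's intent, not taken from it:)

    /* Guesswork, not the patch: with unsigned long, the cutoff falls
       out of ULONG_MAX at compile time, so no runtime-initialized
       static variable is needed. */
    #include <limits.h>

    #if ULONG_MAX > 4294967295UL
    #  define MULTIPLY_CUTOFF 4294967295UL   /* floor(sqrt(2**64 - 1)) */
    #else
    #  define MULTIPLY_CUTOFF 65535UL        /* floor(sqrt(2**32 - 1)) */
    #endif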

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8692>
_______________________________________

