How fast can we multiply?

Tim Peters tim_one at email.msn.com
Sun Jul 18 20:57:16 EDT 1999


[Christian Tismer, hiding from continuations <wink>]
> ...
> Version 0.2 is now 4 times faster at 100000 bits.

Factor of 5.7 on my machine.

> ...
> fastlong.py
> ...
> It is a well known fact that by some factorization, integer
> multiplication's complexity can be brought from O(n**2) down to
> O(n**1.5).

The exponent is actually log2(3) ~= 1.58:  Karatsuba trades one n-bit
multiply for three n/2-bit multiplies plus O(n) additions, so
T(n) = 3*T(n/2) + O(n), which solves to O(n**log2(3)).

> The method is sometimes called Karatsuba algorithm.
> ...
> I seem to remember that the GMP library uses this algorithm,
> but I didn't look it up and tried my own straightforward way.

Right, GMP does do this.  There's a much faster method (roughly
O(n*log(n))) based on Fourier transforms, but it's much harder to code,
and I've only seen it used in David Bailey's MPFUN package (a highly
optimized Fortran package for arbitrary-but-fixed precision float
arithmetic + the std elementary functions).
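
For the curious, here's a toy Python sketch of the Fourier idea -- not
MPFUN's method, just an illustration, and float roundoff limits it to
modest operand sizes:

import cmath

def _fft(a, sign):
    # Recursive radix-2 Cooley-Tukey; len(a) must be a power of 2.
    # sign is -1 for the forward transform, +1 for the (unnormalized)
    # inverse.
    n = len(a)
    if n == 1:
        return list(a)
    even = _fft(a[0::2], sign)
    odd = _fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_mul(x, y):
    # Multiply nonnegative ints by convolving their base-256 digits.
    xd, yd = [], []
    while x:
        xd.append(x & 0xFF)
        x = x >> 8
    while y:
        yd.append(y & 0xFF)
        y = y >> 8
    size = 1
    while size < len(xd) + len(yd):
        size = size * 2
    fx = _fft([complex(d) for d in xd] + [0j] * (size - len(xd)), -1)
    fy = _fft([complex(d) for d in yd] + [0j] * (size - len(yd)), -1)
    coef = _fft([a * b for a, b in zip(fx, fy)], +1)
    result = 0
    for i in range(size - 1, -1, -1):
        # dividing by size completes the inverse transform; Horner
        # evaluation in base 256 absorbs inter-digit carries for free
        result = (result << 8) + int(round(coef[i].real / size))
    return result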

> The reason why I don't aim to modify the C implementation any
> longer is: It will not be reasonably faster in C. The real
> workhorse is still the basic internal long multiplication
> (sometimes called "highschool algorithm") which is fast for small
> long integers. The only effect of a C port would be to move the
> break_even point to something smaller.

Tee hee.  I wonder how much of Python we could speed up by throwing out C
code <wink>?  Integer exponentiation can also be done faster in Python, and
much faster with a faster multiply to build on.  Seem to recall that
repr(dict) is/was quadratic-time in len(dict), while easy to make
linear-time if coded in Python.  C is so clumsy you just settle for the
simplest thing to *spell*.
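
For instance, a minimal square-and-multiply sketch (the 'mul' parameter
is my own generalization -- it's not anything in fastlong.py):

def fast_pow(x, e, mul=None):
    # Right-to-left binary exponentiation: O(log(e)) multiplies,
    # every one of which can be routed through a faster multiply
    # such as Karatsuba.
    if mul is None:
        mul = lambda a, b: a * b
    if e < 0:
        raise ValueError("e must be >= 0")
    result = 1
    while e:
        if e & 1:
            result = mul(result, x)
        x = mul(x, x)
        e = e >> 1
    return result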

> ...
> A function to obtain the bit length of a long would
> be most helpful.

The lack of this has gotten unbearable, eh?

> Some kind of split function as well.

That one is much weaker; unlike bit length, the obvious ways to split and
join already have the right O() behavior.
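
That is, the obvious split is just a shift and a mask (a guess at what
fastlong.py's split_num does; joining is the mirror image):

def split_num(n, shift):
    # n == (hi << shift) | lo, with 0 <= lo < 2**shift
    return n >> shift, n & ((1 << shift) - 1)

def join_num(hi, lo, shift):
    # inverse of split_num
    return (hi << shift) | lo

Both directions take time linear in the bit length, which is all
Karatsuba needs.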

> ...
> (1) Aho/Hopcroft/Ullmann:
>     "The Design and Analysis of Computer Algorithms"

Knuth Vol 2 is a better reference for fast integer mult, although it goes
off the deep end on theoretical refinements of no practical value ("what if
the number of bits is bigger than the number of electrons in the
universe?").

> ...
>     def __mul__(self, other):
>         """Karatsuba alogrithm
>         (a+bX)(c+dX)=(X^2+X)(bd)+(X)(a-b)(d-c)+(X+1)(ac)
>         """
>         val1 = self.val
>         val2 = fastlong(other).val
>         if not val1 and val2:
>             return 0

The test only succeeds when val1 == 0 and val2 != 0 -- not what you
intended.  OTOH, it's debatable whether programs multiply by 0 often enough
to pay back the cost of testing for it!
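
If the test is kept at all, the intended spelling is presumably

        if not (val1 and val2):
            return 0

i.e., true iff either operand is zero.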

>         bitcount = max(nbits(val1), nbits(val2))

min is a better choice here; see below.

>         if bitcount < self.break_even:
>             return val1 * val2
>         shift = bitcount / 2
>         (hi1, lo1) = split_num(val1, shift)
>         (hi2, lo2) = split_num(val2, shift)

Need a fancier strategy when nbits(val1) and nbits(val2) are far apart.
This strategy takes strong advantage of the special-case-for-0 test above
<wink>, but can easily run 100s of times slower than the builtin mult; e.g.,
change your test case to lambda x:x*5.

This is all the more reason to write in Python, though!  Much clumsier to
handle imbalance in C.  In the meantime, using "min" instead of "max" to
compute bitcount greatly speeds the worst cases, and doesn't hurt the best
ones.
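
A sketch of one such strategy (mine, not fastlong.py's), using the
nbits() below:  slice the wide operand into chunks about as wide as the
narrow one, so every sub-multiply is balanced:

def unbalanced_mul(big, small, balanced_mul):
    # Assumes nbits(big) >= nbits(small) > 0, and that balanced_mul
    # (e.g. Karatsuba) is good when its operands are close in size.
    chunkbits = nbits(small)
    mask = (1 << chunkbits) - 1
    result = 0
    shift = 0
    while big:
        result = result + (balanced_mul(big & mask, small) << shift)
        big = big >> chunkbits
        shift = shift + chunkbits
    return result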

> ...
> def nbits(x):
>     """not clean, rather fast, inaccurate"""
>     s=marshal.dumps(x)
>     return abs(struct.unpack("i", s[1:5])[0])*15 -7

I've been using the following since 1.5.2 came out.  Cleaner and accurate.
The one above is substantially faster, but it's now a small-factor thing
rather than an O() difference:

def nbits(n, hex=hex, long=long, len=len,
             d2bits={'0':0,
                     '1':1,
                     '2':2, '3':2,
                     '4':3, '5':3, '6':3, '7':3,
                     '8':4, '9':4, 'A':4, 'B':4,
                     'C':4, 'D':4, 'E':4, 'F':4}):
    """nbits(n): return number of bits in n.

    n must be >= 0; 0->0, 1->1, 2->2, 3->2, 4->3, 5->3, ...

    In 1.5.1 and before this takes time quadratic in the number
    of bits.  In 1.5.2, linear.  "Should be" constant time, but
    that requires getting at the internal representation.
    """

    if n < 0:
        raise ValueError("nbits arg must be >= 0")
    # hex representation is ashex[2:-1] (leading 0x and trailing L).
    ashex = hex(long(n))
    num_hex_digits = len(ashex) - 3
    return (num_hex_digits - 1)*4 + d2bits[ashex[2]]
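
E.g., nbits(255) == 8 and nbits(256) == 9, as you'd hope.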

> ...
> def timing(func, args, n=1, **keywords) :
> 	import time
> 	time=time.time

Not time.clock?
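
I.e., on Windows time.clock is the finer-grained timer, so the harness
reads better as (my reconstruction, 1.5.2-era spelling):

def timing(func, args, n=1, **keywords):
    import time
    clock = time.clock   # much better resolution than time.time on Windows
    t0 = clock()
    for i in xrange(n):
        apply(func, args, keywords)
    return clock() - t0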

> ...
> # the end ----------------------------------------------------

looks-more-like-the-beginning-to-me<wink>-ly y'rs  - tim