[SciPy-User] fast small matrix multiplication with cython?
Wes McKinney
wesmckinn at gmail.com
Mon Mar 14 11:12:22 EDT 2011
On Thu, Dec 9, 2010 at 5:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Thu, Dec 9, 2010 at 4:33 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Wed, Dec 8, 2010 at 11:28 PM, <josef.pktd at gmail.com> wrote:
>>>>
>>>> It looks like I don't save too much time with just Python/scipy
>>>> optimizations. Apparently ~75% of the time is spent in l-bfgs-b,
>>>> judging by its user time output and the profiler's CPU time output(?).
>>>> Non-cython versions:
>>>>
>>>> Brief and rough profiling on my laptop for ARMA(2,2) with 1000
>>>> observations. Optimization uses fmin_l_bfgs_b with m = 12 and iprint
>>>> = 0.
>>>
>>> Completely different idea: How costly are the numerical derivatives in l-bfgs-b?
>>> With l-bfgs-b, you should be able to replace the derivatives with the
>>> complex step derivatives that calculate the loglike function value and
>>> the derivatives in one iteration.
>>>
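Josef's complex-step suggestion works because, for a real-analytic f, f(x + ih) = f(x) + ih f'(x) + O(h^2), so Im f(x + ih)/h approximates f'(x) with no subtractive cancellation, which is why the step can be made tiny. A minimal sketch of the idea (the actual approx_fprime_cs mentioned below differs in details):

```python
import numpy as np

def complex_step_grad(x, f, h=1e-20):
    """Complex-step gradient approximation.

    Perturbing each coordinate by i*h and taking the imaginary part
    of f gives the partial derivative without subtractive
    cancellation, so h can be far below machine epsilon.
    """
    grad = np.zeros(len(x))
    for i in range(len(x)):
        xc = x.astype(complex)
        xc[i] += 1j * h
        grad[i] = f(xc).imag / h
    return grad

# example: f(x) = sum(exp(x)) has gradient exp(x)
f = lambda x: np.sum(np.exp(x))
x0 = np.array([0.5, -1.0])
g = complex_step_grad(x0, f)
```

Note this requires f to be implemented with complex-safe operations (no abs(), comparisons on intermediate values, etc.).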
>>
>> I couldn't figure out how to use it without some hacks.
>> fmin_l_bfgs_b calls both f and fprime as (x, *args), but
>> approx_fprime and approx_fprime_cs actually need to be called as
>> approx_fprime(x, func, args=args), which in turn calls
>> func(x, *args). I changed fmin_l_bfgs_b to make the call this way
>> for the gradient, and I get (on a different computer)
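The hack Skipper describes (patching fmin_l_bfgs_b) can be avoided with a small closure that adapts the two calling conventions. A sketch, using a hypothetical objective in place of the ARMA loglike:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b, approx_fprime

# hypothetical objective standing in for the ARMA loglikelihood
def objective(params, scale):
    return scale * np.sum((params - 1.0) ** 2)

args = (2.0,)

# adapter: fmin_l_bfgs_b calls fprime(x, *args), while scipy's
# approx_fprime expects approx_fprime(x, func, epsilon, *args)
def grad(x, *inner):
    return approx_fprime(x, objective, 1e-8, *inner)

xopt, fval, info = fmin_l_bfgs_b(objective, np.zeros(2),
                                 fprime=grad, args=args)
```

The same wrapper shape works for a complex-step gradient function, since only the adapter body changes.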
>>
>>
>> Using approx_fprime_cs
>> -----------------------------------
>> 861609 function calls (861525 primitive calls) in 3.337 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls tottime percall cumtime percall filename:lineno(function)
>> 70 1.942 0.028 3.213 0.046 kalmanf.py:504(loglike)
>> 840296 1.229 0.000 1.229 0.000 {numpy.core._dotblas.dot}
>> 56 0.038 0.001 0.038 0.001 {numpy.linalg.lapack_lite.zgesv}
>> 270 0.025 0.000 0.025 0.000 {sum}
>> 90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
>> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
>> 162 0.012 0.000 0.014 0.000 arima.py:117(_transparams)
>>
>>
>> Using approx_grad = True
>> ---------------------------------------
>> 1097454 function calls (1097370 primitive calls) in 3.615 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls tottime percall cumtime percall filename:lineno(function)
>> 90 2.316 0.026 3.489 0.039 kalmanf.py:504(loglike)
>> 1073757 1.164 0.000 1.164 0.000 {numpy.core._dotblas.dot}
>> 270 0.025 0.000 0.025 0.000 {sum}
>> 90 0.020 0.000 0.020 0.000 {numpy.linalg.lapack_lite.dgesdd}
>> 182 0.014 0.000 0.016 0.000 arima.py:117(_transparams)
>> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
>> 46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
>> 23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
>>
>>
>> Definitely fewer function calls and a little faster, but I had to
>> write some hacks to get it to work.
>>
>
> This is more like it! With fast recursions in Cython:
>
> 15186 function calls (15102 primitive calls) in 0.750 CPU seconds
>
> Ordered by: internal time
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 18 0.622 0.035 0.625 0.035 kalman_loglike.pyx:15(kalman_loglike)
> 270 0.024 0.000 0.024 0.000 {sum}
> 90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
> 156 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
> 110 0.008 0.000 0.010 0.000 arima.py:118(_transparams)
> 46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
> 23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
> 26 0.004 0.000 0.004 0.000 tsatools.py:109(lagmat)
> 90 0.004 0.000 0.042 0.000 arima.py:197(loglike_css)
> 81 0.004 0.000 0.004 0.000 {numpy.core.multiarray._fastCopyAndTranspose}
>
> I can live with this for now.
>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
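For context on why the Cython version above pays off so well: a Kalman-filter loglikelihood is a per-observation recursion of tiny matrix products, so numpy.dot call overhead dominates the arithmetic. A hypothetical pure-Python sketch of such a recursion (not the statsmodels code, which differs in details):

```python
import numpy as np

def kalman_loglike(y, T, Z, RQR, a0, P0):
    """Gaussian loglikelihood of a univariate series via the Kalman
    filter. Each step is a handful of small matrix products, which is
    exactly the pattern where per-call dot overhead swamps the math.
    """
    a, P = a0.copy(), P0.copy()
    ll = 0.0
    for yt in y:
        v = yt - np.dot(Z, a)[0]                 # innovation
        F = np.dot(Z, np.dot(P, Z.T))[0, 0]      # innovation variance
        K = np.dot(T, np.dot(P, Z.T)) / F        # Kalman gain
        ll -= 0.5 * (np.log(2 * np.pi * F) + v ** 2 / F)
        a = np.dot(T, a) + K[:, 0] * v           # state prediction
        P = np.dot(T, np.dot(P, T.T)) - F * np.dot(K, K.T) + RQR
    return ll

# sanity check: an AR(1) with no measurement noise, where the exact
# loglikelihood is available in closed form
phi, q = 0.5, 1.0
s0 = q / (1 - phi ** 2)            # stationary variance
y = np.array([0.3, -0.1, 0.5])
ll = kalman_loglike(y, np.array([[phi]]), np.array([[1.0]]),
                    np.array([[q]]), np.zeros(1), np.array([[s0]]))
```

A Cython version of the same loop replaces each dot with a direct BLAS call (or unrolled scalar code), removing the Python-level dispatch entirely.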
Revisiting this topic from a few months ago. I was able to get Tokyo
(see github.com/tokyo/tokyo and
http://www.vetta.org/2009/09/tokyo-a-cython-blas-wrapper-for-fast-matrix-math/)
to build against the BLAS in EPD 7.0 (ATLAS? or MKL?) with some
modifications to the setup.py; see my fork on github:
https://github.com/wesm/tokyo/blob/master/setup.py
Someone who knows a bit more about linking against multiple versions
of BLAS/ATLAS/MKL might be able to make it work more generally; I
basically just poked around numpy/core/setup.py.
The speedups for small matrices suggest it would probably be
worthwhile to have in our toolbelt:
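(Tokyo wraps BLAS at the Cython level; a rough feel for the same effect is available from pure Python via scipy.linalg.blas, which exposes the raw routines without numpy.dot's dispatch and argument-checking overhead. A small sketch, not Tokyo itself:)

```python
import numpy as np
from scipy.linalg import blas

# Fortran-ordered inputs let dgemm run without copies; for tiny
# matrices the savings come from skipping Python-level overhead,
# not from the arithmetic itself
a = np.asfortranarray(np.random.rand(4, 4))
b = np.asfortranarray(np.random.rand(4, 4))
c = blas.dgemm(1.0, a, b)   # c = 1.0 * a @ b
```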
$ python double_speed.py
<snip>
SPEED TEST BLAS 2
Double precision: Vector size = 4 Matrix size = 4x4
numpy.dot +: 312 kc/s
dgemv: 2857 kc/s 9.1x
dgemv3: 9091 kc/s 29.1x
dgemv5: 9375 kc/s 30.0x
dgemv6: 9375 kc/s 30.0x
dgemv_: 11321 kc/s 36.2x
numpy.outer: 118 kc/s
dger: 2344 kc/s 19.8x
dger3: 7895 kc/s 66.7x
dger4: 8108 kc/s 68.5x
dger_: 9449 kc/s 79.8x
Double precision: Vector size = 15 Matrix size = 15x15
numpy.dot +: 296 kc/s
dgemv: 2000 kc/s 6.8x
dgemv3: 4444 kc/s 15.0x
dgemv5: 5000 kc/s 16.9x
dgemv6: 4615 kc/s 15.6x
dgemv_: 5217 kc/s 17.6x
numpy.outer: 89 kc/s
dger: 1143 kc/s 12.9x
dger3: 2330 kc/s 26.2x
dger4: 2667 kc/s 30.0x
dger_: 2824 kc/s 31.8x
Double precision: Vector size = 30 Matrix size = 30x30
numpy.dot +: 261 kc/s
dgemv: 1271 kc/s 4.9x
dgemv3: 2676 kc/s 10.3x
dgemv5: 2311 kc/s 8.9x
dgemv6: 2676 kc/s 10.3x
dgemv_: 2421 kc/s 9.3x
numpy.outer: 64 kc/s
dger: 782 kc/s 12.2x
dger3: 1412 kc/s 22.1x
dger4: 1182 kc/s 18.5x
dger_: 1356 kc/s 21.2x
SPEED TEST BLAS 3
Double precision: Vector size = 4 Matrix size = 4x4
numpy.dot: 845 kc/s
dgemm: 2259 kc/s 2.7x
dgemm3: 4808 kc/s 5.7x
dgemm5: 4934 kc/s 5.8x
dgemm7: 4808 kc/s 5.7x
dgemm_: 5357 kc/s 6.3x
Double precision: Vector size = 15 Matrix size = 15x15
numpy.dot: 290 kc/s
dgemm: 476 kc/s 1.6x
dgemm3: 580 kc/s 2.0x
dgemm5: 606 kc/s 2.1x
dgemm7: 580 kc/s 2.0x
dgemm_: 606 kc/s 2.1x
Double precision: Vector size = 30 Matrix size = 30x30
numpy.dot: 108 kc/s
dgemm: 128 kc/s 1.2x
dgemm3: 145 kc/s 1.3x
dgemm5: 139 kc/s 1.3x
dgemm7: 145 kc/s 1.3x
dgemm_: 145 kc/s 1.3x