[SciPy-User] fast small matrix multiplication with cython?

Wes McKinney wesmckinn at gmail.com
Mon Mar 14 11:13:27 EDT 2011


On Mon, Mar 14, 2011 at 11:12 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Thu, Dec 9, 2010 at 5:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Thu, Dec 9, 2010 at 4:33 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>> On Wed, Dec 8, 2010 at 11:28 PM,  <josef.pktd at gmail.com> wrote:
>>>>>
>>>>> It looks like I don't save too much time with just Python/scipy
>>>>> optimizations.  Apparently ~75% of the time is spent in l-bfgs-b,
>>>>> judging by its user time output and the profiler's CPU time output(?).
>>>>>  Non-cython versions:
>>>>>
>>>>> Brief and rough profiling on my laptop for ARMA(2,2) with 1000
>>>>> observations.  Optimization uses fmin_l_bfgs_b with m = 12 and iprint
>>>>> = 0.
>>>>
>>>> Completely different idea: How costly are the numerical derivatives in l-bfgs-b?
>>>> With l-bfgs-b, you should be able to replace the numerical derivatives with
>>>> complex-step derivatives, which give the loglike function value and the
>>>> derivatives in a single evaluation.
>>>>
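
For reference, a minimal sketch of the complex-step idea being suggested here:
a single complex evaluation f(x + ih) gives both the function value (its real
part) and the derivative (its imaginary part divided by h), with no subtractive
cancellation.  Illustrative only, not any particular implementation:

    import numpy as np

    def value_and_deriv(f, x, h=1.0e-20):
        # complex step: Re -> f(x), Im/h -> f'(x), from one evaluation
        fx = f(x + 1j * h)
        return fx.real, fx.imag / h

    # example: d/dx log(x) at x = 2.0 gives (log(2), 0.5)
    print(value_and_deriv(np.log, 2.0))
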
>>>
>>> I couldn't figure out how to use it without some hacks.  fmin_l_bfgs_b
>>> calls both f and fprime as (x, *args), but approx_fprime and
>>> approx_fprime_cs actually need to be called as approx_fprime(x, func,
>>> args=args), which in turn calls func(x, *args).  I changed fmin_l_bfgs_b
>>> to make the call this way for the gradient, and I get (on a different computer):
>>>
>>>
>>> Using approx_fprime_cs
>>> -----------------------------------
>>>         861609 function calls (861525 primitive calls) in 3.337 CPU seconds
>>>
>>>   Ordered by: internal time
>>>
>>>   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>       70    1.942    0.028    3.213    0.046 kalmanf.py:504(loglike)
>>>   840296    1.229    0.000    1.229    0.000 {numpy.core._dotblas.dot}
>>>       56    0.038    0.001    0.038    0.001 {numpy.linalg.lapack_lite.zgesv}
>>>      270    0.025    0.000    0.025    0.000 {sum}
>>>       90    0.019    0.000    0.019    0.000 {numpy.linalg.lapack_lite.dgesdd}
>>>       46    0.013    0.000    0.014    0.000 function_base.py:494(asarray_chkfinite)
>>>      162    0.012    0.000    0.014    0.000 arima.py:117(_transparams)
>>>
>>>
>>> Using approx_grad = True
>>> ---------------------------------------
>>>         1097454 function calls (1097370 primitive calls) in 3.615 CPU seconds
>>>
>>>   Ordered by: internal time
>>>
>>>   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>       90    2.316    0.026    3.489    0.039 kalmanf.py:504(loglike)
>>>  1073757    1.164    0.000    1.164    0.000 {numpy.core._dotblas.dot}
>>>      270    0.025    0.000    0.025    0.000 {sum}
>>>       90    0.020    0.000    0.020    0.000 {numpy.linalg.lapack_lite.dgesdd}
>>>      182    0.014    0.000    0.016    0.000 arima.py:117(_transparams)
>>>       46    0.013    0.000    0.014    0.000 function_base.py:494(asarray_chkfinite)
>>>       46    0.008    0.000    0.023    0.000 decomp_svd.py:12(svd)
>>>       23    0.004    0.000    0.004    0.000 {method 'var' of 'numpy.ndarray' objects}
>>>
>>>
>>> Definitely fewer function calls and a little faster, but I had to write
>>> some hacks to get it to work.
>>>
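
One way around those hacks, sketched below, is to leave fmin_l_bfgs_b alone and
hand it a gradient function that already has the (x, *args) signature it
expects.  The approx_fprime_cs here is only a compact stand-in for the helper
being discussed, not the statsmodels implementation, and it assumes the
objective is written so that it accepts complex parameter values:

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b

    def approx_fprime_cs(x, func, args=(), h=1.0e-20):
        # complex-step gradient: one complex evaluation per parameter
        grad = np.empty(len(x))
        for i in range(len(x)):
            xc = np.array(x, dtype=complex)
            xc[i] += 1j * h
            grad[i] = func(xc, *args).imag / h
        return grad

    def fit(neg_loglike, start_params, args=()):
        # adapter: fprime gets the same (x, *args) calling convention
        # that fmin_l_bfgs_b uses for the objective itself
        def score(params, *fargs):
            return approx_fprime_cs(params, neg_loglike, args=fargs)
        return fmin_l_bfgs_b(neg_loglike, start_params, fprime=score,
                             args=args, m=12, iprint=0)
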
>>
>> This is more like it!  With fast recursions in Cython:
>>
>>         15186 function calls (15102 primitive calls) in 0.750 CPU seconds
>>
>>   Ordered by: internal time
>>
>>   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>       18    0.622    0.035    0.625    0.035 kalman_loglike.pyx:15(kalman_loglike)
>>      270    0.024    0.000    0.024    0.000 {sum}
>>       90    0.019    0.000    0.019    0.000 {numpy.linalg.lapack_lite.dgesdd}
>>      156    0.013    0.000    0.013    0.000 {numpy.core._dotblas.dot}
>>       46    0.013    0.000    0.014    0.000 function_base.py:494(asarray_chkfinite)
>>      110    0.008    0.000    0.010    0.000 arima.py:118(_transparams)
>>       46    0.008    0.000    0.023    0.000 decomp_svd.py:12(svd)
>>       23    0.004    0.000    0.004    0.000 {method 'var' of 'numpy.ndarray' objects}
>>       26    0.004    0.000    0.004    0.000 tsatools.py:109(lagmat)
>>       90    0.004    0.000    0.042    0.000 arima.py:197(loglike_css)
>>       81    0.004    0.000    0.004    0.000 {numpy.core.multiarray._fastCopyAndTranspose}
>>
>> I can live with this for now.
>>
>> Skipper
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
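
To make the "fast recursions in Cython" step above concrete: the loglikelihood
is a time loop of tiny matrix products, which is exactly where numpy.dot's
per-call overhead dominates.  Below is a rough pure-NumPy sketch of such a
recursion for a univariate series in state-space form (state dimension k); it
is not Skipper's actual kalman_loglike.pyx.  The Cython version types the
arrays and replaces these small dot calls with explicit loops:

    import numpy as np

    def kalman_loglike(y, z, T, R, Q, h, a0, P0):
        # y: (nobs,) series; z: (k,) observation row; T: (k, k) transition;
        # R: (k, r) selection; Q: (r, r) state cov; h: measurement variance
        a, P = a0.copy(), P0.copy()
        RQR = np.dot(R, np.dot(Q, R.T))
        llf = 0.0
        for t in range(len(y)):
            v = y[t] - np.dot(z, a)              # innovation
            F = np.dot(z, np.dot(P, z)) + h      # innovation variance
            K = np.dot(T, np.dot(P, z)) / F      # Kalman gain
            llf += -0.5 * (np.log(2 * np.pi * F) + v ** 2 / F)
            a = np.dot(T, a) + K * v
            P = np.dot(T, np.dot(P, T.T)) + RQR - F * np.outer(K, K)
        return llf
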
>
> Revisiting this topic from a few months ago. I was able to get Tokyo
> (see github.com/tokyo/tokyo and
> http://www.vetta.org/2009/09/tokyo-a-cython-blas-wrapper-for-fast-matrix-math/)
> to build against ATLAS or MKL (I'm not sure which) in EPD 7.0 with some
> modifications to the setup.py; see my fork on GitHub:
>
> https://github.com/wesm/tokyo/blob/master/setup.py
>
> Someone who knows a bit more about linking against multiple versions of
> BLAS/ATLAS/MKL might be able to make it work more generally -- I basically
> just poked around numpy/core/setup.py.
>
> The speedups for small matrices (kc/s below is thousands of calls per
> second) suggest it would probably be worthwhile to have in our toolbelt; a
> sketch of the underlying technique follows the benchmark output:
>
> $ python double_speed.py
>
> <snip>
>
> SPEED TEST BLAS 2
>
> Double precision: Vector size = 4  Matrix size = 4x4
>
> numpy.dot +:        312 kc/s
> dgemv:             2857 kc/s   9.1x
> dgemv3:            9091 kc/s  29.1x
> dgemv5:            9375 kc/s  30.0x
> dgemv6:            9375 kc/s  30.0x
> dgemv_:           11321 kc/s  36.2x
>
> numpy.outer:        118 kc/s
> dger:              2344 kc/s  19.8x
> dger3:             7895 kc/s  66.7x
> dger4:             8108 kc/s  68.5x
> dger_:             9449 kc/s  79.8x
>
> Double precision: Vector size = 15  Matrix size = 15x15
>
> numpy.dot +:        296 kc/s
> dgemv:             2000 kc/s   6.8x
> dgemv3:            4444 kc/s  15.0x
> dgemv5:            5000 kc/s  16.9x
> dgemv6:            4615 kc/s  15.6x
> dgemv_:            5217 kc/s  17.6x
>
> numpy.outer:         89 kc/s
> dger:              1143 kc/s  12.9x
> dger3:             2330 kc/s  26.2x
> dger4:             2667 kc/s  30.0x
> dger_:             2824 kc/s  31.8x
>
> Double precision: Vector size = 30  Matrix size = 30x30
>
> numpy.dot +:        261 kc/s
> dgemv:             1271 kc/s   4.9x
> dgemv3:            2676 kc/s  10.3x
> dgemv5:            2311 kc/s   8.9x
> dgemv6:            2676 kc/s  10.3x
> dgemv_:            2421 kc/s   9.3x
>
> numpy.outer:         64 kc/s
> dger:               782 kc/s  12.2x
> dger3:             1412 kc/s  22.1x
> dger4:             1182 kc/s  18.5x
> dger_:             1356 kc/s  21.2x
>
>
> SPEED TEST BLAS 3
>
> Double precision: Vector size = 4  Matrix size = 4x4
>
> numpy.dot:        845 kc/s
> dgemm:           2259 kc/s   2.7x
> dgemm3:          4808 kc/s   5.7x
> dgemm5:          4934 kc/s   5.8x
> dgemm7:          4808 kc/s   5.7x
> dgemm_:          5357 kc/s   6.3x
>
> Double precision: Vector size = 15  Matrix size = 15x15
>
> numpy.dot:        290 kc/s
> dgemm:            476 kc/s   1.6x
> dgemm3:           580 kc/s   2.0x
> dgemm5:           606 kc/s   2.1x
> dgemm7:           580 kc/s   2.0x
> dgemm_:           606 kc/s   2.1x
>
> Double precision: Vector size = 30  Matrix size = 30x30
>
> numpy.dot:        108 kc/s
> dgemm:            128 kc/s   1.2x
> dgemm3:           145 kc/s   1.3x
> dgemm5:           139 kc/s   1.3x
> dgemm7:           145 kc/s   1.3x
> dgemm_:           145 kc/s   1.3x
>
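
For anyone curious what this looks like on the Cython side, here is a sketch of
calling dgemm directly, which is the technique Tokyo wraps, so that small
products skip numpy.dot's Python-level overhead.  This is not Tokyo's API; it
uses the scipy.linalg.cython_blas interface that later SciPy versions expose:

    # small_dot.pyx -- illustrative sketch only
    from scipy.linalg.cython_blas cimport dgemm

    def small_dot(double[:, ::1] a, double[:, ::1] b, double[:, ::1] c):
        """c[:, :] = a.dot(b) for C-contiguous float64 arrays."""
        cdef int m = a.shape[0], k = a.shape[1], n = b.shape[1]
        cdef double alpha = 1.0, beta = 0.0
        cdef char trans = b'N'
        # BLAS is column-major; for row-major inputs compute c^T = b^T a^T,
        # which leaves the row-major product a.dot(b) in c
        dgemm(&trans, &trans, &n, &m, &k, &alpha, &b[0, 0], &n,
              &a[0, 0], &k, &beta, &c[0, 0], &n)
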

I should add that it worked on both OS X and Ubuntu -- I have not tested
on Windows (yet).


