[SciPy-User] fast small matrix multiplication with cython?
Wes McKinney
wesmckinn at gmail.com
Mon Mar 14 11:12:22 EDT 2011
On Thu, Dec 9, 2010 at 5:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Thu, Dec 9, 2010 at 4:33 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Wed, Dec 8, 2010 at 11:28 PM, <josef.pktd at gmail.com> wrote:
>>>>
>>>> It looks like I don't save too much time with just Python/scipy
>>>> optimizations. Apparently ~75% of the time is spent in l-bfgs-b,
>>>> judging by its user time output and the profiler's CPU time output(?).
>>>> Non-cython versions:
>>>>
>>>> Brief and rough profiling on my laptop for ARMA(2,2) with 1000
>>>> observations. Optimization uses fmin_l_bfgs_b with m = 12 and iprint
>>>> = 0.
>>>
>>> Completely different idea: How costly are the numerical derivatives in l-bfgs-b?
>>> With l-bfgs-b, you should be able to replace the derivatives with the
>>> complex step derivatives that calculate the loglike function value and
>>> the derivatives in one iteration.
>>>
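Josef's complex-step suggestion works because, for a real-analytic f, f(x + ih) = f(x) + ih f'(x) + O(h^2), so Im f(x + ih)/h approximates f'(x) with no subtractive cancellation, which is why the step can be made tiny. A minimal sketch of the idea (the actual approx_fprime_cs mentioned below differs in details):

```python
import numpy as np

def complex_step_grad(x, f, h=1e-20):
    """Complex-step gradient approximation.

    Perturbing each coordinate by i*h and taking the imaginary part
    of f gives the partial derivative without subtractive
    cancellation, so h can be far below machine epsilon.
    """
    grad = np.zeros(len(x))
    for i in range(len(x)):
        xc = x.astype(complex)
        xc[i] += 1j * h
        grad[i] = f(xc).imag / h
    return grad

# example: f(x) = sum(exp(x)) has gradient exp(x)
f = lambda x: np.sum(np.exp(x))
x0 = np.array([0.5, -1.0])
g = complex_step_grad(x0, f)
```

Note this requires f to be implemented with complex-safe operations (no abs(), comparisons on intermediate values, etc.).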
>>
>> I couldn't figure out how to use it without some hacks.
>> fmin_l_bfgs_b calls both f and fprime as (x, *args), but
>> approx_fprime and approx_fprime_cs actually need to be called as
>> approx_fprime(x, func, args=args), which in turn calls
>> func(x, *args). I changed fmin_l_bfgs_b to make the call this way
>> for the gradient, and I get (on a different computer)
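The hack Skipper describes (patching fmin_l_bfgs_b) can be avoided with a small closure that adapts the two calling conventions. A sketch, using a hypothetical objective in place of the ARMA loglike:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b, approx_fprime

# hypothetical objective standing in for the ARMA loglikelihood
def objective(params, scale):
    return scale * np.sum((params - 1.0) ** 2)

args = (2.0,)

# adapter: fmin_l_bfgs_b calls fprime(x, *args), while scipy's
# approx_fprime expects approx_fprime(x, func, epsilon, *args)
def grad(x, *inner):
    return approx_fprime(x, objective, 1e-8, *inner)

xopt, fval, info = fmin_l_bfgs_b(objective, np.zeros(2),
                                 fprime=grad, args=args)
```

The same wrapper shape works for a complex-step gradient function, since only the adapter body changes.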
>>
>>
>> Using approx_fprime_cs
>> -----------------------------------
>> 861609 function calls (861525 primitive calls) in 3.337 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls tottime percall cumtime percall filename:lineno(function)
>> 70 1.942 0.028 3.213 0.046 kalmanf.py:504(loglike)
>> 840296 1.229 0.000 1.229 0.000 {numpy.core._dotblas.dot}
>> 56 0.038 0.001 0.038 0.001 {numpy.linalg.lapack_lite.zgesv}
>> 270 0.025 0.000 0.025 0.000 {sum}
>> 90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
>> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
>> 162 0.012 0.000 0.014 0.000 arima.py:117(_transparams)
>>
>>
>> Using approx_grad = True
>> ---------------------------------------
>> 1097454 function calls (1097370 primitive calls) in 3.615 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls tottime percall cumtime percall filename:lineno(function)
>> 90 2.316 0.026 3.489 0.039 kalmanf.py:504(loglike)
>> 1073757 1.164 0.000 1.164 0.000 {numpy.core._dotblas.dot}
>> 270 0.025 0.000 0.025 0.000 {sum}
>> 90 0.020 0.000 0.020 0.000 {numpy.linalg.lapack_lite.dgesdd}
>> 182 0.014 0.000 0.016 0.000 arima.py:117(_transparams)
>> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
>> 46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
>> 23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
>>
>>
>> Definitely fewer function calls and a little faster, but I had to
>> write some hacks to get it to work.
>>
>
> This is more like it! With fast recursions in Cython:
>
> 15186 function calls (15102 primitive calls) in 0.750 CPU seconds
>
> Ordered by: internal time
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 18 0.622 0.035 0.625 0.035 kalman_loglike.pyx:15(kalman_loglike)
> 270 0.024 0.000 0.024 0.000 {sum}
> 90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
> 156 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
> 46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
> 110 0.008 0.000 0.010 0.000 arima.py:118(_transparams)
> 46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
> 23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
> 26 0.004 0.000 0.004 0.000 tsatools.py:109(lagmat)
> 90 0.004 0.000 0.042 0.000 arima.py:197(loglike_css)
> 81 0.004 0.000 0.004 0.000 {numpy.core.multiarray._fastCopyAndTranspose}
>
> I can live with this for now.
>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
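For context on why the Cython version above pays off so well: a Kalman-filter loglikelihood is a per-observation recursion of tiny matrix products, so numpy.dot call overhead dominates the arithmetic. A hypothetical pure-Python sketch of such a recursion (not the statsmodels code, which differs in details):

```python
import numpy as np

def kalman_loglike(y, T, Z, RQR, a0, P0):
    """Gaussian loglikelihood of a univariate series via the Kalman
    filter. Each step is a handful of small matrix products, which is
    exactly the pattern where per-call dot overhead swamps the math.
    """
    a, P = a0.copy(), P0.copy()
    ll = 0.0
    for yt in y:
        v = yt - np.dot(Z, a)[0]                 # innovation
        F = np.dot(Z, np.dot(P, Z.T))[0, 0]      # innovation variance
        K = np.dot(T, np.dot(P, Z.T)) / F        # Kalman gain
        ll -= 0.5 * (np.log(2 * np.pi * F) + v ** 2 / F)
        a = np.dot(T, a) + K[:, 0] * v           # state prediction
        P = np.dot(T, np.dot(P, T.T)) - F * np.dot(K, K.T) + RQR
    return ll

# sanity check: an AR(1) with no measurement noise, where the exact
# loglikelihood is available in closed form
phi, q = 0.5, 1.0
s0 = q / (1 - phi ** 2)            # stationary variance
y = np.array([0.3, -0.1, 0.5])
ll = kalman_loglike(y, np.array([[phi]]), np.array([[1.0]]),
                    np.array([[q]]), np.zeros(1), np.array([[s0]]))
```

A Cython version of the same loop replaces each dot with a direct BLAS call (or unrolled scalar code), removing the Python-level dispatch entirely.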
Revisiting this topic from a few months ago. I was able to get Tokyo
(see github.com/tokyo/tokyo and
http://www.vetta.org/2009/09/tokyo-a-cython-blas-wrapper-for-fast-matrix-math/)
to build against the BLAS in EPD 7.0 (ATLAS? or MKL?) with some
modifications to the setup.py; see my fork on github:
https://github.com/wesm/tokyo/blob/master/setup.py
Someone who knows a bit more about linking against multiple versions
of BLAS/ATLAS/MKL might be able to make it work more generally; I
basically just poked around numpy/core/setup.py.
The speedups for small matrices suggest it would probably be
worthwhile to have in our toolbelt:
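(Tokyo wraps BLAS at the Cython level; a rough feel for the same effect is available from pure Python via scipy.linalg.blas, which exposes the raw routines without numpy.dot's dispatch and argument-checking overhead. A small sketch, not Tokyo itself:)

```python
import numpy as np
from scipy.linalg import blas

# Fortran-ordered inputs let dgemm run without copies; for tiny
# matrices the savings come from skipping Python-level overhead,
# not from the arithmetic itself
a = np.asfortranarray(np.random.rand(4, 4))
b = np.asfortranarray(np.random.rand(4, 4))
c = blas.dgemm(1.0, a, b)   # c = 1.0 * a @ b
```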
$ python double_speed.py
<snip>
SPEED TEST BLAS 2
Double precision: Vector size = 4 Matrix size = 4x4
numpy.dot +: 312 kc/s
dgemv: 2857 kc/s 9.1x
dgemv3: 9091 kc/s 29.1x
dgemv5: 9375 kc/s 30.0x
dgemv6: 9375 kc/s 30.0x
dgemv_: 11321 kc/s 36.2x
numpy.outer: 118 kc/s
dger: 2344 kc/s 19.8x
dger3: 7895 kc/s 66.7x
dger4: 8108 kc/s 68.5x
dger_: 9449 kc/s 79.8x
Double precision: Vector size = 15 Matrix size = 15x15
numpy.dot +: 296 kc/s
dgemv: 2000 kc/s 6.8x
dgemv3: 4444 kc/s 15.0x
dgemv5: 5000 kc/s 16.9x
dgemv6: 4615 kc/s 15.6x
dgemv_: 5217 kc/s 17.6x
numpy.outer: 89 kc/s
dger: 1143 kc/s 12.9x
dger3: 2330 kc/s 26.2x
dger4: 2667 kc/s 30.0x
dger_: 2824 kc/s 31.8x
Double precision: Vector size = 30 Matrix size = 30x30
numpy.dot +: 261 kc/s
dgemv: 1271 kc/s 4.9x
dgemv3: 2676 kc/s 10.3x
dgemv5: 2311 kc/s 8.9x
dgemv6: 2676 kc/s 10.3x
dgemv_: 2421 kc/s 9.3x
numpy.outer: 64 kc/s
dger: 782 kc/s 12.2x
dger3: 1412 kc/s 22.1x
dger4: 1182 kc/s 18.5x
dger_: 1356 kc/s 21.2x
SPEED TEST BLAS 3
Double precision: Vector size = 4 Matrix size = 4x4
numpy.dot: 845 kc/s
dgemm: 2259 kc/s 2.7x
dgemm3: 4808 kc/s 5.7x
dgemm5: 4934 kc/s 5.8x
dgemm7: 4808 kc/s 5.7x
dgemm_: 5357 kc/s 6.3x
Double precision: Vector size = 15 Matrix size = 15x15
numpy.dot: 290 kc/s
dgemm: 476 kc/s 1.6x
dgemm3: 580 kc/s 2.0x
dgemm5: 606 kc/s 2.1x
dgemm7: 580 kc/s 2.0x
dgemm_: 606 kc/s 2.1x
Double precision: Vector size = 30 Matrix size = 30x30
numpy.dot: 108 kc/s
dgemm: 128 kc/s 1.2x
dgemm3: 145 kc/s 1.3x
dgemm5: 139 kc/s 1.3x
dgemm7: 145 kc/s 1.3x
dgemm_: 145 kc/s 1.3x