[Numpy-discussion] C-coded dot 1000x faster than numpy?

Neal Becker ndbecker2 at gmail.com
Wed Feb 24 07:35:37 EST 2021


See my earlier email - this is Fedora 33, Python 3.9.

I'm using the standard Fedora 33 numpy package.
ldd says:

/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x00007ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever FlexiBLAS is doing controls which BLAS actually gets used.

$ flexiblas print
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.


Configured BLAS libraries:
System-wide (/etc/flexiblasrc):

System-wide from config directory (/etc/flexiblasrc.d/)
 OPENBLAS-OPENMP
   library = libflexiblas_openblas-openmp.so
   comment =
 NETLIB
   library = libflexiblas_netlib.so
   comment =
 ATLAS
   library = libflexiblas_atlas.so
   comment =

User config (/home/nbecker/.flexiblasrc):

Host config (/home/nbecker/.flexiblasrc.nbecker8):

Available hooks:

Backend and hook search paths:
  /usr/lib64/flexiblas/

Default BLAS:
    System:       OPENBLAS-OPENMP
    User:         (none)
    Host:         (none)
    Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
   verbose = 0 (System)

So it looks to me like it is using OPENBLAS-OPENMP.
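
For a runtime check (rather than inferring from ldd), something like the
sketch below should confirm which BLAS numpy actually loaded and how many
threads it grabs.  This assumes the threadpoolctl package is installed:

import numpy as np

# Build-time view: the BLAS/LAPACK numpy was configured against.
np.show_config()

# Runtime view: threadpoolctl inspects the BLAS actually loaded into
# this process, including its thread count.
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool["internal_api"], pool["filepath"], pool["num_threads"])

# FlexiBLAS can also switch backends without rebuilding, e.g. by
# starting the process with FLEXIBLAS=NETLIB set in the environment.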

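One more thought on the benchmark quoted below: a is float64 while b is
complex128, so np.dot has to reconcile the dtypes before (or instead of)
calling BLAS, and I suspect that mixed-type path is part of the story.
Here is a sketch of a workaround that keeps both operands real (the
array names follow the benchmark; b here is a stand-in vector, and the
split is valid because a is real):

import numpy as np

a = np.ones((1000, 16))                        # float64, as in the benchmark
b = np.exp(2j * np.pi * np.arange(16) / 16.0)  # stand-in complex128 vector

# Since a is real, a @ b == (a @ b.real) + 1j * (a @ b.imag), and each
# half is a pure float64 product that BLAS dgemv handles directly.
d_mixed = np.dot(a, b)
d_split = np.dot(a, b.real) + 1j * np.dot(a, b.imag)

assert np.allclose(d_mixed, d_split)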

On Tue, Feb 23, 2021 at 8:00 PM Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
>
> On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris <charlesr.harris at gmail.com> wrote:
>>
>>
>>
>> On Tue, Feb 23, 2021 at 11:10 AM Neal Becker <ndbecker2 at gmail.com> wrote:
>>>
>>> I have code that performs a dot product of a 2D matrix of size (on the
>>> order of) [1000,16] with a vector of size [16].  The matrix is
>>> float64 and the vector is complex128.  I was using numpy.dot but it
>>> turned out to be a bottleneck.
>>>
>>> So I coded dot2x1 in C++ (using xtensor-python just for the
>>> interface).  No fancy SIMD was used, unless g++ did it on its own.
>>>
>>> On a simple benchmark using timeit I find my hand-coded routine is on
>>> the order of 1000x faster than numpy.  Here is the test code (my
>>> custom C++ routine is dot2x1; I'm not copying it here because it has
>>> some dependencies).  Any idea what is going on?
>>>
>>> import numpy as np
>>>
>>> from dot2x1 import dot2x1
>>>
>>> a = np.ones ((1000,16))
>>> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>>>        -0.80311816+0.80311816j, -0.80311816-0.80311816j,
>>>         1.09707981+0.29396165j,  1.09707981-0.29396165j,
>>>        -1.09707981+0.29396165j, -1.09707981-0.29396165j,
>>>         0.29396165+1.09707981j,  0.29396165-1.09707981j,
>>>        -0.29396165+1.09707981j, -0.29396165-1.09707981j,
>>>         0.25495815+0.25495815j,  0.25495815-0.25495815j,
>>>        -0.25495815+0.25495815j, -0.25495815-0.25495815j])
>>>
>>> def F1():
>>>     d = dot2x1 (a, b)
>>>
>>> def F2():
>>>     d = np.dot (a, b)
>>>
>>> from timeit import timeit
>>> print (timeit ('F1()', globals=globals(), number=1000))
>>> print (timeit ('F2()', globals=globals(), number=1000))
>>>
>>> 0.013910860987380147  << 1st timeit (dot2x1)
>>> 28.608758996007964    << 2nd timeit (np.dot)
>>
>>
>> I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.
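>>
>> For scale, 28.6 s / 1000 calls is ~29 ms per dot of only ~16k
>> multiply-adds, which smells more like per-call thread synchronization
>> than compute.  One way to test the threading guess is to pin the BLAS
>> pool to one thread and re-run F2 -- a sketch, assuming threadpoolctl
>> is available and reusing a and b from the benchmark:
>>
>> from threadpoolctl import threadpool_limits
>> import numpy as np
>>
>> # Limit the loaded BLAS to one thread for this block; if per-call
>> # thread startup/synchronization dominates, np.dot should get much
>> # faster on these small arrays.
>> with threadpool_limits(limits=1, user_api="blas"):
>>     d = np.dot(a, b)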
>>
>
> What BLAS library are you using? I get much better results using an 8-year-old i5 and ATLAS.
>
> Chuck



--
Those who don't understand recursion are doomed to repeat it

