C-coded dot 1000x faster than numpy?
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
I have code that performs a dot product of a 2D matrix of size (on the order of) [1000,16] with a vector of size [1000]. The matrix is float64 and the vector is complex128. I was using numpy.dot but it turned out to be a bottleneck.

So I coded dot2x1 in c++ (using xtensor-python just for the interface). No fancy simd was used, unless g++ did it on its own.

On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy. Here is the test code. My custom c++ code is dot2x1; I'm not copying it here because it has some dependencies. Any idea what is going on?

    import numpy as np
    from dot2x1 import dot2x1

    a = np.ones((1000, 16))
    b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
                  -0.80311816+0.80311816j, -0.80311816-0.80311816j,
                   1.09707981+0.29396165j,  1.09707981-0.29396165j,
                  -1.09707981+0.29396165j, -1.09707981-0.29396165j,
                   0.29396165+1.09707981j,  0.29396165-1.09707981j,
                  -0.29396165+1.09707981j, -0.29396165-1.09707981j,
                   0.25495815+0.25495815j,  0.25495815-0.25495815j,
                  -0.25495815+0.25495815j, -0.25495815-0.25495815j])

    def F1():
        d = dot2x1(a, b)

    def F2():
        d = np.dot(a, b)

    from timeit import timeit
    print(timeit('F1()', globals=globals(), number=1000))
    print(timeit('F2()', globals=globals(), number=1000))

    In [13]: 0.013910860987380147   << 1st timeit (dot2x1)
             28.608758996007964     << 2nd timeit (np.dot)

-- Those who don't understand recursion are doomed to repeat it
![](https://secure.gravatar.com/avatar/6451cf8fd41987f0ddaf8a61982695db.jpg?s=120&d=mm&r=g)
Hi,

On Tue, 23 Feb 2021 at 19.11, Neal Becker <ndbecker2@gmail.com> wrote:
I have code that performs dot product of a 2D matrix of size (on the order of) [1000,16] with a vector of size [1000]. The matrix is float64 and the vector is complex128. I was using numpy.dot but it turned out to be a bottleneck.
So I coded dot2x1 in c++ (using xtensor-python just for the interface). No fancy simd was used, unless g++ did it on it's own.
On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy? Here is the test code: My custom c++ code is dot2x1. I'm not copying it here because it has some dependencies. Any idea what is going on?
I had a similar experience - albeit with an older numpy and Python 2.7, so my comments are easily outdated and irrelevant. This was on Windows 10 64 bit, with plenty of RAM. It took me forever to find out that numpy.dot was the culprit, and I ended up using Fortran + f2py. Even with the overhead of going through the f2py bridge, the Fortran dot_product was several times faster. Sorry if it doesn't help much.

Andrea.
![](https://secure.gravatar.com/avatar/159e80201cc915c29a9359ee2e7f70c6.jpg?s=120&d=mm&r=g)
For the first benchmark: apparently A.dot(B) with A real and B complex is a known performance issue, see https://github.com/numpy/numpy/issues/10468

In general, it might be worth trying different BLAS backends. For instance, if you install numpy from conda-forge you should be able to switch between OpenBLAS, MKL and BLIS: https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-i...

Roman
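If the mixed real/complex path in that issue is indeed the culprit, one thing worth timing is an explicit promotion of the float64 matrix to complex128, so the product stays on the complex BLAS path instead of numpy's mixed-dtype fallback. The following is an illustrative sketch, not code from the thread; the complex vector is a stand-in with the benchmark's shapes:

```python
import numpy as np
from timeit import timeit

a = np.ones((1000, 16))              # float64, as in the original benchmark
b = np.ones(16) + 1j * np.ones(16)   # stand-in complex128 vector

def mixed():
    return np.dot(a, b)                        # float64 @ complex128 path

def promoted():
    # Promote once per call; the copy costs something, but the product
    # itself should then be handled by the complex BLAS routine.
    return np.dot(a.astype(np.complex128), b)

assert np.allclose(mixed(), promoted())
print("mixed dtypes:", timeit(mixed, number=1000))
print("promoted:    ", timeit(promoted, number=1000))
```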
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
One suspect is that it seems the numpy version was multi-threading. This isn't useful here, because I'm running parallel Monte Carlo simulations using all cores. Perhaps this is perversely slowing things down? I don't know how to account for a 1000x slowdown, though.

On Tue, Feb 23, 2021 at 1:40 PM Roman Yurchak <rth.yurchak@gmail.com> wrote:
For the first benchmark apparently A.dot(B) with A real and B complex is a known issue performance wise https://github.com/numpy/numpy/issues/10468
In general, it might be worth trying different BLAS backends. For instance, if you install numpy from conda-forge you should be able to switch between OpenBLAS, MKL and BLIS: https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-i...
-- Those who don't understand recursion are doomed to repeat it
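One way to test the threading/oversubscription hypothesis above is to pin the BLAS thread pool to a single thread for the same call and compare timings. This is a sketch, not code from the thread, and it assumes the third-party threadpoolctl package is installed:

```python
import numpy as np
from timeit import timeit
from threadpoolctl import threadpool_limits  # third-party: pip install threadpoolctl

a = np.ones((1000, 16))              # float64 matrix, as in the benchmark
b = np.ones(16) + 1j * np.ones(16)   # stand-in complex128 vector

# Baseline: whatever thread count the OpenMP/OpenBLAS build picked by default.
print("default: ", timeit(lambda: np.dot(a, b), number=1000))

# Same call with the BLAS/OpenMP thread pools limited to one thread.
with threadpool_limits(limits=1):
    print("1 thread:", timeit(lambda: np.dot(a, b), number=1000))
```

Setting OMP_NUM_THREADS=1 before starting Python should have a similar effect for an OpenMP build of OpenBLAS.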
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
I'm using Fedora 33 standard numpy. ldd says:

    /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
        linux-vdso.so.1 (0x00007ffdd1487000)
        libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever flexiblas is doing controls BLAS.
-- Those who don't understand recursion are doomed to repeat it
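A programmatic cross-check of what ldd shows (an illustration, again assuming the third-party threadpoolctl package) would be to ask numpy and the running process which BLAS is actually loaded and how many threads it is using:

```python
import numpy as np
from threadpoolctl import threadpool_info  # third-party: pip install threadpoolctl

np.show_config()   # what numpy was built against

# What is actually loaded in this process: library file, internal API
# (openblas, mkl, blis, ...) and the thread count it is currently using.
for pool in threadpool_info():
    print(pool["filename"], pool.get("internal_api"), pool["num_threads"])
```

Whether threadpoolctl can see through the FlexiBLAS wrapper is not guaranteed; if it reports nothing useful, the ldd output plus `flexiblas print` remains the way to check.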
![](https://secure.gravatar.com/avatar/29a8424a5c1ddc5e0e79104965a85011.jpg?s=120&d=mm&r=g)
https://stackoverflow.com/questions/19839539/how-to-get-faster-code-than-num...

Maybe C_CONTIGUOUS vs F_CONTIGUOUS?

Carl

On Tue, Feb 23, 2021 at 19:52, Neal Becker <ndbecker2@gmail.com> wrote:
One suspect is that it seems the numpy version was multi-threading. This isn't useful here, because I'm running parallel monte-carlo simulations using all cores. Perhaps this is perversely slowing things down? I don't know how to account for 1000x slowdown though.
![](https://secure.gravatar.com/avatar/18a30ce6d84de6ce5c11ce006d10f616.jpg?s=120&d=mm&r=g)
On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak, <rth.yurchak@gmail.com> wrote:
For the first benchmark apparently A.dot(B) with A real and B complex is a known issue performance wise https://github.com/numpy/numpy/issues/10468
I split B into a vector of size (N, 2) for the real and imaginary parts, and that makes the multiplication twice as fast.

My configuration (also on Fedora 33), from np.show_config():

    blas_mkl_info:
      NOT AVAILABLE
    blis_info:
      NOT AVAILABLE
    openblas_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_mkl_info:
      NOT AVAILABLE
    openblas_lapack_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
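A minimal sketch of that split (illustrative code, not David's; the shapes follow the thread's benchmark): stacking the real and imaginary parts into one float64 array keeps the whole product on the real BLAS path, with a single recombination at the end.

```python
import numpy as np

A = np.ones((1000, 16))              # float64 matrix
B = np.ones(16) + 1j * np.ones(16)   # complex128 vector

# (16, 2) float64 array: column 0 = real part, column 1 = imaginary part.
B2 = np.column_stack([B.real, B.imag])

C2 = A @ B2                          # one real-valued product, shape (1000, 2)
C = C2[:, 0] + 1j * C2[:, 1]         # recombine into complex128

assert np.allclose(C, A @ B)         # same result as the mixed-dtype dot
```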
![](https://secure.gravatar.com/avatar/29a8424a5c1ddc5e0e79104965a85011.jpg?s=120&d=mm&r=g)
The stackoverflow link above contains a simple testcase:
    >>> from scipy.linalg import get_blas_funcs
    >>> gemm = get_blas_funcs("gemm", [X, Y])
    >>> np.all(gemm(1, X, Y) == np.dot(X, Y))
    True
It would be of interest to benchmark gemm against np.dot. Maybe np.dot doesn't use BLAS at all for whatever reason?

On Tue, Feb 23, 2021 at 20:46, David Menéndez Hurtado <davidmenhur@gmail.com> wrote:
I splitted B into a vector of size (N, 2) for the real and imaginary part, and that makes the multiplication twice as fast.
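A sketch of the gemm-vs-np.dot comparison suggested above (illustrative code; X and Y are stand-ins shaped like the thread's benchmark, with Y made a column vector so gemm applies). get_blas_funcs picks the routine for the arrays' common dtype, so a complex operand selects zgemm:

```python
import numpy as np
from scipy.linalg import get_blas_funcs
from timeit import timeit

X = np.ones((1000, 16))
Y = (np.ones(16) + 1j * np.ones(16)).reshape(16, 1)   # complex128 column vector

Xc = X.astype(np.complex128)          # promote once, outside the timed loop
gemm = get_blas_funcs("gemm", (Xc, Y))  # resolves to zgemm for complex128 inputs

assert np.allclose(gemm(1.0, Xc, Y), np.dot(X, Y))

print("BLAS gemm:", timeit(lambda: gemm(1.0, Xc, Y), number=1000))
print("np.dot:   ", timeit(lambda: np.dot(X, Y), number=1000))
```

If the explicit BLAS call is much faster than np.dot on the mixed-dtype inputs, that points at the slow mixed real/complex path rather than at the BLAS library itself.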
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On Tue, Feb 23, 2021 at 11:10 AM Neal Becker <ndbecker2@gmail.com> wrote:
I have code that performs dot product of a 2D matrix of size (on the order of) [1000,16] with a vector of size [1000]. The matrix is float64 and the vector is complex128. I was using numpy.dot but it turned out to be a bottleneck.
So I coded dot2x1 in c++ (using xtensor-python just for the interface). No fancy simd was used, unless g++ did it on it's own.
On a simple benchmark using timeit I find my hand-coded routine is on the order of 1000x faster than numpy? Here is the test code: My custom c++ code is dot2x1. I'm not copying it here because it has some dependencies. Any idea what is going on?
I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.

Chuck
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris <charlesr.harris@gmail.com> wrote:
I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.
What BLAS library are you using? I get much better results using an 8-year-old i5 and ATLAS.

Chuck
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
See my earlier email - this is Fedora 33, Python 3.9. I'm using the Fedora 33 standard numpy. ldd says:

    /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
        linux-vdso.so.1 (0x00007ffdd1487000)
        libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)

So whatever flexiblas is doing controls BLAS. flexiblas print says:

    FlexiBLAS, version 3.0.4
    Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler and others.
    This is free software; see the source code for copying conditions. There is
    ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    Configured BLAS libraries:
    System-wide (/etc/flexiblasrc):
    System-wide from config directory (/etc/flexiblasrc.d/)
      OPENBLAS-OPENMP
        library = libflexiblas_openblas-openmp.so
        comment =
      NETLIB
        library = libflexiblas_netlib.so
        comment =
      ATLAS
        library = libflexiblas_atlas.so
        comment =
    User config (/home/nbecker/.flexiblasrc):
    Host config (/home/nbecker/.flexiblasrc.nbecker8):
    Available hooks:
    Backend and hook search paths:
        /usr/lib64/flexiblas/
    Default BLAS:
        System:          OPENBLAS-OPENMP
        User:            (none)
        Host:            (none)
        Active Default:  OPENBLAS-OPENMP (System)
    Runtime properties:
        verbose = 0 (System)

So it looks to me like it is using openblas-openmp.

On Tue, Feb 23, 2021 at 8:00 PM Charles R Harris <charlesr.harris@gmail.com> wrote:
What BLAS library are you using. I get much better results using an 8 year old i5 and ATLAS.
-- Those who don't understand recursion are doomed to repeat it
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On Wed, Feb 24, 2021 at 5:36 AM Neal Becker <ndbecker2@gmail.com> wrote:
So it looks to me it is using openblas-openmp.
ISTR that there have been problems with openmp. There are a ton of OpenBLAS versions available in Fedora 33, just via flexiblas:

1. flexiblas-openblas-openmp.x86_64 : FlexiBLAS wrappers for OpenBLAS
2. flexiblas-openblas-openmp.i686 : FlexiBLAS wrappers for OpenBLAS
3. flexiblas-openblas-openmp64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
4. flexiblas-openblas-serial.x86_64 : FlexiBLAS wrappers for OpenBLAS
5. flexiblas-openblas-serial64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)
6. flexiblas-openblas-threads.x86_64 : FlexiBLAS wrappers for OpenBLAS
7. flexiblas-openblas-threads64.x86_64 : FlexiBLAS wrappers for OpenBLAS (64-bit)

I am not sure how to make use of flexiblas, but would explore that. We might need to do something with distutils to interoperate with it, or maybe you can control it through site.cfg. There are 12 versions available in total. I would suggest trying serial or pthreads.

Chuck
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On Wed, Feb 24, 2021 at 8:02 AM Charles R Harris <charlesr.harris@gmail.com> wrote:
I am not sure how to make use of flexiblas, but would explore that. We might need to do something with distutils to interoperate with it or maybe you can control it though site.cfg. There are 12 versions available in total. I would suggest trying serial or pthreads.
Seems to be controlled in the /etc directory:

    /etc/flexiblas64rc.d/openblas-openmp64.conf

On my machine it looks like openmp64 is the system default.

Chuck
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Supposedly it can be controlled through environment variables, but I didn't see any effect.

On Wed, Feb 24, 2021, 10:12 AM Charles R Harris <charlesr.harris@gmail.com> wrote:
Seems to be controlled in the /etc directory:
/etc/flexiblas64rc.d/openblas-openmp64.conf
On my machine it looks like openmp64 is the system default.
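For completeness, a sketch of the environment-variable route discussed above. This is illustrative and untested in the thread: FlexiBLAS reads the FLEXIBLAS variable when the library is loaded, so it has to be set before numpy is imported (or before the Python process starts), which may be why setting it later showed no effect. The backend names ("NETLIB", "ATLAS", "OPENBLAS-OPENMP") are the ones listed by `flexiblas print` earlier in the thread; substitute whatever is installed.

```python
import os

# Must be set before numpy (and therefore libflexiblas) is loaded.
os.environ["FLEXIBLAS"] = "NETLIB"   # backend name from `flexiblas print`

import numpy as np
from timeit import timeit

a = np.ones((1000, 16))
b = np.ones(16) + 1j * np.ones(16)
print(timeit(lambda: np.dot(a, b), number=1000))
```

Equivalently, launching the benchmark as `FLEXIBLAS=NETLIB python <script>` from the shell avoids relying on import order.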
participants (6)

- Andrea Gavana
- Carl Kleffner
- Charles R Harris
- David Menéndez Hurtado
- Neal Becker
- Roman Yurchak