[Numpy-discussion] C-coded dot 1000x faster than numpy?
Neal Becker
ndbecker2 at gmail.com
Wed Feb 24 07:35:37 EST 2021
See my earlier email - this is fedora 33, python3.9.
I'm using fedora 33 standard numpy.
ldd says:
/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x00007ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f0512787000)
So whatever flexiblas is doing controls which BLAS is used.
Output of 'flexiblas print':
FlexiBLAS, version 3.0.4
Copyright (C) 2014, 2015, 2016, 2017, 2018, 2019, 2020 Martin Koehler
and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.
Configured BLAS libraries:
System-wide (/etc/flexiblasrc):
System-wide from config directory (/etc/flexiblasrc.d/)
OPENBLAS-OPENMP
library = libflexiblas_openblas-openmp.so
comment =
NETLIB
library = libflexiblas_netlib.so
comment =
ATLAS
library = libflexiblas_atlas.so
comment =
User config (/home/nbecker/.flexiblasrc):
Host config (/home/nbecker/.flexiblasrc.nbecker8):
Available hooks:
Backend and hook search paths:
/usr/lib64/flexiblas/
Default BLAS:
System: OPENBLAS-OPENMP
User: (none)
Host: (none)
Active Default: OPENBLAS-OPENMP (System)
Runtime properties:
verbose = 0 (System)
So it looks to me like it is using openblas-openmp.
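Since FlexiBLAS picks its backend before NumPy loads, the backends listed above can be compared directly by setting the FLEXIBLAS environment variable per process. A minimal sketch (my own, not from the thread; backend names are taken from the 'flexiblas print' listing above, and the timing runs in a subprocess because the variable must be set before NumPy is imported):

```python
import os
import subprocess
import sys

# Tiny benchmark matching the shapes in the thread: (1000, 16) float64
# matrix dotted with a 16-element complex128 vector.
BENCH = (
    "import numpy as np, timeit\n"
    "a = np.ones((1000, 16))\n"
    "b = np.ones(16, dtype=np.complex128)\n"
    "print(timeit.timeit(lambda: np.dot(a, b), number=1000))\n"
)

results = {}
for backend in ("OPENBLAS-OPENMP", "NETLIB", "ATLAS"):
    # FLEXIBLAS is read at load time by FlexiBLAS; it is harmlessly
    # ignored on systems without FlexiBLAS installed.
    env = dict(os.environ, FLEXIBLAS=backend)
    proc = subprocess.run([sys.executable, "-c", BENCH],
                          env=env, capture_output=True, text=True)
    results[backend] = proc.stdout.strip() or proc.stderr.strip()
    print(backend, "->", results[backend])
```

On a Fedora setup like the one described above, this should print one timing per configured backend, making regressions in a single backend easy to spot.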
On Tue, Feb 23, 2021 at 8:00 PM Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
>
> On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris <charlesr.harris at gmail.com> wrote:
>>
>>
>>
>> On Tue, Feb 23, 2021 at 11:10 AM Neal Becker <ndbecker2 at gmail.com> wrote:
>>>
>>> I have code that performs a dot product of a 2D matrix of size (on the
>>> order of) [1000,16] with a vector of size [16]. The matrix is
>>> float64 and the vector is complex128. I was using numpy.dot but it
>>> turned out to be a bottleneck.
>>>
>>> So I coded dot2x1 in C++ (using xtensor-python just for the
>>> interface). No fancy SIMD was used, unless g++ added it on its own.
>>>
>>> On a simple benchmark using timeit I find my hand-coded routine is on
>>> the order of 1000x faster than numpy. Here is the test code. My
>>> custom C++ code is dot2x1; I'm not copying it here because it has
>>> some dependencies. Any idea what is going on?
>>>
>>> import numpy as np
>>>
>>> from dot2x1 import dot2x1
>>>
>>> a = np.ones ((1000,16))
>>> b = np.array([ 0.80311816+0.80311816j, 0.80311816-0.80311816j,
>>> -0.80311816+0.80311816j, -0.80311816-0.80311816j,
>>> 1.09707981+0.29396165j, 1.09707981-0.29396165j,
>>> -1.09707981+0.29396165j, -1.09707981-0.29396165j,
>>> 0.29396165+1.09707981j, 0.29396165-1.09707981j,
>>> -0.29396165+1.09707981j, -0.29396165-1.09707981j,
>>> 0.25495815+0.25495815j, 0.25495815-0.25495815j,
>>> -0.25495815+0.25495815j, -0.25495815-0.25495815j])
>>>
>>> def F1():
>>>     d = dot2x1(a, b)
>>>
>>> def F2():
>>>     d = np.dot(a, b)
>>>
>>> from timeit import timeit
>>> print (timeit ('F1()', globals=globals(), number=1000))
>>> print (timeit ('F2()', globals=globals(), number=1000))
>>>
>>> 0.013910860987380147   << 1st timeit (dot2x1)
>>> 28.608758996007964     << 2nd timeit (np.dot)
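One likely contributor (my note, not a claim from the thread): with a float64 matrix and a complex128 vector, np.dot has to promote the matrix to complex128 on every call before any BLAS routine can run. Paying that cast once up front, or splitting the complex vector into real and imaginary parts so the products stay in real float64 GEMV, avoids the per-call copy. A hedged sketch:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 16))                          # float64 matrix
b = rng.standard_normal(16) + 1j * rng.standard_normal(16)   # complex128 vector

# Option 1: pay the float64 -> complex128 promotion once, not per call.
ac = a.astype(np.complex128)

# Option 2: keep the products in real arithmetic and recombine.
def dot_split(a, b):
    return a @ b.real + 1j * (a @ b.imag)

# Both variants compute the same result as the mixed-dtype dot.
assert np.allclose(np.dot(a, b), np.dot(ac, b))
assert np.allclose(np.dot(a, b), dot_split(a, b))

print("mixed   :", timeit.timeit(lambda: np.dot(a, b), number=1000))
print("precast :", timeit.timeit(lambda: np.dot(ac, b), number=1000))
print("split   :", timeit.timeit(lambda: dot_split(a, b), number=1000))
```

Whether this accounts for the full 1000x gap depends on the BLAS and threading behavior, but it removes one obvious per-call cost.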
>>
>>
>> I'm going to guess threading, although huge pages can also be a problem on a machine under heavy load running other processes. Call overhead may also matter for such small matrices.
>>
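The threading guess can be checked directly: cap the BLAS thread pool at one thread and re-time the same call. A sketch assuming the optional threadpoolctl package (an assumption on my part, with a graceful fallback if it is not installed):

```python
import timeit
import numpy as np

# Same shapes as the benchmark in the thread.
a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)

def bench():
    return timeit.timeit(lambda: np.dot(a, b), number=1000)

t_default = bench()          # whatever thread count the BLAS chooses
try:
    from threadpoolctl import threadpool_limits
    with threadpool_limits(limits=1):    # force single-threaded BLAS
        t_single = bench()
except ImportError:
    t_single = None          # threadpoolctl not installed; skip the check

print("default threads:", t_default, "single thread:", t_single)
```

If the single-threaded run is dramatically faster, the slowdown is thread-spawn/synchronization overhead on a matrix far too small to benefit from parallel GEMV.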
>
> What BLAS library are you using? I get much better results using an 8 year old i5 and ATLAS.
>
> Chuck
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
--
Those who don't understand recursion are doomed to repeat it