[Numpy-discussion] np.diag(np.dot(A, B))

Daπid davidmenhur at gmail.com
Fri May 22 06:53:26 EDT 2015


On 22 May 2015 at 12:15, Mathieu Blondel <mathieu at mblondel.org> wrote:

> Right now I am using np.sum(A * B.T, axis=1) for dense data and I have
> implemented a Cython routine for sparse data.
> I haven't benched np.sum(A * B.T, axis=1) vs. np.einsum("ij,ji->i", A, B)
> yet since I am mostly interested in the sparse case right now.
>
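(As an aside on the sparse case: without seeing your Cython routine, one pure-SciPy sketch is to multiply A elementwise with B.T and sum the rows, which avoids materializing the full product. The function name and sizes here are made up for illustration.)

```python
import numpy as np
import scipy.sparse as sp

def sparse_dot_diag(A, B):
    """Return np.diag(A @ B) for sparse A (n x m) and B (m x n).

    diag(A @ B)[i] = sum_j A[i, j] * B[j, i], so the elementwise
    product A.multiply(B.T) followed by a row sum suffices; the
    full n x n product is never formed.
    """
    return np.asarray(A.multiply(B.T).sum(axis=1)).ravel()

# Example with random sparse matrices (hypothetical sizes/density):
A = sp.random(100, 80, density=0.1, format='csr', random_state=0)
B = sp.random(80, 100, density=0.1, format='csr', random_state=1)
d = sparse_dot_diag(A, B)
```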


On my system, einsum seems to be faster.



In [3]: N = 256

In [4]: A = np.random.random((N, N))

In [5]: B = np.random.random((N, N))

In [6]: %timeit np.sum(A * B.T, axis=1)
1000 loops, best of 3: 260 µs per loop

In [7]: %timeit  np.einsum("ij,ji->i", A, B)
10000 loops, best of 3: 147 µs per loop


In [9]: N = 1023

In [10]: A = np.random.random((N, N))

In [11]: B = np.random.random((N, N))

In [12]: %timeit np.sum(A * B.T, axis=1)
100 loops, best of 3: 14 ms per loop

In [13]: %timeit  np.einsum("ij,ji->i", A, B)
100 loops, best of 3: 10.7 ms per loop
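(For reference, both expressions compute the same thing as np.diag(np.dot(A, B)) without forming the full product; a quick sanity check:)

```python
import numpy as np

N = 256
rng = np.random.RandomState(0)
A = rng.random_sample((N, N))
B = rng.random_sample((N, N))

# All three agree; only the first materializes the full N x N product.
d_full = np.diag(np.dot(A, B))
d_sum = np.sum(A * B.T, axis=1)
d_einsum = np.einsum("ij,ji->i", A, B)
assert np.allclose(d_full, d_sum)
assert np.allclose(d_full, d_einsum)
```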


I have ATLAS installed from the Fedora repos, so it is not tuned; but einsum is
using only one thread anyway, so it is probably not calling into BLAS at all
(and definitely not computing the full dot product, since that alone already
takes 200 ms).

If B is in Fortran order, einsum is much faster (here for N = 5000).

In [25]: Bf = B.copy(order='F')

In [26]: %timeit  np.einsum("ij,ji->i", A, Bf)
10 loops, best of 3: 25.7 ms per loop

In [27]: %timeit  np.einsum("ij,ji->i", A, B)
1 loops, best of 3: 404 ms per loop

In [29]: %timeit np.sum(A * Bf.T, axis=1)
10 loops, best of 3: 118 ms per loop

In [30]: %timeit np.sum(A * B.T, axis=1)
1 loops, best of 3: 517 ms per loop

But the copy is not worth it:

In [31]: %timeit Bf = B.copy(order='F')
1 loops, best of 3: 463 ms per loop
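(Whether a copy could pay off depends on the memory layout you start from, which is easy to inspect; a small sketch, with a hypothetical size:)

```python
import numpy as np

B = np.random.random((512, 512))

# einsum("ij,ji->i", A, B) walks B column-by-column, which is the fast
# axis only when B is Fortran-ordered (column-major).
print(B.flags['C_CONTIGUOUS'])   # default NumPy layout is C order
print(B.flags['F_CONTIGUOUS'])

# np.asfortranarray copies only if B is not already Fortran-ordered.
Bf = np.asfortranarray(B)
print(Bf.flags['F_CONTIGUOUS'])
```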



/David.