How to improve performance of slow tri*_indices calculations?
Hi, Running on: In []: np.__version__ Out[]: '1.5.1' In []: sys.version Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]' For the reference: In []: X= randn(10, 125) In []: timeit dot(X.T, X) 10000 loops, best of 3: 170 us per loop In []: X= randn(10, 250) In []: timeit dot(X.T, X) 1000 loops, best of 3: 671 us per loop In []: X= randn(10, 500) In []: timeit dot(X.T, X) 100 loops, best of 3: 5.15 ms per loop In []: X= randn(10, 1000) In []: timeit dot(X.T, X) 100 loops, best of 3: 20 ms per loop In []: X= randn(10, 2000) In []: timeit dot(X.T, X) 10 loops, best of 3: 80.7 ms per loop Performance of triu_indices: In []: timeit triu_indices(125) 1000 loops, best of 3: 662 us per loop In []: timeit triu_indices(250) 100 loops, best of 3: 2.55 ms per loop In []: timeit triu_indices(500) 100 loops, best of 3: 15 ms per loop In []: timeit triu_indices(1000) 10 loops, best of 3: 59.8 ms per loop In []: timeit triu_indices(2000) 1 loops, best of 3: 239 ms per loop So the tri*_indices calculations seems to be unreasonable slow compared to for example calculations of inner products. Now, just to compare for a very naive implementation of triu indices. In []: def iut(n): ..: r= np.empty(n* (n+ 1)/ 2, dtype= int) ..: c= r.copy() ..: a= np.arange(n) ..: m= 0 ..: for i in xrange(n): ..: ni= n- i ..: mni= m+ ni ..: r[m: mni]= i ..: c[m: mni]= a[i: n] ..: m+= ni ..: return (r, c) ..: Are we really calculating the same thing? In []: triu_indices(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4])) In []: iut(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4])) Seems so, and then its performance: In []: timeit iut(125) 1000 loops, best of 3: 992 us per loop In []: timeit iut(250) 100 loops, best of 3: 2.03 ms per loop In []: timeit iut(500) 100 loops, best of 3: 5.3 ms per loop In []: timeit iut(1000) 100 loops, best of 3: 13.9 ms per loop In []: timeit iut(2000) 10 loops, best of 3: 39.8 ms per loop Even the naive implementation is very slow, but allready outperforms triu_indices, when n is > 250! So finally my question is how one could substantially improve the performance of indices calculations? Regards, eat
On Mon, Jan 24, 2011 at 4:29 PM, eat <e.antero.tammi@gmail.com> wrote:
Hi,
Running on: In []: np.__version__ Out[]: '1.5.1' In []: sys.version Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]'
For the reference: In []: X= randn(10, 125) In []: timeit dot(X.T, X) 10000 loops, best of 3: 170 us per loop In []: X= randn(10, 250) In []: timeit dot(X.T, X) 1000 loops, best of 3: 671 us per loop In []: X= randn(10, 500) In []: timeit dot(X.T, X) 100 loops, best of 3: 5.15 ms per loop In []: X= randn(10, 1000) In []: timeit dot(X.T, X) 100 loops, best of 3: 20 ms per loop In []: X= randn(10, 2000) In []: timeit dot(X.T, X) 10 loops, best of 3: 80.7 ms per loop
Performance of triu_indices: In []: timeit triu_indices(125) 1000 loops, best of 3: 662 us per loop In []: timeit triu_indices(250) 100 loops, best of 3: 2.55 ms per loop In []: timeit triu_indices(500) 100 loops, best of 3: 15 ms per loop In []: timeit triu_indices(1000) 10 loops, best of 3: 59.8 ms per loop In []: timeit triu_indices(2000) 1 loops, best of 3: 239 ms per loop
So the tri*_indices calculations seems to be unreasonable slow compared to for example calculations of inner products.
Now, just to compare for a very naive implementation of triu indices. In []: def iut(n): ..: r= np.empty(n* (n+ 1)/ 2, dtype= int) ..: c= r.copy() ..: a= np.arange(n) ..: m= 0 ..: for i in xrange(n): ..: ni= n- i ..: mni= m+ ni ..: r[m: mni]= i ..: c[m: mni]= a[i: n] ..: m+= ni ..: return (r, c) ..:
Are we really calculating the same thing? In []: triu_indices(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4])) In []: iut(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Seems so, and then its performance: In []: timeit iut(125) 1000 loops, best of 3: 992 us per loop In []: timeit iut(250) 100 loops, best of 3: 2.03 ms per loop In []: timeit iut(500) 100 loops, best of 3: 5.3 ms per loop In []: timeit iut(1000) 100 loops, best of 3: 13.9 ms per loop In []: timeit iut(2000) 10 loops, best of 3: 39.8 ms per loop
Even the naive implementation is very slow, but allready outperforms triu_indices, when n is > 250!
So finally my question is how one could substantially improve the performance of indices calculations?
What's the timing of this version (taken from nitime) ? it builds a full intermediate array m = np.ones((n,n),int) a = np.triu(m,k) np.where(a != 0)
n=5 m = np.ones((n,n),int) np.where(np.triu(m,0) != 0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
or maybe variations on it that all build a full intermediate matrix
np.where(1-np.tri(n,n,-1)) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
np.where(np.subtract.outer(np.arange(n), np.arange(n))<=0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Josef
Regards, eat
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, Jan 24, 2011 at 6:49 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 24, 2011 at 4:29 PM, eat <e.antero.tammi@gmail.com> wrote:
Hi,
Running on: In []: np.__version__ Out[]: '1.5.1' In []: sys.version Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]'
For the reference: In []: X= randn(10, 125) In []: timeit dot(X.T, X) 10000 loops, best of 3: 170 us per loop In []: X= randn(10, 250) In []: timeit dot(X.T, X) 1000 loops, best of 3: 671 us per loop In []: X= randn(10, 500) In []: timeit dot(X.T, X) 100 loops, best of 3: 5.15 ms per loop In []: X= randn(10, 1000) In []: timeit dot(X.T, X) 100 loops, best of 3: 20 ms per loop In []: X= randn(10, 2000) In []: timeit dot(X.T, X) 10 loops, best of 3: 80.7 ms per loop
Performance of triu_indices: In []: timeit triu_indices(125) 1000 loops, best of 3: 662 us per loop In []: timeit triu_indices(250) 100 loops, best of 3: 2.55 ms per loop In []: timeit triu_indices(500) 100 loops, best of 3: 15 ms per loop In []: timeit triu_indices(1000) 10 loops, best of 3: 59.8 ms per loop In []: timeit triu_indices(2000) 1 loops, best of 3: 239 ms per loop
So the tri*_indices calculations seems to be unreasonable slow compared to for example calculations of inner products.
Now, just to compare for a very naive implementation of triu indices. In []: def iut(n): ..: r= np.empty(n* (n+ 1)/ 2, dtype= int) ..: c= r.copy() ..: a= np.arange(n) ..: m= 0 ..: for i in xrange(n): ..: ni= n- i ..: mni= m+ ni ..: r[m: mni]= i ..: c[m: mni]= a[i: n] ..: m+= ni ..: return (r, c) ..:
Are we really calculating the same thing? In []: triu_indices(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4])) In []: iut(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Seems so, and then its performance: In []: timeit iut(125) 1000 loops, best of 3: 992 us per loop In []: timeit iut(250) 100 loops, best of 3: 2.03 ms per loop In []: timeit iut(500) 100 loops, best of 3: 5.3 ms per loop In []: timeit iut(1000) 100 loops, best of 3: 13.9 ms per loop In []: timeit iut(2000) 10 loops, best of 3: 39.8 ms per loop
Even the naive implementation is very slow, but allready outperforms triu_indices, when n is > 250!
So finally my question is how one could substantially improve the performance of indices calculations?
What's the timing of this version (taken from nitime) ? it builds a full intermediate array
I should have checked the numpy source first, that's exactly the implementation of triu_indices in numpy 1.5.1 Josef
m = np.ones((n,n),int) a = np.triu(m,k) np.where(a != 0)
n=5 m = np.ones((n,n),int) np.where(np.triu(m,0) != 0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
or maybe variations on it that all build a full intermediate matrix
np.where(1-np.tri(n,n,-1)) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
np.where(np.subtract.outer(np.arange(n), np.arange(n))<=0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Josef
Regards, eat
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi, On Tue, Jan 25, 2011 at 1:49 AM, <josef.pktd@gmail.com> wrote:
Hi,
Running on: In []: np.__version__ Out[]: '1.5.1' In []: sys.version Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)]'
For the reference: In []: X= randn(10, 125) In []: timeit dot(X.T, X) 10000 loops, best of 3: 170 us per loop In []: X= randn(10, 250) In []: timeit dot(X.T, X) 1000 loops, best of 3: 671 us per loop In []: X= randn(10, 500) In []: timeit dot(X.T, X) 100 loops, best of 3: 5.15 ms per loop In []: X= randn(10, 1000) In []: timeit dot(X.T, X) 100 loops, best of 3: 20 ms per loop In []: X= randn(10, 2000) In []: timeit dot(X.T, X) 10 loops, best of 3: 80.7 ms per loop
Performance of triu_indices: In []: timeit triu_indices(125) 1000 loops, best of 3: 662 us per loop In []: timeit triu_indices(250) 100 loops, best of 3: 2.55 ms per loop In []: timeit triu_indices(500) 100 loops, best of 3: 15 ms per loop In []: timeit triu_indices(1000) 10 loops, best of 3: 59.8 ms per loop In []: timeit triu_indices(2000) 1 loops, best of 3: 239 ms per loop
So the tri*_indices calculations seems to be unreasonable slow compared to for example calculations of inner products.
Now, just to compare for a very naive implementation of triu indices. In []: def iut(n): ..: r= np.empty(n* (n+ 1)/ 2, dtype= int) ..: c= r.copy() ..: a= np.arange(n) ..: m= 0 ..: for i in xrange(n): ..: ni= n- i ..: mni= m+ ni ..: r[m: mni]= i ..: c[m: mni]= a[i: n] ..: m+= ni ..: return (r, c) ..:
Are we really calculating the same thing? In []: triu_indices(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4])) In []: iut(5) Out[]: (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Seems so, and then its performance: In []: timeit iut(125) 1000 loops, best of 3: 992 us per loop In []: timeit iut(250) 100 loops, best of 3: 2.03 ms per loop In []: timeit iut(500) 100 loops, best of 3: 5.3 ms per loop In []: timeit iut(1000) 100 loops, best of 3: 13.9 ms per loop In []: timeit iut(2000) 10 loops, best of 3: 39.8 ms per loop
Even the naive implementation is very slow, but allready outperforms triu_indices, when n is > 250!
So finally my question is how one could substantially improve the
On Mon, Jan 24, 2011 at 4:29 PM, eat <e.antero.tammi@gmail.com> wrote: performance
of indices calculations?
What's the timing of this version (taken from nitime) ? it builds a full intermediate array
m = np.ones((n,n),int) a = np.triu(m,k) np.where(a != 0)
n=5 m = np.ones((n,n),int) np.where(np.triu(m,0) != 0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
This is ~20% slower than triu_indices.
or maybe variations on it that all build a full intermediate matrix
np.where(1-np.tri(n,n,-1)) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
This ~5% faster than triu_indicies.
np.where(np.subtract.outer(np.arange(n), np.arange(n))<=0) (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]), array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
Clever, and its 50% faster than triu_indices, but the naive implementaion is still almost 3x faster. However it seems that some 80% of time is spent in where. So simple subtract.outer(arange(n), arange(n))<= 0 and logical indenxing outperforms slightly the naive version. I was hoping to find a way to do this at least 10x faster than naive implementation, but meanwhile I'll stick with logical indexing. Thanks, eat
Josef
Regards, eat
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (3)
-
E. Antero Tammi -
eat -
josef.pktd@gmail.com