[Numpy-discussion] How to improve performance of slow tri*_indices calculations?

E. Antero Tammi e.antero.tammi at gmail.com
Tue Jan 25 08:08:26 EST 2011


Hi,

On Tue, Jan 25, 2011 at 1:49 AM, <josef.pktd at gmail.com> wrote:

> On Mon, Jan 24, 2011 at 4:29 PM, eat <e.antero.tammi at gmail.com> wrote:
> > Hi,
> >
> > Running on:
> > In []: np.__version__
> > Out[]: '1.5.1'
> > In []: sys.version
> > Out[]: '2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit
> (Intel)]'
> >
> > For the reference:
> > In []: X= randn(10, 125)
> > In []: timeit dot(X.T, X)
> > 10000 loops, best of 3: 170 us per loop
> > In []: X= randn(10, 250)
> > In []: timeit dot(X.T, X)
> > 1000 loops, best of 3: 671 us per loop
> > In []: X= randn(10, 500)
> > In []: timeit dot(X.T, X)
> > 100 loops, best of 3: 5.15 ms per loop
> > In []: X= randn(10, 1000)
> > In []: timeit dot(X.T, X)
> > 100 loops, best of 3: 20 ms per loop
> > In []: X= randn(10, 2000)
> > In []: timeit dot(X.T, X)
> > 10 loops, best of 3: 80.7 ms per loop
> >
> > Performance of triu_indices:
> > In []: timeit triu_indices(125)
> > 1000 loops, best of 3: 662 us per loop
> > In []: timeit triu_indices(250)
> > 100 loops, best of 3: 2.55 ms per loop
> > In []: timeit triu_indices(500)
> > 100 loops, best of 3: 15 ms per loop
> > In []: timeit triu_indices(1000)
> > 10 loops, best of 3: 59.8 ms per loop
> > In []: timeit triu_indices(2000)
> > 1 loops, best of 3: 239 ms per loop
> >
> > So the tri*_indices calculations seem to be unreasonably slow compared
> > to, for example, calculations of inner products.
> >
> > Now, just to compare, here is a very naive implementation of triu indices:
> > In []: def iut(n):
> >   ..:     r= np.empty(n* (n+ 1)/ 2, dtype= int)
> >   ..:     c= r.copy()
> >   ..:     a= np.arange(n)
> >   ..:     m= 0
> >   ..:     for i in xrange(n):
> >   ..:         ni= n- i
> >   ..:         mni= m+ ni
> >   ..:         r[m: mni]= i
> >   ..:         c[m: mni]= a[i: n]
> >   ..:         m+= ni
> >   ..:     return (r, c)
> >   ..:
> >
> > Are we really calculating the same thing?
> > In []: triu_indices(5)
> > Out[]:
> > (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]),
> >  array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
> > In []: iut(5)
> > Out[]:
> > (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]),
> >  array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))
> >
> > Seems so, and then its performance:
> > In []: timeit iut(125)
> > 1000 loops, best of 3: 992 us per loop
> > In []: timeit iut(250)
> > 100 loops, best of 3: 2.03 ms per loop
> > In []: timeit iut(500)
> > 100 loops, best of 3: 5.3 ms per loop
> > In []: timeit iut(1000)
> > 100 loops, best of 3: 13.9 ms per loop
> > In []: timeit iut(2000)
> > 10 loops, best of 3: 39.8 ms per loop
> >
> > Even the naive implementation is very slow, but it already outperforms
> > triu_indices when n > 250!
> >
> > So, finally, my question is: how could one substantially improve the
> > performance of these index calculations?
>
> What's the timing of this version (taken from nitime)? It builds a
> full intermediate array:
>
>    m = np.ones((n,n),int)
>    a = np.triu(m,k)
>    np.where(a != 0)
>
> >>> n=5
> >>> m = np.ones((n,n),int)
> >>> np.where(np.triu(m,0) != 0)
> (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]),
> array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))

This is ~20% slower than triu_indices.

>
> or maybe variations on it that all build a full intermediate matrix:
>
> >>> np.where(1-np.tri(n,n,-1))
> (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]),
> array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))

This is ~5% faster than triu_indices.

>
> >>> np.where(np.subtract.outer(np.arange(n), np.arange(n))<=0)
> (array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]),
> array([0, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4]))

Clever, and it's 50% faster than triu_indices, but the naive implementation is
still almost 3x faster.

However, it seems that some 80% of the time is spent in where. So a simple
subtract.outer(arange(n), arange(n)) <= 0 with logical indexing slightly
outperforms the naive version.
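
To spell out what I mean by logical indexing (a sketch; the Gram matrix here
is only for illustration, and I haven't timed this exact snippet):

```python
import numpy as np

n = 5
# True on and above the diagonal; built without calling np.where
mask = np.subtract.outer(np.arange(n), np.arange(n)) <= 0

X = np.random.randn(10, n)
G = np.dot(X.T, X)   # n x n Gram matrix
upper = G[mask]      # flattened upper triangle via boolean indexing
# selects the same elements, in the same (row-major) order, as
# G[np.triu_indices(n)], without materializing the index arrays
```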

I was hoping to find a way to do this at least 10x faster than the naive
implementation, but meanwhile I'll stick with logical indexing.
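
For completeness, here is one fully loop-free way to build the same index
arrays using only repeat, arange and cumsum (a sketch; the function name
triu_idx and the offset trick are my own, and I haven't benchmarked it
against the versions above):

```python
import numpy as np

def triu_idx(n):
    # run lengths of the upper-triangular rows: n, n-1, ..., 1
    counts = np.arange(n, 0, -1)
    # row index i repeated (n - i) times
    r = np.repeat(np.arange(n), counts)
    # starting flat position of each row's run of columns
    starts = np.cumsum(counts) - counts
    # column index = offset within the row's run + row index
    c = np.arange(counts.sum()) - np.repeat(starts, counts) + r
    return r, c
```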

Thanks,
eat

>
> Josef
>
> >
> >
> > Regards,
> > eat
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >

