(Resending without attachment as I don't think my previous message arrived.)

I just started using numpy and am very, very pleased with the functionality and cleanness so far. However, I tried what I thought would be a simple optimization and found that the opposite was true. Specifically, I had a loop where something like this was done:

    w += Xi[mu,:]
    E = np.dot(Xi,w)

Instead of repeatedly doing the matrix product I thought I'd do it just once, before the loop, computing Xi2 = np.dot(Xi,Xi.T), and then do:

    w += Xi[mu,:]
    E += Xi2[mu,:]

This seems like a clear winner: instead of doing a matrix multiplication it simply has to sum two vectors (in place). However, it turned out to be 1.5 times SLOWER...

I've attached a test file which shows the problem. It also tries adding columns instead of rows (in case the memory layout is playing tricks), but this seems to make no difference. This is the output I got:

Dot product: 5.188786
Add a row: 8.032767
Add a column: 8.070953

Any ideas on why adding a row (or column) of a matrix is slower than computing a matrix product with a similarly sized matrix? (Xi has fewer columns than Xi2, but just as many rows.)
Jasper van de Gronde wrote:
I've attached a test file which shows the problem. It also tries adding columns instead of rows (in case the memory layout is playing tricks), but this seems to make no difference. This is the output I got:
Dot product: 5.188786
Add a row: 8.032767
Add a column: 8.070953
Any ideas on why adding a row (or column) of a matrix is slower than computing a matrix product with a similarly sized matrix? (Xi has fewer columns than Xi2, but just as many rows.)
I think we need some numbers to put this into context -- how big are the vectors/matrices? How many iterations was the loop run? If the vectors are small and the loop is run many times, how fast the operation "ought" to be is irrelevant, as it would drown in Python overhead.

Dag Sverre
Dag Sverre Seljebotn wrote:
I think we need some numbers to put this into context -- how big are the vectors/matrices? How many iterations was the loop run? If the vectors are small and the loop is run many times, how fast the operation "ought" to be is irrelevant as it would drown in Python overhead.
Originally I had attached a Python file demonstrating the problem, but apparently this wasn't accepted by the list. In any case, the matrices and vectors weren't too big (60x20), so I tried making them bigger and indeed the "fast" version was now considerably faster. But still, this seems like a very odd difference. I know Python is an interpreted language and has a lot of overhead, but still, selecting a row/column shouldn't be THAT slow, should it? To be clear, this is the code I used for testing:
--------------------------------------------------------------------
import timeit

setupCode = """
import numpy as np
P = 60
N = 20
Xi = np.random.standard_normal((P,N))
w = np.random.standard_normal((N))
Xi2 = np.dot(Xi,Xi.T)
E = np.dot(Xi,w)
"""

N = 10000

dotProduct = timeit.Timer('E = np.dot(Xi,w)',setupCode)
additionRow = timeit.Timer('E += Xi2[P/2,:]',setupCode)
additionCol = timeit.Timer('E += Xi2[:,P/2]',setupCode)

print "Dot product: %f" % dotProduct.timeit(N)
print "Add a row: %f" % additionRow.timeit(N)
print "Add a column: %f" % additionCol.timeit(N)
--------------------------------------------------------------------
Jasper van de Gronde wrote:
Originally I had attached a Python file demonstrating the problem, but apparently this wasn't accepted by the list. In any case, the matrices and vectors weren't too big (60x20), so I tried making them bigger and indeed the "fast" version was now considerably faster.
60x20 is "nothing", so a full matrix multiplication and a single matrix-vector product probably take the same time (that is, the difference between them is likely smaller than the measurement error). In this context, the benchmarks will be completely dominated by the number of Python calls you make (each call, especially taking the slice, means allocating Python objects, calling a bunch of functions in C, etc.). So it's not that strange: taking a slice isn't free, some Python objects must be created, etc. I think the lesson mostly should be that with so little data, benchmarking becomes a very difficult art.

Dag Sverre
On Friday 11 December 2009 16:44:29, Dag Sverre Seljebotn wrote:
60x20 is "nothing", so a full matrix multiplication and a single matrix-vector product probably take the same time (that is, the difference between them is likely smaller than the measurement error).
In this context, the benchmarks will be completely dominated by the number of Python calls you make (each call, especially taking the slice, means allocating Python objects, calling a bunch of functions in C, etc.). So it's not that strange: taking a slice isn't free, some Python objects must be created, etc.
Yeah, I think taking slices here is taking quite a lot of time:

In [58]: timeit E + Xi2[P/2,:]
100000 loops, best of 3: 3.95 µs per loop

In [59]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.17 µs per loop

I don't know why the additional ',:' in the slice is taking so much time, but my guess is that passing and analyzing the second argument (slice(None,None,None)) could be responsible for the slowdown (though that shouldn't take this much time). Mmh, perhaps it would be worth studying this more carefully so that an optimization could be done in NumPy.
I think the lesson mostly should be that with so little data, benchmarking becomes a very difficult art.
Well, I think it is not difficult, it is just that you are perhaps benchmarking Python/NumPy machinery instead ;-) I'm curious whether Matlab can do slicing much faster than NumPy. Jasper?

-- Francesc Alted
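[Editor's note: for anyone wanting to reproduce the comparison locally, here is a small self-contained sketch in modern Python 3 form of the thread's snippets (P//2 instead of P/2, timeit's globals parameter). Absolute numbers will of course differ by machine and NumPy version; on 2009-era NumPy the explicit ',:' form was measurably slower, while on current versions the gap may have narrowed or vanished.]

```python
import timeit
import numpy as np

P, N = 60, 20
Xi = np.random.standard_normal((P, N))
Xi2 = np.dot(Xi, Xi.T)
E = np.dot(Xi, np.random.standard_normal(N))

# Both expressions select the same row of Xi2; only the indexing
# path through NumPy's machinery differs.
assert np.array_equal(Xi2[P // 2], Xi2[P // 2, :])

t_row = timeit.timeit('E + Xi2[P // 2]', number=100000, globals=globals())
t_slice = timeit.timeit('E + Xi2[P // 2, :]', number=100000, globals=globals())
print('plain row index: %fs  explicit slice: %fs' % (t_row, t_slice))
```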
On 12/11/2009 10:03 AM, Francesc Alted wrote:
Yeah, I think taking slices here is taking quite a lot of time:

In [58]: timeit E + Xi2[P/2,:]
100000 loops, best of 3: 3.95 µs per loop

In [59]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.17 µs per loop

I don't know why the additional ',:' in the slice is taking so much time, but my guess is that passing and analyzing the second argument (slice(None,None,None)) could be responsible for the slowdown (though that shouldn't take this much time). Mmh, perhaps it would be worth studying this more carefully so that an optimization could be done in NumPy.
I think the lesson mostly should be that with so little data, benchmarking becomes a very difficult art.
Well, I think it is not difficult, it is just that you are perhaps benchmarking Python/NumPy machinery instead ;-) I'm curious whether Matlab can do slicing much faster than NumPy. Jasper?
What are you actually trying to test here? I do not see any equivalence in the operations or output here:
- With your slices you need two dot products, but ultimately you are only using one for your dot product.
- There are addition operations on the slices that are not present in the dot product.
- The final E arrays are not the same for all three operations.

Having said that, the more you can vectorize your function, the more efficient it will likely be, especially with ATLAS etc.

Bruce
On Friday 11 December 2009 17:36:54, Bruce Southey wrote:
What are you actually trying to test here?
Good question. I don't know for sure :-)
I do not see any equivalence in the operations or output here:
- With your slices you need two dot products, but ultimately you are only using one for your dot product.
- There are addition operations on the slices that are not present in the dot product.
- The final E arrays are not the same for all three operations.
I don't understand the ultimate goal of the OP either, but what caught my attention was that:

In [74]: timeit Xi2[P/2]
1000000 loops, best of 3: 278 ns per loop

In [75]: timeit Xi2[P/2,:]
1000000 loops, best of 3: 1.04 µs per loop

i.e. adding an additional parameter (the ':') to the slice makes it run almost 4x slower. And with this, the problem exposed by the OP is *partially* explained, i.e.:

In [77]: timeit np.dot(Xi,w)
100000 loops, best of 3: 2.91 µs per loop

In [78]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.05 µs per loop

In [79]: timeit E + Xi2[P/2,:]
100000 loops, best of 3: 3.81 µs per loop

But again, don't ask me whether the results are okay or not. I'm playing here the role of a pure computational scientist on a very concrete problem ;-)
Having said that, the more you can vectorize your function, the more efficient it will likely be especially with Atlas etc.
Except if your arrays are small enough, which is the underlying issue here IMO.

-- Francesc Alted
On Fri, Dec 11, 2009 at 10:06 PM, Bruce Southey <bsouthey@gmail.com> wrote:
Having said that, the more you can vectorize your function, the more efficient it will likely be especially with Atlas etc.
One thing to note is that dot uses optimized ATLAS if available, which makes it quite a bit faster than equivalent operations you would do using pure numpy. I doubt that's the reason here, since the arrays are small, but it's something to keep in mind when performance matters: use dot wherever possible, it is generally faster than prod/sum.

cheers,

David
One thing to note is that dot uses optimized ATLAS if available, which makes it quite a bit faster than equivalent operations you would do using pure numpy. I doubt that's the reason here, since the arrays are small, but it's something to keep in mind when performance matters: use dot wherever possible, it is generally faster than prod/sum.
This is quite true; I once had a very large matrix (600 x 200,000) that I needed to normalize. Using .sum() and /= took about 30 minutes. When I switched to using dot() for the same operation (matrix multiplication with a vector of 1's, then turning that into a diagonal matrix and using dot() again to normalize), the computation time dropped to about 2 minutes. Most of the gain was likely due to ATLAS using all the cores while numpy only uses one, but I was still impressed.

--Hoyt

+ Hoyt Koepke
+ University of Washington Department of Statistics
+ http://www.stat.washington.edu/~hoytak/
+ hoytak@gmail.com
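[Editor's note: Hoyt's exact code isn't shown; the following is a minimal sketch of the idea on a smaller stand-in matrix, assuming row normalization. The names (A, row_sums) are made up for illustration.]

```python
import numpy as np

A = np.abs(np.random.standard_normal((600, 200))) + 0.1  # stand-in matrix

# Row sums computed as a matrix-vector product (which goes through BLAS)
# instead of A.sum(axis=1) (which goes through NumPy's own loops).
row_sums = np.dot(A, np.ones(A.shape[1]))
assert np.allclose(row_sums, A.sum(axis=1))

# Normalization expressed as a second dot product with a diagonal
# matrix of reciprocals, as described in the message above.
A_norm = np.dot(np.diag(1.0 / row_sums), A)
assert np.allclose(A_norm.sum(axis=1), 1.0)
```

(For a matrix of this shape, broadcasting `A / row_sums[:, np.newaxis]` gives the same result without materializing the diagonal matrix.)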
Francesc Alted wrote:
... Yeah, I think taking slices here is taking quite a lot of time:

In [58]: timeit E + Xi2[P/2,:]
100000 loops, best of 3: 3.95 µs per loop

In [59]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.17 µs per loop

I don't know why the additional ',:' in the slice is taking so much time, but my guess is that passing and analyzing the second argument (slice(None,None,None)) could be responsible for the slowdown (though that shouldn't take this much time). Mmh, perhaps it would be worth studying this more carefully so that an optimization could be done in NumPy.
This is indeed interesting! And very nice that this actually works the way you'd expect it to. I guess I've just worked too long with Matlab :)
I think the lesson mostly should be that with so little data, benchmarking becomes a very difficult art.
Well, I think it is not difficult, it is just that you are perhaps benchmarking Python/NumPy machinery instead ;-) I'm curious whether Matlab can do slicing much faster than NumPy. Jasper?
I had a look; these are the timings for Python for 60x20:

Dot product: 0.051165 (5.116467e-06 per iter)
Add a row: 0.092849 (9.284860e-06 per iter)
Add a column: 0.082523 (8.252348e-06 per iter)

For Matlab 60x20:

Dot product: 0.029927 (2.992664e-006 per iter)
Add a row: 0.019664 (1.966444e-006 per iter)
Add a column: 0.008384 (8.384376e-007 per iter)

For Python 600x200:

Dot product: 1.917235 (1.917235e-04 per iter)
Add a row: 0.113243 (1.132425e-05 per iter)
Add a column: 0.162740 (1.627397e-05 per iter)

For Matlab 600x200:

Dot product: 1.282778 (1.282778e-004 per iter)
Add a row: 0.107252 (1.072525e-005 per iter)
Add a column: 0.021325 (2.132527e-006 per iter)

If I fit a line through these two data points (60 and 600 rows), I get the following equations:

Python, AR: 3.8e-5 * n + 0.091
Matlab, AC: 2.4e-5 * n + 0.0069

This would suggest that Matlab performs the vector addition about 1.6 times faster and has a 13 times smaller constant cost!

As for the questions about what I'm trying to compute: these tests are minimized as much as possible to show the bottleneck I encountered; they are part of a larger loop where it does make sense. In essence I'm iteratively adjusting w, and E has to keep up (because that's what is used to determine the next change). Instead of recomputing E each time from E = Xi*w, a little linear algebra shows that the vector addition is sufficient.
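[Editor's note: the two-point fits quoted above are easy to redo; a quick check with np.polyfit, using the totals for 10000 iterations as reported in the message:]

```python
import numpy as np

rows = np.array([60.0, 600.0])
python_add_row = np.array([0.092849, 0.113243])   # Python "Add a row" totals
matlab_add_col = np.array([0.008384, 0.021325])   # Matlab "Add a column" totals

py_slope, py_const = np.polyfit(rows, python_add_row, 1)
ml_slope, ml_const = np.polyfit(rows, matlab_add_col, 1)
print('Python, AR: %.1e * n + %.4f' % (py_slope, py_const))
print('Matlab, AC: %.1e * n + %.4f' % (ml_slope, ml_const))

# Ratios quoted in the message: slope ratio ~1.6x, constant ratio ~13x.
print('slope ratio: %.2f  constant ratio: %.1f'
      % (py_slope / ml_slope, py_const / ml_const))
```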
On Sat, Dec 12, 2009 at 5:59 AM, Jasper van de Gronde <th.v.d.gronde@hccnet.nl> wrote:
I had a look; these are the timings for Python for 60x20:

Dot product: 0.051165 (5.116467e-06 per iter)
Add a row: 0.092849 (9.284860e-06 per iter)
Add a column: 0.082523 (8.252348e-06 per iter)

For Matlab 60x20:

Dot product: 0.029927 (2.992664e-006 per iter)
Add a row: 0.019664 (1.966444e-006 per iter)
Add a column: 0.008384 (8.384376e-007 per iter)

For Python 600x200:

Dot product: 1.917235 (1.917235e-04 per iter)
Add a row: 0.113243 (1.132425e-05 per iter)
Add a column: 0.162740 (1.627397e-05 per iter)

For Matlab 600x200:

Dot product: 1.282778 (1.282778e-004 per iter)
Add a row: 0.107252 (1.072525e-005 per iter)
Add a column: 0.021325 (2.132527e-006 per iter)

If I fit a line through these two data points (60 and 600 rows), I get the following equations:

Python, AR: 3.8e-5 * n + 0.091
Matlab, AC: 2.4e-5 * n + 0.0069

This would suggest that Matlab performs the vector addition about 1.6 times faster and has a 13 times smaller constant cost!
Well, I think the differences between Matlab and numpy have been well discussed elsewhere (http://www.scipy.org/NumPy_for_Matlab_Users), especially the C/Fortran order difference. Clearly you are not using the same level of optimized libraries, given the differences shown. Unfortunately your code does not distinguish between the dot products and the slicing, so the slower dot product rules your times. Really, you need to compare the slicing alone, without any dot product or in-place addition. I would suggest asking the list about the real problem, because it is often amazing what solutions have been given.

Bruce
Bruce Southey wrote:
Really I would suggest asking the list for the real problem because it is often amazing what solutions have been given.
So far this is the fastest code I've got:
------------------------------------------------------------------------
import numpy as np

nmax = 100

def minover(Xi,S):
    P,N = Xi.shape
    SXi = Xi.copy()
    for i in xrange(0,P):
        SXi[i] *= S[i]
    SXi2 = np.dot(SXi,SXi.T)
    SXiSXi2divN = np.concatenate((SXi,SXi2),axis=1)/N
    w = np.random.standard_normal((N))
    E = np.dot(SXi,w)
    wE = np.concatenate((w,E))
    for s in xrange(0,nmax*P):
        mu = wE[N:].argmin()
        wE += SXiSXi2divN[mu]
        # E' = dot(SXi,w')
        #    = dot(SXi,w + SXi[mu,:]/N)
        #    = dot(SXi,w) + dot(SXi,SXi[mu,:])/N
        #    = E + dot(SXi,SXi.T)[:,mu]/N
        #    = E + dot(SXi,SXi.T)[mu,:]/N
    return wE[:N]
------------------------------------------------------------------------

I am particularly interested in cleaning up the initialization part, but any suggestions for improving the overall performance are of course appreciated.
On 12/13/2009 05:13 AM, Jasper van de Gronde wrote:
Bruce Southey wrote:
Really I would suggest asking the list for the real problem because it is often amazing what solutions have been given.
What are Xi and S? I think that your SXi is just:

SXi = Xi*S

But really I do not understand what you are actually trying to do. As previously indicated, sometimes simplifying an algorithm can make it computationally slower.

Bruce
Bruce Southey wrote:
What are Xi and S? I think that your SXi is just: SXi = Xi*S
Sort of, it's actually (Xi.T*S).T, now that I think of it... I'll see if that is any faster. And if there is a neater way of doing it I'd love to hear about it.
But really I do not understand what you are actually trying to do. As previously indicated, sometimes simplifying an algorithm can make it computationally slower.
It was hardly simplified; this was the original function body:

    P,N = Xi.shape
    SXi = Xi.copy()
    for i in xrange(0,P):
        SXi[i] *= S[i]
    w = np.random.standard_normal((N))
    for s in xrange(0,nmax*P):
        E = np.dot(SXi,w)
        mu = E.argmin()
        w += SXi[mu]/N
    return w

As you can see it's basically some basic linear algebra (which reduces the time complexity from about O(n^3) to O(n^2)), plus some less nice tweaks to avoid the high Python overhead.
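[Editor's note: the equivalence between the original body and the rewritten one, derived in the comment block of minover() above, can be checked numerically. A small sketch, using random stand-ins for SXi and w:]

```python
import numpy as np

P, N = 60, 20
SXi = np.random.standard_normal((P, N))
w = np.random.standard_normal(N)

E = np.dot(SXi, w)
mu = E.argmin()

# Original body: recompute E from scratch after updating w.
w_new = w + SXi[mu] / N
E_recomputed = np.dot(SXi, w_new)     # O(P*N) work per step

# Rewritten body: update E in place using the precomputed Gram matrix;
# by symmetry dot(SXi, SXi.T)[mu, :] equals dot(SXi, SXi[mu]).
SXi2 = np.dot(SXi, SXi.T)
E_updated = E + SXi2[mu] / N          # O(P) work per step

assert np.allclose(E_recomputed, E_updated)
```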
On Mon, Dec 14, 2009 at 10:27 AM, Jasper van de Gronde < th.v.d.gronde@hccnet.nl> wrote:
Bruce Southey wrote:
What are Xi and S? I think that your SXi is just: SXi = Xi*S
Sort of, it's actually (Xi.T*S).T, now that I think of it... I'll see if that is any faster. And if there is a neater way of doing it I'd love to hear about it.
Xi*S[:,newaxis]

Chuck
Charles R Harris wrote:
Sort of, it's actually (Xi.T*S).T, now that I think of it... I'll see if that is any faster. And if there is a neater way of doing it I'd love to hear about it.
Xi*S[:,newaxis]
Thanks! (Obviously doesn't matter much in terms of performance, as it's only the initialization, but it definitely does make for nicer code.)
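[Editor's note: for the record, the three spellings discussed here produce the same array; a tiny check in modern Python 3 form (range instead of xrange):]

```python
import numpy as np

Xi = np.random.standard_normal((6, 4))
S = np.random.standard_normal(6)

# Explicit loop, as in the original minover() initialization:
SXi_loop = Xi.copy()
for i in range(Xi.shape[0]):
    SXi_loop[i] *= S[i]

# Double-transpose and broadcasting forms:
SXi_t = (Xi.T * S).T
SXi_bc = Xi * S[:, np.newaxis]   # S[:, np.newaxis] has shape (6, 1)

assert np.allclose(SXi_loop, SXi_t)
assert np.allclose(SXi_loop, SXi_bc)
```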
On Saturday 12 December 2009 12:59:16, Jasper van de Gronde wrote:
If I fit a line through these two data points (60 and 600 rows), I get the following equations: Python, AR: 3.8e-5 * n + 0.091 Matlab, AC: 2.4e-5 * n + 0.0069 This would suggest that Matlab performs the vector addition about 1.6 times faster and has a 13 times smaller constant cost!
Things seem to be worse than 1.6x slower for numpy, as matlab orders arrays by column, while numpy's order is by row. So, if we want to compare pears with pears:

For Python 600x200:
Add a row: 0.113243 (1.132425e-05 per iter)

For Matlab 600x200:
Add a column: 0.021325 (2.132527e-006 per iter)

which makes numpy 5x slower than matlab. Hmm, I definitely think that numpy could do better here :-/

However, caveat emptor: when you do timings you normally put your code snippets in loops, and after the first iteration the dataset (if small enough, as in your examples above) lives in the CPU caches. But this is not *usually* the case, because you first have to transmit your data to the CPU. This transmission process is normally the main bottleneck when doing BLAS-1 level operations (i.e. vector-vector). This is to say that, in real-life calculations, your numpy code will work almost as fast as matlab's.

So my advice is: don't be too worried about small-dataset speed in small loops, and concentrate your optimization efforts on making your *real* code faster.

-- Francesc Alted
A Monday 14 December 2009 17:09:13 Francesc Alted escrigué:
Things seem to be worse than 1.6x slower for numpy, because Matlab stores arrays by column while numpy stores them by row. So, if we want to compare pears with pears:
For Python 600x200: Add a row: 0.113243 (1.132425e-05 per iter)
For Matlab 600x200: Add a column: 0.021325 (2.132527e-006 per iter)
Mmh, I've repeated this benchmark on my machine and got:

In [59]: timeit E + Xi2[P/2]
100000 loops, best of 3: 2.8 µs per loop

that is, very similar to Matlab's 2.1 µs and quite far from the 11 µs you are getting for numpy on your machine... I'm using a Core2 @ 3 GHz.

-- Francesc Alted
Francesc Alted wrote:
Mmh, I've repeated this benchmark on my machine and got:
In [59]: timeit E + Xi2[P/2] 100000 loops, best of 3: 2.8 µs per loop
that is, very similar to matlab's 2.1 µs and quite far from the 11 µs you are getting for numpy in your machine... I'm using a Core2 @ 3 GHz.
I'm using Python 2.6 and numpy 1.4.0rc1 on a Core2 @ 1.33 GHz (notebook). I'll have a look later to see if upgrading Python to 2.6.4 makes a difference.
A Monday 14 December 2009 18:20:32 Jasper van de Gronde escrigué:
I'm using Python 2.6 and numpy 1.4.0rc1 on a Core2 @ 1.33 GHz (notebook). I'll have a look later to see if upgrading Python to 2.6.4 makes a difference.
I don't think so. Your machine is slow by today's standards, so the 5x slowdown should be due to python/numpy overhead, and unfortunately that is nothing that can be solved magically by using a newer python/numpy version.

-- Francesc Alted
On Mon, Dec 14, 2009 at 12:51 PM, Francesc Alted <faltet@pytables.org> wrote:
I don't think so. Your machine is slow by today's standards, so the 5x slowdown should be due to python/numpy overhead, and unfortunately that is nothing that can be solved magically by using a newer python/numpy version.
dot is slow on a single CPU, on an older notebook with an older ATLAS and low on memory (dot cannot multi-process). It looks like adding a row is almost pure overhead for 600x200:
print "Dot product: %f" % dotProduct.timeit(N)
Dot product: 3.124008
print "Add a row: %f" % additionRow.timeit(N)
Add a row: 0.080612
print "Add a column: %f" % additionCol.timeit(N)
Add a column: 0.113229
for 60x20
print "Dot product: %f" % dotProduct.timeit(N)
Dot product: 0.070933
print "Add a row: %f" % additionRow.timeit(N)
Add a row: 0.058492
print "Add a column: %f" % additionCol.timeit(N)
Add a column: 0.061401
600x2000 (dot may induce swapping to disc)
print "Dot product: %f" % dotProduct.timeit(N)
Dot product: 43.114585
print "Add a row: %f" % additionRow.timeit(N)
Add a row: 0.085261
print "Add a column: %f" % additionCol.timeit(N)
Add a column: 0.122754
print "Dot product: %f" % dotProduct.timeit(N)
Dot product: 35.232084
Josef
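Josef's observation that the row addition is almost pure overhead can be made visible by timing the same addition at several sizes: the per-call cost barely grows with n until the arrays get large, because the fixed Python/numpy call overhead dominates. A rough sketch:

```python
import timeit
import numpy as np

# Time a vector addition at growing sizes; for small n the per-call cost
# is dominated by fixed interpreter/ufunc overhead, not by the arithmetic.
times = []
for n in (60, 600, 6000):
    E = np.zeros(n)
    row = np.ones(n)
    t = timeit.timeit(lambda: E + row, number=10000)
    times.append(t / 10000)
    print("n=%5d: %.3e s per add" % (n, times[-1]))
```

On most machines the 60-element and 600-element additions take nearly the same time per call.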
-- Francesc Alted
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Mon, 14 Dec 2009 17:09:13 +0100, Francesc Alted wrote: [clip]
which makes numpy 5x slower than matlab. Hmm, I definitely think that numpy could do better here :-/
It could be useful to track down what exactly is slow, by profiling the actual C code. Unfortunately, profiling shared libraries is somewhat difficult. Some tools that I've seen work (on Linux):

- Valgrind (+ KCacheGrind): together with its cache profiler, this can give useful information on what the slow part is, and on which lines most of the time is spent.
- Oprofile: nice sample-based profiler, but requires root.
- Qprof (32-bit only): good for quick sample-based profiling at the function level. Easy to use.
- Sprof: "the" way to profile dynamically linked libraries on Linux. Function-level, and slightly obscure to use.

So if someone wants to spend time on this, those are the tools I'd recommend :)

-- Pauli Virtanen
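As a concrete starting point for the first option, a callgrind invocation might look like the sketch below (my own example, not from the thread; it assumes valgrind and its callgrind_annotate script are installed, and it profiles the whole interpreter run, so the numpy workload needs to dominate the runtime):

```shell
# Sketch: profile the C side of a small numpy workload with callgrind.
# Skip gracefully if valgrind is not installed on this machine.
command -v valgrind >/dev/null 2>&1 || { echo "valgrind not found, skipping"; exit 0; }

valgrind --tool=callgrind --callgrind-out-file=callgrind.out \
  python -c "import numpy as np; E = np.zeros(600); r = np.ones(600); [E + r for _ in range(10000)]"

# Text summary of per-function costs; or open callgrind.out in KCacheGrind
callgrind_annotate callgrind.out | head -n 40
```

The per-function costs should show whether the time goes into the ufunc inner loop or into call setup overhead.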
A general question: is there a collection of numpy code snippets as transformed by experts, short of googling site:mail.scipy.org/pipermail/numpy-discussion or the like? And a subquestion: does anyone have a list of algebraic identities for .T, vstack etc.? For a real example, to transform:

dots = np.vstack([ dot( x[j:j+4].T, imatrix ).T for j in ...])

cheers
-- denis
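On the transpose identities: the main one at work in that snippet is (A·B)^T = B^T·A^T, which removes the double transpose from each term. A small check (I've reused the names x and imatrix from denis's snippet, but the sizes and the slicing pattern are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(12, 4)        # hypothetical data; only x[j:j+4] slices are used
imatrix = rng.rand(4, 4)   # hypothetical 4x4 matrix

# (A B)^T == B^T A^T, hence dot(x[j:j+4].T, imatrix).T == dot(imatrix.T, x[j:j+4])
for j in range(0, 8, 4):
    lhs = np.dot(x[j:j+4].T, imatrix).T
    rhs = np.dot(imatrix.T, x[j:j+4])
    assert np.allclose(lhs, rhs)
```

So the list comprehension could be rewritten with `dot(imatrix.T, x[j:j+4])` per term, saving one transpose (and a copy-free one at that, since .T only swaps strides).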
participants (10)

- Bruce Southey
- Charles R Harris
- Dag Sverre Seljebotn
- David Cournapeau
- denis
- Francesc Alted
- Hoyt Koepke
- Jasper van de Gronde
- josef.pktd@gmail.com
- Pauli Virtanen