Numpy x Matlab: some synthetic benchmarks
Hello, Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500: Operation x'*y x*y' A*x A*B A'*x Half 2in2 Dimension 5 Array 0.94 0.7 0.22 0.28 1.12 0.98 1.1 Matrix 7.06 1.57 0.66 0.79 1.6 3.11 4.56 Matlab 1.88 0.44 0.41 0.35 0.37 1.2 0.98 Dimension 50 Array 9.74 3.09 0.56 18.12 13.93 4.2 4.33 Matrix 81.99 3.81 1.04 19.13 14.58 6.3 7.88 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77 Dimension 500 Array 1.2 8.97 2.03 166.59 20.34 3.99 4.31 Matrix 17.95 9.09 2.07 166.62 20.67 4.11 4.45 Matlab 2.09 6.07 2.17 169.45 2.1 2.56 3.06 Obs: The operation Half is actually A*x using only the lower half of the matrix and vector. The operation 2in2 is A*x using only the even indexes. Of course there are many repetitions of the same operation: 100000 for dim 5 and 50 and 1000 for dim 500. The inner product is number of repetitions is multiplied by dimension (it is very fast). The software is numpy svn version 1926 Matlab 6.5.0.180913a Release 13 (Jun 18 2002) Both softwares are using the *same* BLAS and LAPACK (ATLAS for sse). As you can see, numpy array looks very competitive. The matrix class in numpy has too much overhead for small dimension though. This overhead is very small for medium size arrays. Looking at the results above (specially the small dimensions ones, for higher dimensions the main computations are being performed by the same BLAS) I believe we can say: 1) Numpy array is faster on usual operations but outerproduct (I believe the reason is that the dot function uses the regular matrix multiplication to compute outerproducts, instead of using a special function. This can "easily" changes). In particular numpy was faster in matrix times vector operations, which is the most usual in numerical linear algebra. 2) Any operation that involves transpose suffers a very big penalty in numpy. Compare A'*x and A*x, it is 10 times slower. In contrast Matlab deals with transpose quite well. Travis is already aware of this and it can be probably solved. 3) When using subarrays, numpy is a slower. The difference seems acceptable. Travis, can this be improved? Best, Paulo Obs: Latter on (in a couple of days) I may present less synthetic benchmarks (a QR factorization and a Modified Cholesky).
Hi, Thanks for doing this as it helps determine which approach to take when coding problems. Could you add the Numeric and numarray to these benchmarks? If for no other reason to show the advantage of the new numpy. I am curious in your code because you get very different results for matrix class depending on whether x or y is transposed. Do you first transpose the x or y first before the multiplication or is the multiplication done in place by just switching the indices? Also, for x'*y, is the results for Dimension 50 and Dimension 500 switched? Thanks Bruce On 1/18/06, Paulo Jose da Silva e Silva <pjssilva@ime.usp.br> wrote:
Hello,
Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500:
Operation x'*y x*y' A*x A*B A'*x Half 2in2
Dimension 5 Array 0.94 0.7 0.22 0.28 1.12 0.98 1.1 Matrix 7.06 1.57 0.66 0.79 1.6 3.11 4.56 Matlab 1.88 0.44 0.41 0.35 0.37 1.2 0.98
Dimension 50 Array 9.74 3.09 0.56 18.12 13.93 4.2 4.33 Matrix 81.99 3.81 1.04 19.13 14.58 6.3 7.88 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77
Dimension 500 Array 1.2 8.97 2.03 166.59 20.34 3.99 4.31 Matrix 17.95 9.09 2.07 166.62 20.67 4.11 4.45 Matlab 2.09 6.07 2.17 169.45 2.1 2.56 3.06
Obs: The operation Half is actually A*x using only the lower half of the matrix and vector. The operation 2in2 is A*x using only the even indexes.
Of course there are many repetitions of the same operation: 100000 for dim 5 and 50 and 1000 for dim 500. The inner product is number of repetitions is multiplied by dimension (it is very fast).
The software is
numpy svn version 1926 Matlab 6.5.0.180913a Release 13 (Jun 18 2002)
Both softwares are using the *same* BLAS and LAPACK (ATLAS for sse).
As you can see, numpy array looks very competitive. The matrix class in numpy has too much overhead for small dimension though. This overhead is very small for medium size arrays. Looking at the results above (specially the small dimensions ones, for higher dimensions the main computations are being performed by the same BLAS) I believe we can say:
1) Numpy array is faster on usual operations but outerproduct (I believe the reason is that the dot function uses the regular matrix multiplication to compute outerproducts, instead of using a special function. This can "easily" changes). In particular numpy was faster in matrix times vector operations, which is the most usual in numerical linear algebra.
2) Any operation that involves transpose suffers a very big penalty in numpy. Compare A'*x and A*x, it is 10 times slower. In contrast Matlab deals with transpose quite well. Travis is already aware of this and it can be probably solved.
3) When using subarrays, numpy is a slower. The difference seems acceptable. Travis, can this be improved?
Best,
Paulo
Obs: Latter on (in a couple of days) I may present less synthetic benchmarks (a QR factorization and a Modified Cholesky).
 This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! http://sel.asus.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642 _______________________________________________ Numpydiscussion mailing list Numpydiscussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpydiscussion
Em Qua, 20060118 às 07:56 0600, Bruce Southey escreveu:
Hi, Thanks for doing this as it helps determine which approach to take when coding problems. Could you add the Numeric and numarray to these benchmarks? If for no other reason to show the advantage of the new numpy.
I add a new table with the requested benchmarks below. As you can see, numpy is always faster then numarray but slower than Numeric in indexing. Numeric also seems to suffer less from the transpose phenomenon.
I am curious in your code because you get very different results for matrix class depending on whether x or y is transposed. Do you first transpose the x or y first before the multiplication or is the multiplication done in place by just switching the indices?
I think the transpose problem is that the code makes a copy of the array before calling blas (to make it fit the blas call). Actually if you change the code from dot(transpose(A), b) to M = transpose(A).copy() dot(M, b) the time spend in the operation doesn't change. I am also sending my Python benchmark code attached to this message. Anyone can used as you want (it requires numpy, Numeric and numarray installed).
Also, for x'*y, is the results for Dimension 50 and Dimension 500 switched?
No the inner product results have different number of calls for all dimensions. Hence you can't compare time between dimensions. Paulo  New table  Obs: I have fixed a error in my old code that made the outer product for numpy look bad. The way it was coded it was forcing an extra array copy before the BLAS call. Tests x.T*y x*y.T A*x A*B A.T*x half 2in2 Dimension: 5 Array 0.94 0.25 0.22 0.27 1.08 0.93 1.10 Matrix 6.93 1.56 0.64 0.75 1.64 3.08 4.48 NumArr 2.87 0.63 0.62 0.68 8.24 7.22 1 1.37 Numeri 1.15 0.33 0.29 0.36 0.68 0.61 0.72 Matlab 1.88 0.44 0.41 0.35 0.37 1.20 0.98 Dimension: 50 Array 9.64 2.03 0.57 18.74 13.75 4.09 4.29 Matrix 82.98 3.70 1.04 19.87 14.58 6.35 7.91 NumArr 29.72 2.58 0.95 18.40 12.88 8.50 12.90 Numeri 11.97 2.21 0.61 17.71 9.98 1.04 3.22 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77 Dimension: 500 Array 1.22 8.97 2.02 165.03 19.99 3.96 4.25 Matrix 18.13 9.20 2.01 164.90 20.29 4.06 4.35 NumArr 3.16 9.02 2.09 164.83 21.59 4.29 5.72 Numeri 1.46 8.99 2.03 165.01 19.22 3.24 4.50 Matlab 2.09 6.07 2.17 169.45 2.10 2.56 3.06
Ops... Forgot to send the benchmark code... (I know it is ugly and very synthetic, I'll improve it with time).
Hi,
Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500:
This is really excellent, thanks. Is there any chance we can make these and other benchmarks part of the prerelease testing? Apart from testing for bottlenecks, if we could show that we were in the ballpark of matlab for speed for each release, this would be very helpful for those us trying to persuade our matlab colleagues to switch. Best, Matthew
Matthew Brett wrote:
Hi,
Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500:
This is really excellent, thanks. Is there any chance we can make these and other benchmarks part of the prerelease testing? Apart from testing for bottlenecks, if we could show that we were in the ballpark of matlab for speed for each release, this would be very helpful for those us trying to persuade our matlab colleagues to switch.
+1 It might not be part of test(1), but at test(10) these tests could be automatically run, activating each line as each package is found (or not) in the system (Numeric, numarray, matlab). This way, people who have matlab on their box can even get a realtime check of how their freshoffsvn numpy fares against matlab that day. cheers, f
In scipy, we talked about having a benchmark_xyz methods that could be added to the test classes. These weren't run during unit tests (scipy.test()) but would could be run using scipy.benchmark() or something like that. I can't remember if Pearu got the machinery in place, but it seems to me it wouldn't be so hard. You would have to add guards around benchmarks that compare to 3rd party tools, obviously, so that people without them could still run the benchmark suite. Adding a regression process that checks against results from previous builds to flag potential problems when a slow down is noted would be good  that is more work. Anyway, something flagging these "tests" as benchmarks instead of standard correctness tests seems like a good idea. eric Fernando Perez wrote:
Matthew Brett wrote:
Hi,
Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500:
This is really excellent, thanks. Is there any chance we can make these and other benchmarks part of the prerelease testing? Apart from testing for bottlenecks, if we could show that we were in the ballpark of matlab for speed for each release, this would be very helpful for those us trying to persuade our matlab colleagues to switch.
+1
It might not be part of test(1), but at test(10) these tests could be automatically run, activating each line as each package is found (or not) in the system (Numeric, numarray, matlab). This way, people who have matlab on their box can even get a realtime check of how their freshoffsvn numpy fares against matlab that day.
cheers,
f
 This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! http://sel.asus.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642 _______________________________________________ Numpydiscussion mailing list Numpydiscussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpydiscussion
Em Qua, 20060118 às 15:38 +0000, Matthew Brett escreveu:
This is really excellent, thanks. Is there any chance we can make these and other benchmarks part of the prerelease testing? Apart from testing for bottlenecks, if we could show that we were in the ballpark of matlab for speed for each release, this would be very helpful for those us trying to persuade our matlab colleagues to switch.
Sure. As soon as I add some less synthetic benchmark it can be even more interesting. I'll try to make the code easily callable from anyone that has both numpy and matlab on his own machine. As soon as it is polished enough I'll make it available to this comunity. It may take some weeks though. Best, Paulo
Paulo Jose da Silva e Silva wrote:
Hello,
Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500:
Hi Paulo, Will you run these again with the latest SVN version of numpy. I couldn't figure out why a copy was being made on transpose (because it shouldn't have been). Then, I dug deep into the PyArray_FromAny code and found bad logic in when a copy was needed that was causing an inappropriate copy. I fixed that and now wonder how things will change. Because presumably, the dotblas function should handle the situation now... Travis
Em Qua, 20060118 às 11:15 0700, Travis Oliphant escreveu:
Will you run these again with the latest SVN version of numpy. I couldn't figure out why a copy was being made on transpose (because it shouldn't have been). Then, I dug deep into the PyArray_FromAny code and found bad logic in when a copy was needed that was causing an inappropriate copy.
I fixed that and now wonder how things will change. Because presumably, the dotblas function should handle the situation now...
Good work Travis :) Tests x.T*y x*y.T A*x A*B A.T*x half 2in2 Dimension: 5 Array 0.9000 0.2400 0.2000 0.2600 0.7100 0.9400 1.1600 Matrix 4.7800 1.5700 0.6200 0.7600 1.0600 3.0400 4.6500 NumArr 3.2900 0.7400 0.6800 0.7800 8.4800 7.4200 11.6600 Numeri 1.3300 0.3900 0.3100 0.4200 0.7900 0.6800 0.7600 Matlab 1.88 0.44 0.41 0.35 0.37 1.20 0.98 Dimension: 50 Array 9.0000 2.1400 0.5500 18.9500 1.4100 4.2700 4.4500 Matrix 48.7400 3.9200 1.0100 20.2000 1.8000 6.5000 8.1900 NumArr 32.3900 2.6800 1.0000 18.9700 13.0300 8.6300 13.0700 Numeri 13.1000 2.2600 0.6500 18.2700 10.1500 1.0400 3.2600 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77 Dimension: 500 Array 1.1400 9.2300 2.0100 168.2700 2.1800 4.0200 4.2900 Matrix 5.0300 9.3500 2.1500 167.5300 2.1700 4.1100 4.4200 NumArr 3.4400 9.1000 2.1000 168.7100 21.8400 4.3900 5.8900 Numeri 1.5800 9.2700 2.0700 167.5600 20.0500 3.4000 4.6800 Matlab 2.09 6.07 2.17 169.45 2.10 2.56 3.06 Note the 10fold speedup for higher dimensions :) It looks like that now that numpy only looses to matlab in small dimensions. Probably, the problem is the creation of the object to represent the transposed object. Probably Matlab creation of objects is very lightweight (they only have matrices objects to deal with). Probably this phenomenon explains the behavior for the indexing operations too. Paulo
Paulo J. S. Silva wrote:
Em Qua, 20060118 às 11:15 0700, Travis Oliphant escreveu:
Will you run these again with the latest SVN version of numpy. I couldn't figure out why a copy was being made on transpose (because it shouldn't have been). Then, I dug deep into the PyArray_FromAny code and found bad logic in when a copy was needed that was causing an inappropriate copy.
I fixed that and now wonder how things will change. Because presumably, the dotblas function should handle the situation now...
Good work Travis :)
Tests x.T*y x*y.T A*x A*B A.T*x half 2in2
Dimension: 5 Array 0.9000 0.2400 0.2000 0.2600 0.7100 0.9400 1.1600 Matrix 4.7800 1.5700 0.6200 0.7600 1.0600 3.0400 4.6500 NumArr 3.2900 0.7400 0.6800 0.7800 8.4800 7.4200 11.6600 Numeri 1.3300 0.3900 0.3100 0.4200 0.7900 0.6800 0.7600 Matlab 1.88 0.44 0.41 0.35 0.37 1.20 0.98
Dimension: 50 Array 9.0000 2.1400 0.5500 18.9500 1.4100 4.2700 4.4500 Matrix 48.7400 3.9200 1.0100 20.2000 1.8000 6.5000 8.1900 NumArr 32.3900 2.6800 1.0000 18.9700 13.0300 8.6300 13.0700 Numeri 13.1000 2.2600 0.6500 18.2700 10.1500 1.0400 3.2600 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77
Dimension: 500 Array 1.1400 9.2300 2.0100 168.2700 2.1800 4.0200 4.2900 Matrix 5.0300 9.3500 2.1500 167.5300 2.1700 4.1100 4.4200 NumArr 3.4400 9.1000 2.1000 168.7100 21.8400 4.3900 5.8900 Numeri 1.5800 9.2700 2.0700 167.5600 20.0500 3.4000 4.6800 Matlab 2.09 6.07 2.17 169.45 2.10 2.56 3.06
Note the 10fold speedup for higher dimensions :)
It looks like that now that numpy only looses to matlab in small dimensions. Probably, the problem is the creation of the object to represent the transposed object. Probably Matlab creation of objects is very lightweight (they only have matrices objects to deal with). Probably this phenomenon explains the behavior for the indexing operations too.
Paulo
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices. I'm not familiar enough with what the normal constructor does to know if we could implement something, (in C, perhaps) that would do nothing but create a simple, contiguous array significantly faster than what is currently done. Or does the current constructor create a new instance about as fast as possible? I know Travis has optimized it, but it's a general purpose constructor, and I'm thinking these extra features may take some extra CPU cycles. Cheers! Andrew
Andrew Straw wrote:
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices.
The general purpose constructor is PyArray_NewFromDescr(...) I suspect this could be special cased for certain circumstances and the specialcase called occasionally. Their are checks on the dimensions that could be avoided in certain circumstances (like when we are getting the dimensions from another arrayobject already...) We could also inline the __array_from_strides code... Besides that, I'm not sure what else to optimize... Travis
Travis Oliphant wrote:
Andrew Straw wrote:
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices.
The general purpose constructor is
PyArray_NewFromDescr(...)
I suspect this could be special cased for certain circumstances and the specialcase called occasionally. Their are checks on the dimensions that could be avoided in certain circumstances (like when we are getting the dimensions from another arrayobject already...)
We could also inline the __array_from_strides code...
Besides that, I'm not sure what else to optimize...
Just to give some context: this came to my mind inspired by Blitz++'s TinyVector and TinyMatrix objects. In Blitz, arrays have compiletime rank, but runtime size in all dimensions. Since this introduces some overhead, Blitz offers also the Tiny* classes, which are compiletime fixed _both_ in rank and in size. This allows a number of optimizations to be made on them, at the cost of some flexibility lost. Some info on these guys: http://www.oonumerics.org/blitz/manual/blitz07.html What Andrew and I discussed was the idea of writing some object which would only support the most basic operations: elementwise arithmetic, slicing, linear algebra calls on them (matrixmatrix, matrixvector and vector operations), and little else. I'd be OK losing fancy indexing, byteswapping, memorymapping, reshaping, and anything else which costs either: 1. initializationtime CPU cycles 2. memory footprint 3. runtime element access and arithmetic. Such objects could be very useful in many contexts. I'd even like an immutable version, so they could be used as dictionary keys without having to make a tuple out of them. This would allow algorithms which use small arrays as multidimensional indices in sparse tree structures to be used without the hoops one must jump through today, and with higher performance. I wonder if giving up reshaping would allow the indexing code to be faster, as specialized versions could be hardcoded for each rank, with only say ranks 14 offered for this kind of object (I know we recently had a discussion about large ranks, but this object would be geared towards pure performance, and certainly working in 32 dimensions is a 'flexibilitydriven' case, where the generic objects are called for). Note that I had never mentioned this in public, because I think it may be a slight specialization that isn't needed early on, and currently the library's priority was to get off the ground. But having such objects could be very handy, and now that the C API is starting to stabilize, maybe someone can play with this as a side project. Once they prove their worth, these beasts could be folded as part of the official distribution. I am not really qualified to judge whether there are enough areas for optimization where the sacrifices indicated could really pay off, both in terms of memory and performance. Cheers, f
It's not a new idea. I raised it some time ago and I don't think it was new then either. I have to believe that if you allowed only Float64 (and perhaps a complex variant) and used other restrictions then it would be much faster for small arrays. One would think it would be much easier to implement than Numeric/numarray/numpy... I've always thought that those looking for really fast small array performance would be better served by something like this. But you'd really have to fight off feature creep. ("This almost meets my needs. If it could only do xxx") Perry On Jan 18, 2006, at 5:00 PM, Fernando Perez wrote:
Travis Oliphant wrote:
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices. The general purpose constructor is PyArray_NewFromDescr(...) I suspect this could be special cased for certain circumstances and
Andrew Straw wrote: the specialcase called occasionally. Their are checks on the dimensions that could be avoided in certain circumstances (like when we are getting the dimensions from another arrayobject already...) We could also inline the __array_from_strides code... Besides that, I'm not sure what else to optimize...
Just to give some context: this came to my mind inspired by Blitz++'s TinyVector and TinyMatrix objects. In Blitz, arrays have compiletime rank, but runtime size in all dimensions. Since this introduces some overhead, Blitz offers also the Tiny* classes, which are compiletime fixed _both_ in rank and in size. This allows a number of optimizations to be made on them, at the cost of some flexibility lost. Some info on these guys:
http://www.oonumerics.org/blitz/manual/blitz07.html
What Andrew and I discussed was the idea of writing some object which would only support the most basic operations: elementwise arithmetic, slicing, linear algebra calls on them (matrixmatrix, matrixvector and vector operations), and little else. I'd be OK losing fancy indexing, byteswapping, memorymapping, reshaping, and anything else which costs either:
1. initializationtime CPU cycles 2. memory footprint 3. runtime element access and arithmetic.
Such objects could be very useful in many contexts. I'd even like an immutable version, so they could be used as dictionary keys without having to make a tuple out of them. This would allow algorithms which use small arrays as multidimensional indices in sparse tree structures to be used without the hoops one must jump through today, and with higher performance.
I wonder if giving up reshaping would allow the indexing code to be faster, as specialized versions could be hardcoded for each rank, with only say ranks 14 offered for this kind of object (I know we recently had a discussion about large ranks, but this object would be geared towards pure performance, and certainly working in 32 dimensions is a 'flexibilitydriven' case, where the generic objects are called for).
Note that I had never mentioned this in public, because I think it may be a slight specialization that isn't needed early on, and currently the library's priority was to get off the ground. But having such objects could be very handy, and now that the C API is starting to stabilize, maybe someone can play with this as a side project. Once they prove their worth, these beasts could be folded as part of the official distribution.
I am not really qualified to judge whether there are enough areas for optimization where the sacrifices indicated could really pay off, both in terms of memory and performance.
Cheers,
f
 This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! http://sel.asus.falkag.net/sel? cmd=lnk&kid=103432&bid=230486&dat=121642 _______________________________________________ Numpydiscussion mailing list Numpydiscussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpydiscussion
Perry Greenfield wrote:
It's not a new idea. I raised it some time ago and I don't think it was
I wasn't claiming authorship, sorry if it sounded like that :) In fact, I remember specifically talking with you about this at scipy'03, in the context of small array performance issues for the atthetimenascent numarray, and I'm sure similar things have been done many times before. I've had it floating in my head since I first saw blitz, back in 2001, and blitz probably got it from... There's nothing really new under the sun ;)
new then either. I have to believe that if you allowed only Float64 (and perhaps a complex variant) and used other restrictions then it would be much faster for small arrays. One would think it would be much easier to implement than Numeric/numarray/numpy... I've always thought that those looking for really fast small array performance would be better served by something like this. But you'd really have to fight off feature creep. ("This almost meets my needs. If it could only do xxx")
Couldn't that last issue be well dealt with by the fact that today's numpy is fairly subclassingfriendly? (which, if I remember correctly, wasn't quite the case with at least old Numeric). Cheers, f
On Jan 18, 2006, at 6:21 PM, Fernando Perez wrote:
Perry Greenfield wrote:
It's not a new idea. I raised it some time ago and I don't think it was
I wasn't claiming authorship, sorry if it sounded like that :) In fact, I remember specifically talking with you about this at scipy'03, in the context of small array performance issues for the atthetimenascent numarray, and I'm sure similar things have been done many times before. I've had it floating in my head since I first saw blitz, back in 2001, and blitz probably got it from... There's nothing really new under the sun ;)
Really :). I remember that conversation and wondered if it had something to do with that. (And I remember Paul Dubois talking to me about similar ideas). I think it is worth trying (and has been I see, though I would have expected perhaps even a greater speed improvement; somehow I think it should not take a lot of time if you don't need all the type, shape and striding flexibility). It just needs someone to do it.
new then either. I have to believe that if you allowed only Float64 (and perhaps a complex variant) and used other restrictions then it would be much faster for small arrays. One would think it would be much easier to implement than Numeric/numarray/numpy... I've always thought that those looking for really fast small array performance would be better served by something like this. But you'd really have to fight off feature creep. ("This almost meets my needs. If it could only do xxx")
Couldn't that last issue be well dealt with by the fact that today's numpy is fairly subclassingfriendly? (which, if I remember correctly, wasn't quite the case with at least old Numeric).
Does that help? You aren't talking about the fast array subclassing numpy are you? I'm not sure what you mean here. Perry
Perry Greenfield wrote:
On Jan 18, 2006, at 6:21 PM, Fernando Perez wrote:
Really :). I remember that conversation and wondered if it had something to do with that. (And I remember Paul Dubois talking to me about similar ideas). I think it is worth trying (and has been I see, though I would have expected perhaps even a greater speed improvement; somehow I think it should not take a lot of time if you don't need all the type, shape and striding flexibility). It just needs someone to do it.
Maybe putting David's code into the sandbox would be a good starting point.
new then either. I have to believe that if you allowed only Float64 (and perhaps a complex variant) and used other restrictions then it would be much faster for small arrays. One would think it would be much easier to implement than Numeric/numarray/numpy... I've always thought that those looking for really fast small array performance would be better served by something like this. But you'd really have to fight off feature creep. ("This almost meets my needs. If it could only do xxx")
Couldn't that last issue be well dealt with by the fact that today's numpy is fairly subclassingfriendly? (which, if I remember correctly, wasn't quite the case with at least old Numeric).
Does that help? You aren't talking about the fast array subclassing numpy are you? I'm not sure what you mean here.
What I meant was that by having good subclassing functionality, it's easier to ward off requests for every feature under the sun. It's much easier to say: 'this basic object provides a very small, core set of array features where the focus is on raw speed rather than fancy features; if you need extra features, subclass it and add them yourself' when the subclassing is actually reasonably easy. Note that I haven't actually used array subclassing myself (haven't needed it), so I may be mistaken in my comments here, it's just an intuition. Cheers, f
On Jan 18, 2006, at 8:08 PM, Fernando Perez wrote:
Couldn't that last issue be well dealt with by the fact that today's numpy is fairly subclassingfriendly? (which, if I remember correctly, wasn't quite the case with at least old Numeric). Does that help? You aren't talking about the fast array subclassing numpy are you? I'm not sure what you mean here.
What I meant was that by having good subclassing functionality, it's easier to ward off requests for every feature under the sun. It's much easier to say:
'this basic object provides a very small, core set of array features where the focus is on raw speed rather than fancy features; if you need extra features, subclass it and add them yourself'
when the subclassing is actually reasonably easy. Note that I haven't actually used array subclassing myself (haven't needed it), so I may be mistaken in my comments here, it's just an intuition.
Yeah, that makes sense. I was confused by the reference to numpy. Perry
Fernando Perez <Fernando.Perez@colorado.edu> writes:
Travis Oliphant wrote:
Andrew Straw wrote:
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices. The general purpose constructor is PyArray_NewFromDescr(...) I suspect this could be special cased for certain circumstances and the specialcase called occasionally. Their are checks on the dimensions that could be avoided in certain circumstances (like when we are getting the dimensions from another arrayobject already...) We could also inline the __array_from_strides code... Besides that, I'm not sure what else to optimize...
Just to give some context: this came to my mind inspired by Blitz++'s TinyVector and TinyMatrix objects. In Blitz, arrays have compiletime rank, but runtime size in all dimensions. Since this introduces some overhead, Blitz offers also the Tiny* classes, which are compiletime fixed _both_ in rank and in size. This allows a number of optimizations to be made on them, at the cost of some flexibility lost. Some info on these guys:
http://www.oonumerics.org/blitz/manual/blitz07.html
What Andrew and I discussed was the idea of writing some object which would only support the most basic operations: elementwise arithmetic, slicing, linear algebra calls on them (matrixmatrix, matrixvector and vector operations), and little else. I'd be OK losing fancy indexing, byteswapping, memorymapping, reshaping, and anything else which costs either:
1. initializationtime CPU cycles 2. memory footprint 3. runtime element access and arithmetic.
Such objects could be very useful in many contexts. I'd even like an immutable version, so they could be used as dictionary keys without having to make a tuple out of them. This would allow algorithms which use small arrays as multidimensional indices in sparse tree structures to be used without the hoops one must jump through today, and with higher performance.
I've done a little bit of work along these lines. I have a module I call vector3 [*] which has 2 and 3dimensional immutable vectors, using either ints or doubles. It's as fast as I could make it, while keeping it all written in Pyrex. I find it very convenient for anything vectorrelated. Konrad Hinsen has something similiar in the development version of his ScientificPython package. [*] http://arbutus.mcmaster.ca/dmc/software/vector3.html Also, I've also done some playing around with a ndimensional vector type (restricted to doubles). My best attempts make it ~45x faster than numpy (and 2x faster than Numeric) for vectors of dimension 10 on simple ops like + and *, 2x faster than numpy for dimension 1000, and approaching 1x as you make the vectors larger. Indexing is about 3x faster than numpy, and 1.4x faster than Numeric. So that gives I think some idea of the maximum speedup possible. I think the speedups mostly come from the utter lack of any polymorphism: it handles vectors of doubles only, and only as contiguous vectors (no strides).  >\/< /\ David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/ cookedm@physics.mcmaster.ca
David M. Cooke wrote:
I've done a little bit of work along these lines. I have a module I call vector3 [*] which has 2 and 3dimensional immutable vectors, using either ints or doubles. It's as fast as I could make it, while keeping it all written in Pyrex. I find it very convenient for anything vectorrelated. Konrad Hinsen has something similiar in the development version of his ScientificPython package.
[*] http://arbutus.mcmaster.ca/dmc/software/vector3.html
Also, I've also done some playing around with a ndimensional vector type (restricted to doubles). My best attempts make it ~45x faster than numpy (and 2x faster than Numeric) for vectors of dimension 10 on simple ops like + and *, 2x faster than numpy for dimension 1000, and approaching 1x as you make the vectors larger. Indexing is about 3x faster than numpy, and 1.4x faster than Numeric. So that gives I think some idea of the maximum speedup possible.
I think the speedups mostly come from the utter lack of any polymorphism: it handles vectors of doubles only, and only as contiguous vectors (no strides).
This is excellent, thanks for the pointer. I can see uses for vectors (still 1d, no strides, etc) with more than 3 elements, and perhaps fixedsize (no reshaping, no striding) 2d arrays (matrices), but this looks like a good starting point. Sandbox material? Cheers, f
Fernando Perez wrote:
David M. Cooke wrote:
I've done a little bit of work along these lines. I have a module I call vector3 [*] which has 2 and 3dimensional immutable vectors, using either ints or doubles. It's as fast as I could make it, while keeping it all written in Pyrex. I find it very convenient for anything vectorrelated. Konrad Hinsen has something similiar in the development version of his ScientificPython package.
[*] http://arbutus.mcmaster.ca/dmc/software/vector3.html
Also, I've also done some playing around with a ndimensional vector type (restricted to doubles). My best attempts make it ~45x faster than numpy (and 2x faster than Numeric) for vectors of dimension 10 on simple ops like + and *, 2x faster than numpy for dimension 1000, and approaching 1x as you make the vectors larger. Indexing is about 3x faster than numpy, and 1.4x faster than Numeric. So that gives I think some idea of the maximum speedup possible.
I think the speedups mostly come from the utter lack of any polymorphism: it handles vectors of doubles only, and only as contiguous vectors (no strides).
This is excellent, thanks for the pointer. I can see uses for vectors (still 1d, no strides, etc) with more than 3 elements, and perhaps fixedsize (no reshaping, no striding) 2d arrays (matrices), but this looks like a good starting point. Sandbox material?
With the array interface, these kinds of objects can play very nicely with full ndarray's as well... Travis
Fernando Perez <Fernando.Perez@colorado.edu> writes:
David M. Cooke wrote:
I've done a little bit of work along these lines. I have a module I call vector3 [*] which has 2 and 3dimensional immutable vectors, using either ints or doubles. It's as fast as I could make it, while keeping it all written in Pyrex. I find it very convenient for anything vectorrelated. Konrad Hinsen has something similiar in the development version of his ScientificPython package. [*] http://arbutus.mcmaster.ca/dmc/software/vector3.html Also, I've also done some playing around with a ndimensional vector type (restricted to doubles). My best attempts make it ~45x faster than numpy (and 2x faster than Numeric) for vectors of dimension 10 on simple ops like + and *, 2x faster than numpy for dimension 1000, and approaching 1x as you make the vectors larger. Indexing is about 3x faster than numpy, and 1.4x faster than Numeric. So that gives I think some idea of the maximum speedup possible. I think the speedups mostly come from the utter lack of any polymorphism: it handles vectors of doubles only, and only as contiguous vectors (no strides).
This is excellent, thanks for the pointer. I can see uses for vectors (still 1d, no strides, etc) with more than 3 elements, and perhaps fixedsize (no reshaping, no striding) 2d arrays (matrices), but this looks like a good starting point. Sandbox material?
Well, I'd be pleased to donate vector3 to scipy as a sandbox (although I think it's very useful as a standalone package for others with no need of full scipy. The general vector stuff is still quite rough, and at this point is really only meant as a study on how fast I can make something like that run :)  >\/< /\ David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/ cookedm@physics.mcmaster.ca
Here's an idea Fernando and I have briefly talked about offlist, but which perhaps bears talking about here: Is there speed to be gained by an alternative, very simple, very optimized ndarray constructor? The idea would be a specialcase constructor with very limited functionality designed purely for speed. It wouldn't support (m)any of the fantastic things Travis has done, but would be useful only in specialized use cases, such as creating indices.
I'm not familiar enough with what the normal constructor does to know if we could implement something, (in C, perhaps) that would do nothing but create a simple, contiguous array significantly faster than what is currently done. Or does the current constructor create a new instance about as fast as possible? I know Travis has optimized it, but it's a general purpose constructor, and I'm thinking these extra features may take some extra CPU cycles.
I think the indexing code will be slower because it is more sophisticated than Numeric's. Basically, it has to check for fancy indexing before defaulting to the old way. I see this as more of a slowdown than array creation. It might be possible to improve it  more eyeballs are always helpful. But, I'm not sure how at this point. Travis
A Dimecres 18 Gener 2006 20:55, Travis Oliphant va escriure:
I think the indexing code will be slower because it is more sophisticated than Numeric's. Basically, it has to check for fancy indexing before defaulting to the old way. I see this as more of a slowdown than array creation. It might be possible to improve it  more eyeballs are always helpful. But, I'm not sure how at this point.
I'm sure you already know this: http://oprofile.sourceforge.net It is a nice way to do profiling at C level on Linux machines. Running the Paulo benchmarks through oprofile can surely bring some light. Cheers, 
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
On Thu, 19 Jan 2006, Francesc Altet wrote:
A Dimecres 18 Gener 2006 20:55, Travis Oliphant va escriure:
I think the indexing code will be slower because it is more sophisticated than Numeric's. Basically, it has to check for fancy indexing before defaulting to the old way. I see this as more of a slowdown than array creation. It might be possible to improve it  more eyeballs are always helpful. But, I'm not sure how at this point.
I'm sure you already know this:
http://oprofile.sourceforge.net
It is a nice way to do profiling at C level on Linux machines. Running the Paulo benchmarks through oprofile can surely bring some light.
What might be nice is to use the output of oprofile and stick it into kcachegrind, http://kcachegrind.sourceforge.net/ to get a GUI to the results. I haven't done this, but according to http://linux.com.hk/penguin/man/1/op2calltree.html the tool is `op2calltree`  convert OProfile profiling data to KCachegrind calltree format and further "This utility is part of the KDE Software Development Kit." Best, Arnd
Francesc Altet wrote:
http://oprofile.sourceforge.net
It is a nice way to do profiling at C level on Linux machines. Running the Paulo benchmarks through oprofile can surely bring some light.
I ran the following code snippet (timed under a Timeit instance) through the oprofile profiler for both NumPy and Numeric, to look at indexing speeds. op = "b = A[::2,::2]; d = A[1:80,:]" This is what I found: overall 61242 53.9606 /usr/bin/python 17647 15.5488 /usr/lib/python2.4/sitepackages/numpy/core/multiarray.so 15942 14.0466 /lib/tls/libc2.3.3.so 7158 6.3069 /novmlinux 6995 6.1633 /usr/lib/python2.4/sitepackages/Numeric/_numpy.so Showing that more time is spent in NumPy than in Numeric doing indexing... Here's the breakdown for NumPy samples % symbol name 2353 13.3337 PyArray_PyIntAsIntp # This is also slower  called more often? 2060 11.6734 PyArray_MapIterNew # This calls fancy_indexing_check. 1980 11.2200 slice_GetIndices 1631 9.2424 parse_index 1149 6.5110 arraymapiter_dealloc # Interesting this is taking so long? 1142 6.4714 array_subscript 1121 6.3524 _IsAligned 1069 6.0577 array_dealloc 780 4.4200 fancy_indexing_check 684 3.8760 PyArray_NewFromDescr 627 3.5530 parse_subindex 538 3.0487 PyArray_DescrFromType 534 3.0260 array_subscript_nice 455 2.5783 _IsContiguous 370 2.0967 _IsFortranContiguous 334 1.8927 slice_coerce_index 294 1.6660 PyArray_UpdateFlags 234 1.3260 anonymous symbol from section .plt 161 0.9123 PyArray_Return 120 0.6800 array_alloc 2 0.0113 PyArray_Broadcast 2 0.0113 PyArray_IterNew 1 0.0057 LONG_setitem 1 0.0057 PyArray_EquivTypes 1 0.0057 PyArray_FromAny 1 0.0057 PyArray_FromStructInterface 1 0.0057 PyArray_IntpConverter 1 0.0057 PyArray_SetNumericOps 1 0.0057 initialize_numeric_types Here's the breakdown for Numeric: 1577 22.5447 slice_GetIndices 1155 16.5118 parse_index 912 13.0379 PyArray_FromDimsAndDataAndDescr 792 11.3224 array_subscript 675 9.6497 PyArray_IntegerAsInt 517 7.3910 parse_subindex 401 5.7327 array_dealloc 379 5.4182 slice_coerce_index 339 4.8463 array_subscript_nice 161 2.3016 anonymous symbol from section .plt 82 1.1723 PyArray_Return 5 0.0715 do_sliced_copy Anybody interested in optimization? Travis
Travis Oliphant wrote:
This is what I found: overall
61242 53.9606 /usr/bin/python 17647 15.5488 /usr/lib/python2.4/sitepackages/numpy/core/multiarray.so 15942 14.0466 /lib/tls/libc2.3.3.so 7158 6.3069 /novmlinux 6995 6.1633 /usr/lib/python2.4/sitepackages/Numeric/_numpy.so
Showing that more time is spent in NumPy than in Numeric doing indexing...
I optimized PyArray_PyIntAsIntp and changed things so that extra allocations are not done if the indexing is not fancy in the new svn of numpy and improved the performance of indexing a bit... The new numbers... samples % symbol name 551 18.4775 slice_GetIndices 482 16.1636 array_subscript 343 11.5023 parse_index 309 10.3622 PyArray_PyIntAsIntp 166 5.5667 fancy_indexing_check 164 5.4997 _IsAligned 140 4.6948 array_dealloc 134 4.4936 parse_subindex 133 4.4601 anonymous symbol from section .plt 127 4.2589 PyArray_NewFromDescr 121 4.0577 slice_coerce_index 79 2.6492 _IsFortranContiguous 67 2.2468 array_subscript_nice 56 1.8779 _IsContiguous 49 1.6432 array_alloc 31 1.0396 PyArray_UpdateFlags 29 0.9725 PyArray_Return 1 0.0335 array_itemsize_get For reference, the Numeric numbers are: samples % symbol name 375 25.9695 slice_GetIndices 255 17.6593 parse_index 198 13.7119 PyArray_IntegerAsInt 135 9.3490 array_subscript 130 9.0028 PyArray_FromDimsAndDataAndDescr 117 8.1025 parse_subindex 81 5.6094 slice_coerce_index 55 3.8089 array_subscript_nice 44 3.0471 array_dealloc 40 2.7701 anonymous symbol from section .plt 13 0.9003 PyArray_Return 1 0.0693 do_sliced_copy These look a bit better and the results show that simple indexing seems to be only slowed down by the fancy indexing check... Travis
Hi, I don't know if people remember this thread, but my memory was that there might be some interest in including numpy / matlab benchmark code somewhere in the distribution, to check on performance and allow matlab people to do a direct comparison. Is this still of interest? If so, what should be the next step? Thanks a lot, Matthew  Forwarded message  From: Paulo Jose da Silva e Silva <pjssilva@ime.usp.br> Date: Jan 18, 2006 11:44 AM Subject: [Numpydiscussion] Numpy x Matlab: some synthetic benchmarks To: numpydiscussion <numpydiscussion@lists.sourceforge.net> Hello, Travis asked me to benchmark numpy versus matlab in some basic linear algebra operations. Here are the resuts for matrices/vectors of dimensions 5, 50 and 500: Operation x'*y x*y' A*x A*B A'*x Half 2in2 Dimension 5 Array 0.94 0.7 0.22 0.28 1.12 0.98 1.1 Matrix 7.06 1.57 0.66 0.79 1.6 3.11 4.56 Matlab 1.88 0.44 0.41 0.35 0.37 1.2 0.98 Dimension 50 Array 9.74 3.09 0.56 18.12 13.93 4.2 4.33 Matrix 81.99 3.81 1.04 19.13 14.58 6.3 7.88 Matlab 16.98 1.94 1.07 17.86 0.73 1.57 1.77 Dimension 500 Array 1.2 8.97 2.03 166.59 20.34 3.99 4.31 Matrix 17.95 9.09 2.07 166.62 20.67 4.11 4.45 Matlab 2.09 6.07 2.17 169.45 2.1 2.56 3.06 Obs: The operation Half is actually A*x using only the lower half of the matrix and vector. The operation 2in2 is A*x using only the even indexes. Of course there are many repetitions of the same operation: 100000 for dim 5 and 50 and 1000 for dim 500. The inner product is number of repetitions is multiplied by dimension (it is very fast). The software is numpy svn version 1926 Matlab 6.5.0.180913a Release 13 (Jun 18 2002) Both softwares are using the *same* BLAS and LAPACK (ATLAS for sse). As you can see, numpy array looks very competitive. The matrix class in numpy has too much overhead for small dimension though. This overhead is very small for medium size arrays. Looking at the results above (specially the small dimensions ones, for higher dimensions the main computations are being performed by the same BLAS) I believe we can say: 1) Numpy array is faster on usual operations but outerproduct (I believe the reason is that the dot function uses the regular matrix multiplication to compute outerproducts, instead of using a special function. This can "easily" changes). In particular numpy was faster in matrix times vector operations, which is the most usual in numerical linear algebra. 2) Any operation that involves transpose suffers a very big penalty in numpy. Compare A'*x and A*x, it is 10 times slower. In contrast Matlab deals with transpose quite well. Travis is already aware of this and it can be probably solved. 3) When using subarrays, numpy is a slower. The difference seems acceptable. Travis, can this be improved? Best, Paulo Obs: Latter on (in a couple of days) I may present less synthetic benchmarks (a QR factorization and a Modified Cholesky).  This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! http://sel.asus.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642 _______________________________________________ Numpydiscussion mailing list Numpydiscussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpydiscussion
participants (13)

Andrew Straw

Arnd Baecker

Bruce Southey

cookedm＠physics.mcmaster.ca

eric jones

Fernando Perez

Francesc Altet

Matthew Brett

Paulo J. S. Silva

Paulo Jose da Silva e Silva

Perry Greenfield

Travis Oliphant

Travis Oliphant