numarray interface and performance issues (for dot product and transpose)
Hi,

Numeric is an impressively powerful and in many respects easy and comfortable package to use (e.g. its sophisticated slicing operations, not to mention the power and elegance of the underlying python language), and one would hope that it can one day replace Matlab (which is both expensive and a nightmare as a programming language) as a standard platform for numerical calculations. Two important drawbacks of Numeric (python) compared to Matlab currently seem to be the plotting functionality and the availability of specialist libraries (say, for pattern recognition problems etc.). However, python seems to be quickly catching up in this area (e.g. scipy and other projects are bustling with activity and, as far as I know, there is currently a promising undergraduate project under way at my university to provide powerful plotting facilities for python).

There is however a problem that, for the use to which I want to put Numeric, runs deeper and gives me quite a headache: two essential matrix operations, matrix multiplication and transposition (which are what I mainly use), are both considerably a) less efficient and b) less notationally elegant under Numeric than under Matlab.

Before I expound on that, I'd better mention two things. Firstly, I have to confess to being largely ignorant about the gritty details of implementing efficient numerical computation (and indeed, I can't even claim a good grasp of the underlying linear algebra). Secondly, and partly as a consequence, I realize that I'm likely to view this from a rather narrow perspective. Nonetheless, I need to solve these problems at least for myself, even if such a solution might be inappropriate for a more general audience, so I'd be very keen to hear suggestions and experiences from other Numeric users. I also want to share what my PhD supervisor, Richard Everson, and I have come up with so far regarding a), as I think it might be quite useful for other people too (see [2], more about this later).

OK, so now let me state clearly what I mean by a) and b). The following Matlab fragment

    M * (C' * C) * V' * u

currently has to be written like this in python:

    dot(dot(dot(M, dot(transpose(C), C)), transpose(V)), u)

or, even worse, if one doesn't want to pollute the namespace:

    Numeric.dot(Numeric.dot(Numeric.dot(M, Numeric.dot(Numeric.transpose(C), C)), Numeric.transpose(V)), u)
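A small helper along these lines can at least flatten the nesting somewhat. This is only an illustrative sketch (the name `mdot` is made up here and is not part of Numeric), and it relies on Python 2's built-in `reduce`:

```python
import Numeric

def mdot(*arrays):
    # Left-fold Numeric.dot over the arguments:
    # mdot(A, B, C) is dot(dot(A, B), C).
    return reduce(Numeric.dot, arrays)

# The Matlab fragment  M * (C' * C) * V' * u  then reads
#   mdot(M, mdot(Numeric.transpose(C), C), Numeric.transpose(V), u)
# -- still not pretty, but the nesting no longer grows with the length
# of the product chain.
```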
Part a) (efficiency)
--------------------

Apart from the syntactic inconveniences, the Matlab above will execute __considerably__ faster if C is a rather large matrix (say 1000x1000). There are, AFAIK, two chief reasons for this:

1. Matlab uses a highly optimized ATLAS [3] BLAS routine to compute the dot product, whereas Numeric uses a straightforward home-brewed solution.

2. Numeric performs unnecessary transpose operations (prior to 20.3, I think; more about this later). The transpose operation is really damaging with big matrices, because it creates a complete copy rather than trying to do something lazy (if your memory is already almost half filled up with the matrix C, then creating an in-principle superfluous transposed copy is not going to do you any good).

The above C' * C actually creates, AFAIK, _3_ versions of C, 2 of them transposed (prior to 20.3, dot(a, b) translates into innerproduct(a, swapaxes(b, -1, -2))). In newer versions of Numeric this is replaced by multiarray.matrixproduct(a, b), which has the considerable advantage that it doesn't create an unnecessary copy, and the considerable disadvantage that it seems to be a factor of 3 or so slower than the old (already not blazingly fast) version for large matrix x matrix multiplication (see timing results [1]).

Now fortunately, we have made some significant progress on the performance issues. My supervisor valiantly patched Numeric's multiarrayobject.c to use ATLAS for {scalar,vector,matrix} * {scalar,vector,matrix} multiplications. As a consequence, the multiplication of a pair of 1000x1000 matrices is computed more than _40_ times faster on my athlon machine (see [1]; it seems to work fine, but both the timing results and the code should be treated with a bit of caution). The BLAS-enabled dot takes almost exactly the same time as Matlab doing the same thing. In my case, this makes the difference between waiting several minutes and a day, and between deciding to use python or something faster but more painful.

Although to benefit from this patch one has to install ATLAS, this is straightforward (see [2] and [3] for pointers). In addition, it is easy to install LAPACK at the same time, as ATLAS provides significantly optimised versions of some of the LAPACK routines too. Using LAPACK rather than the lapack_lite that comes with Numeric also has the advantage that it is extensively tested and works very reliably (e.g. the lapack_lite routine that is called by Heigenvalues is pretty broken, and searching the net suggests that this has been the case for some time now).

This still leaves point 2, transposes, as a potential source of improvement. As far as I understand, the LAPACK routines operate on flattened arrays, i.e. the only difference between calling some routine on a matrix and on its transpose is that one has to pass a different additional parameter in the second case specifying the structure of the matrix. Therefore, if one were to program C' * C in Fortran, only C itself would need to exist in memory. Would it therefore be possible for operations like transpose, slicing etc. to be carried out lazily (i.e. so that an actual copy is only made when the result of the transpose or slice is modified, or is passed to some function that for some reason needs a real copy)? As far as I understand, slicing in numarray is no longer simple aliasing (which is a good thing, since it makes arrays more compatible with lists) -- are there any plans for implementing such an on-demand scheme?
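To make the "transpose is just a parameter" point concrete, here is a sketch using today's SciPy BLAS wrappers (scipy.linalg.blas did not exist in this form when this thread was written); the idea is the same for a direct C call into ATLAS's dgemm. The trans_a flag tells GEMM to treat C as transposed, so no transposed copy is ever built:

```python
import numpy as np
from scipy.linalg.blas import dgemm

C = np.asfortranarray(np.random.rand(1000, 1000))  # Fortran order suits the BLAS

# C' * C without ever materialising transpose(C):
# dgemm computes alpha * op(A) * op(B); trans_a=1 makes op(A) = A^T.
CtC = dgemm(alpha=1.0, a=C, b=C, trans_a=1)

assert np.allclose(CtC, np.dot(C.T, C))
```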
Part b) (syntax)
________________

As I said,

    dot(dot(dot(M, dot(transpose(C), C)), transpose(V)), u)

is pretty obscure compared to

    M * (C' * C) * V' * u

Although linear algebra is not the be-all and end-all of numerical calculations, many numerical algorithms end up involving a good deal of it. The awkwardness of python/Numeric's notation for common linear algebra operations severely impairs, in my opinion, the utility of writing, using and debugging algorithms in python/Numeric. This is a particular shame given python's generally elegant and expressive syntax.

So far, I've thought of the following possible solutions:

0. Do nothing: just live with the awkward syntax.
1. Comment: add a comment (like the Matlab line above) above complex expressions.
2. Wait for a fix: hope that either
   2.1. numarray will overload '*' as matrix multiplication rather than elementwise multiplication (highly unlikely, I suspect), or
   2.2. the core python team accepts one of the proposed operator additions (or even, fond fantasy, allows users to add new operators).
3. Wrap: create a DotMatrix class that overloads '*' to be dot and maybe self.t to return the transpose -- this also means that all the numerical libraries I frequently use need to be wrapped.
4. Preprocess: write some preprocessor that translates e.g. A ~* B into dot(A, B).
5. Hack: create my own customized version of array that overloads e.g. '<<' to mean the dot product.

The possible downsides:

0. Do nothing: seems unacceptable (except as a temporary solution), because the code becomes much less readable (and consequently harder to understand and more error-prone), thus forgoing one of the main reasons to choose python in the first place.
1. Comment: unsatisfactory, because there is no way to guarantee that comment and code stay in sync, which will very likely lead to difficult-to-find bugs.
2. Wait for a fix: either would be nice (I could live with having to write multiply(a, b) -- I suppose however some other people can't, because I can't think of another reason why Numeric didn't overload * for matrixproduct in the first place; moreover it would mean a significant interface change (and code breakage), so I guess it is rather unlikely). From what I gather from searching the net and looking at the PEPs, there is also not much of a chance of getting a new operator anytime soon.
3. Wrap: promises to be a maintenance nightmare (BTW: will the new array class be subclassable?), but otherwise looks feasible. Has anyone done this?
4. Preprocess: would have the advantage that I could always get back to "standard" python/Numeric code, but would involve quite a bit of work and might also break tools that parse python.
5. Hack array: seems evil, but in principle feasible, because AFAIK '<<' isn't used in Numeric itself and hopefully it wouldn't involve too much work. However, given a free choice, '<<' is hardly the operator I would choose to represent the dot product.

I am not completely sure what the policy and rationale behind the current division of labor between functions, methods and attributes in numarray is, but as far as the missing transposition postfix operator is concerned, one reasonable way to make the transpose operation more readable (without the need to change anything in python itself) would seem to be to provide arrays with an attribute .t or .T, so that

    a.t == transpose(a)

While this would admittedly be a bit of a syntactic hack, the operation is so commonplace and the gain in readability (in my eyes) so significant that it would seem to merit serious consideration (maybe a.cc or so might also be an idea for complex conjugate notation, but I'm less convinced about that). If the addition of a single operator for the exclusive benefit of Numeric users is rejected by the core python team, maybe it's worthwhile lobbying for some mechanism that allows users to define new operators (like e.g. in Haskell)...
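For what it's worth, the .t attribute proposed above can be prototyped in a few lines by subclassing, assuming (as asked above) that the array class is subclassable. The sketch below uses today's NumPy, whose arrays in fact already expose exactly this as .T; the class name is made up purely for illustration:

```python
import numpy as np

class TArray(np.ndarray):
    """Array with a short transpose attribute, as proposed above (illustration only)."""
    @property
    def t(self):
        return np.transpose(self)   # a view onto the same data; nothing is copied

a = np.arange(6).reshape(2, 3).view(TArray)
print(a.t.shape)   # (3, 2)
```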
OK, that's all -- thanks for bearing with me through this long email. Suggestions and comments about the patch [2] and possible solutions to the issues raised are greatly appreciated.

regards,

alexander schmolck

Footnotes:

[1] Timing results for the patched version of Numeric [2], comparing new and old 'dot' performance: http://www.dcs.ex.ac.uk/~aschmolc/Numeric/TimingsForAtlasDot.txt

[2] A patch for Numeric 21.1b that speeds up Numeric's dot function by several factors for large matrices can be found here: http://www.dcs.ex.ac.uk/~aschmolc/Numeric/

[3] ATLAS (http://math-atlas.sourceforge.net/) is a project to provide a highly optimized version of BLAS (Basic Linear Algebra Subroutines, a standard and thoroughly tested implementation of basic linear algebra operations) and some LAPACK routines. The charm of ATLAS is that it is platform-independent and yet highly optimized, which it achieves by essentially fine-tuning a number of parameters until optimum performance is reached for the _particular_ machine on which it is built. As a consequence, complete builds can take some time, but binary versions of ATLAS for common processors are available from http://www.netlib.org/atlas/archives (moreover, even if one decides to build ATLAS oneself, the search space can be considerably cut down by accepting the suggested "experience" values offered during the make process).

--
Alexander Schmolck
Postgraduate Research Student
Department of Computer Science
University of Exeter
A.Schmolck@gmx.net
http://www.dcs.ex.ac.uk/people/aschmolc/
A.Schmolck writes:
Your comments about efficiency are well-taken. I have (in a previous life) done work on efficient (in terms of virtual memory access / paging behavior) transposes of large arrays. (Divide and conquer). Anyhow - if there were support for the operation of A*B' (and A'*B) at the C level, you wouldn't need to ever actually have a copy of the transposed array in memory - you would just exchange the roles of "i" and "j" in the computation...
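A toy pure-Python version of that index swap might look like the following (purely to show the idea -- a real implementation would of course live in C and block the loops for cache). The j index simply walks over rows of B, so B' is never formed:

```python
import numpy as np

def matmul_bt(A, B):
    """Compute A * B' by reading B[j, k] where B'[k, j] would be needed -- no transposed copy."""
    n, m = A.shape            # A is n x m
    p = B.shape[0]            # B is p x m, so A * B' is n x p
    out = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            out[i, j] = sum(A[i, k] * B[j, k] for k in range(m))
    return out

A = np.random.rand(3, 4)
B = np.random.rand(5, 4)
assert np.allclose(matmul_bt(A, B), np.dot(A, B.T))
```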
I guess you haven't yet stumbled across the Matrix.py that comes with Numeric - it overrides "*" to be the dot-product. Unfortunately I don't see a really easy way to simplify the Transpose operator - at the very least you could do T = Numeric.transpose and then you're just writing T(A) instead of the long-winded version. Interestingly, the "~" operator is available, but it calls the function "__invert__". I guess it would be too weird to have ~A denote the transpose? Right now you get an error - one could set things up so that ~A was the matrix inverse of A, but we already have the A**-1 notation (among others) for that...
Hi Alexander, [SNIP]
[Interesting stuff about notation and efficiency]
I compromise and use np.dot, etc. myself, but that's not really relevant to the issue at hand.

[More snippage]

> 2. Numeric performs unnecessary transpose operations (prior to 20.3, I think, [...])

I think you're a little off track here. The transpose operation doesn't normally make a copy; it just creates a new object that points to the same data, but with different stride values. So the transpose shouldn't be slow or take up more space. Numarray may well make a copy on transpose -- I haven't looked into that -- but I assume that at this point you are still talking about the old Numeric, from the look of the code you posted.

> [...] unnecessary [...]

Like I said, I don't think either of these should be making an extra copy unless it's happening inside multiarray.innerproduct or multiarray.matrixproduct. I haven't looked at the code for those in a _long_ time, and then only glancingly, so I have no idea about that.

[Faster! with Atlas]

Sounds very cool.
Of the options that don't require new operators, I'm somewhat fond of defining __call__ to be matrix multiply. If you toss in the .t notation that you mention below, you get:

    (M)( (C.t)(C) ) (V.t)(u)

Not perfect, but not too bad either. Note that I've tossed in some extra parentheses to make the above look better. It could actually be written:

    M( C.t(C) )(V.t)(u)

But I think that's more confusing, as it looks too much like a function call. (Although there is some mathematical precedent for viewing matrix multiplication as a function.)

I'm a little iffy on the '.t' notation as it could get out of hand. Personally, I could use conjugate as much as transpose, and it's a similar operation -- would we also add '.c'? And possibly '.s' and '.h' for skew and Hermitian matrices? That might be a little much.

The __call__ idea was not particularly popular last time, but I figured I'd toss it out again as an easy-to-implement possibility.

-tim
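A minimal sketch of the __call__ idea in today's NumPy (the class name, the arr helper and the .t property are all made up here, purely for illustration):

```python
import numpy as np

class CallMul(np.ndarray):
    """a(b) means the matrix product a . b; .t is the transpose (illustration only)."""
    def __call__(self, other):
        return np.dot(self, other).view(CallMul)
    @property
    def t(self):
        return np.transpose(self).view(CallMul)

def arr(x):
    return np.asarray(x, dtype=float).view(CallMul)

M = arr(np.eye(2))
C = arr([[1., 2.], [3., 4.]])
V = arr(np.random.rand(3, 2))
u = arr(np.random.rand(3, 1))

# M * (C' * C) * V' * u in the proposed notation:
result = M(C.t(C))(V.t)(u)
assert np.allclose(result, np.dot(np.dot(np.dot(M, np.dot(C.T, C)), V.T), u))
```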
On Thu, 28 Feb 2002, Tim Hochberg wrote:
Numeric.transpose quickly returns an object which takes up little space. But, in many cases, when you actually use the object returned, a contiguous copy gets created. Glancing over the 21.0b3 sources, it looks like this might not happen as often as it used to, but there are still plenty of calls to PyArray_ContiguousFromObject in there, so transposition is not always as cheap as it might seem. Especially if you say Aprime=Numeric.transpose(A) and then use Aprime several times, you could end up repeatedly creating and discarding temporary transposed copies. Warren Focke
Hi, On 28 Feb 2002, A.Schmolck wrote:
Let me add a sub-solution here:

0.1 Wait for scipy to mature (or, better yet, help it to do that).

Scipy already provides wrappers to both the Fortran and C LAPACK and BLAS routines, though currently they are under revision. With the new wrappers to these routines you can optimize your code fragments as flexibly as if you were using them from C or Fortran. In principle, one could mix Fortran and C routines (i.e. the corresponding wrappers) so as to avoid all unnecessary transpositions. All matrix operations can be performed in situ if so desired.

Regards,
Pearu
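As an illustration of what such wrappers look like today (the API below is modern SciPy, not the 2002 wrappers Pearu is describing): scipy.linalg.get_blas_funcs picks the BLAS routine matching the arrays' type, the transpose is again just a flag, and a pre-allocated output buffer can be reused.

```python
import numpy as np
from scipy.linalg import get_blas_funcs

a = np.random.rand(200, 100)
c = np.asfortranarray(np.empty((100, 100)))      # pre-allocated output buffer

# Pick the gemm matching the arrays' dtype (dgemm here; zgemm for complex input).
gemm, = get_blas_funcs(('gemm',), (a, c))

# c <- 1.0 * a' * a + 0.0 * c; with overwrite_c the existing buffer can be reused,
# which is the kind of "in situ" operation meant above.  trans_a=True means no
# transposed copy of a is ever made.
out = gemm(alpha=1.0, a=a, b=a, trans_a=True, beta=0.0, c=c, overwrite_c=True)
assert np.allclose(out, np.dot(a.T, a))
```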
On 28 Feb 2002, A.Schmolck wrote:
You are not alone in your concerns. The developers of SciPy are quite concerned about speed, hence the required linking to ATLAS. As Pearu mentioned, all of the BLAS will be available (much of it already is). This will enable very efficient algorithms.

The question of notational elegance is stickier, because we just can't add new operators. The solution I see is to use other classes. Right now, the Numeric array is an array of numbers (it is not a vector or a matrix), and that is why it has the operations it does. The Matrix class (delivered with Numeric) creates a Matrix object that is backed by a Numeric array of numbers. It overloads the * operator and defines .T and .H for the transpose and Hermitian transpose respectively. This requires explicitly making your objects matrices (not a bad thing in my book, as not all 2-D arrays fit perfectly into matrix algebra).
The following Matlab fragment
M * (C' * C) * V' * u
This becomes (using SciPy, which defines Mat = Matrix.Matrix and could later redefine it to use the ATLAS libraries for matrix multiplication):

    C, V, u, M = apply(Mat, (C, V, u, M))
    M * (C.H * C) * V.H * M

Not bad. And with a Mat class that uses the ATLAS BLAS (not a very hard thing to do now), this could be made as fast as MATLAB.

Perhaps, as a start, we could look at how to make the current Numeric use the BLAS, if it is installed, to do dot on real and complex arrays (I know you can get rid of lapack_lite and use your own LAPACK), but the dot function is defined in multiarray and would have to be modified to use the BLAS instead of its own homegrown algorithm.

-Travis
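Travis's Mat idea, sketched in today's NumPy (this is an illustrative wrapper written for this note, not the actual Matrix.py shipped with Numeric; modern NumPy's '@' operator has since made such wrappers largely unnecessary, and the BLAS-backed dot is implicit):

```python
import numpy as np

class Mat(np.ndarray):
    """'*' is the matrix product, .T the transpose, .H the Hermitian transpose."""
    def __new__(cls, data):
        return np.asarray(data).view(cls)
    def __mul__(self, other):
        return np.dot(self, np.asarray(other)).view(Mat)
    @property
    def T(self):
        return np.transpose(self).view(Mat)
    @property
    def H(self):
        return np.conjugate(np.transpose(self)).view(Mat)

M = Mat(np.eye(3))
C = Mat(np.random.rand(4, 3))
V = Mat(np.random.rand(5, 3))
u = Mat(np.random.rand(5, 1))

result = M * (C.H * C) * V.H * u        # reads much like the Matlab original
assert np.allclose(result,
                   np.dot(np.dot(np.dot(M, np.dot(C.conj().T, C)), V.conj().T), u))
```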
Hi, On 28 Feb 2002, Travis Oliphant wrote:
> You are not alone in your concerns. The developers of SciPy are quite concerned about speed, hence the required linking to ATLAS.
>
> The question of notational elegance is stickier because we just can't add new operators.
>
> The following Matlab fragment: M * (C' * C) * V' * u
>
> C, V, u, M = apply(Mat, (C, V, u, M))
> M * (C.H * C) * V.H * M
Yes, much better.
This is precisely what Alex and I have done. Please see the patch to Numeric and timings on http://www.dcs.ex.ac.uk/~aschmolc/Numeric/ It's not beautiful but about 40 times faster on 1000 by 1000 matrix multiplies. I'll attempt to provide a similar patch for numarray over the next week or so. Many thanks for your comments. Richard.
participants (7)
- a.schmolck@gmx.net
- Charles G Waldman
- Pearu Peterson
- R.M.Everson@exeter.ac.uk
- Tim Hochberg
- Travis Oliphant
- Warren Focke