[Numpy-discussion] numarray interface and performance issues (for dot product and transpose)

28 Feb 2002

      Hi,

Numeric is an impressively powerful and in many respects easy and
comfortable to use package (e.g. it's sophisticated slicing
operations, not to mention the power and elegance of the underlying
python language) and one would hope that it can one day replace Matlab
(which is both expensive and a nightmare as a programming language) as
a standard platform for numerical calculations.

Two important drawbacks of Numeric (python) compared to Matlab
currently seem to be the plotting functionality and the availability
of specialist libraries (say for pattern recognition problems etc.).
However python seems to be quickly catching up in this area (e.g.
scipy and other projects are bustling with activity and, as far as I
know, there is currently a promising undergraduate project on the way
at my university to provide powerful plotting facilities for python).

There is however a problem that, for the use to which I want to put Numeric,
runs deeper and provides me with quite a headache:

Two essential matrix operations (matrix-multiplication and transposition
(which is what I am mainly using) are both considerably

a) less efficient and
b) less notationally elegant

under Numeric than under Matlab. Before I expound on that, I'd better mention
two things: Firstly, I, have to confess largely ignorant about the gritty
details of implementing efficient numerical computation (and indeed, I can't
even claim to possess a good grasp of the underlying Linear
Algebra). Secondly, and partly as a consequence, I realize that I'm likely to
view this from a rather narrow perspective. Nonetheless, I need to solve these
problems at least for myself, even if such a solution might be inappropriate
for a more general audience, so I'd be very keen to hear what suggestions or
experiences from other Numeric users are. I also want to share what my phd
supervisor, Richard Everson, and I have come up with so far in terms of a), as
I think it might be quite useful for other people, too (see [2], more about
this later).

Ok, so now let me proceed to state clearly what I mean by a) and b):

The following Matlab fragment

  M * (C' * C) * V' * u

currently has to be written like this in python:

  dot(dot(dot(M, dot(transpose(C), C)), transpose(v)), u)

Or, even worse if one doesn't want to pollute the namespace:

  Numeric.dot(Numeric.dot(Numeric.dot(Numeric.M,
              Numeric.dot(Numeric.transpose(C), C)), Numeric.transpose(v)), u)

Part a) (efficiency)
--------------------

Apart from the syntactic inconveniences, the Matlab above will execute
__considerably__ faster if C is a rather larger matrix (say
1000x1000). There are AFAIK two chief reasons for this:

1. Matlab uses a highly optimized ATLAS[3] BLAS routine to compute the
   dot-product, whereas Numeric uses a straight-forward home-brewn solution.

2. Numeric performs unnecessary transpose operations (prior to 20.3, I think,
   more about this later). The transpose operation is really damaging with big
   matrices, because it creates a complete copy, rather than trying to do
   something lazy (if your memory is already almost half filled up with
   (matrix) C, then creating a (in principle superfluous) transposed copy is
   not going to do you any good). The above C' * C actually creates, AFAIK,
   _3_ versions of C, 2 of them transposed (prior to 20.3;

     dot(a,b)

   translates into

     innerproduct(a, swapaxes(b, -1, -2))

   In newer versions of Numeric, this is replaced by

     multiarray.matrixproduct(a, b)

   which has the considerable advantage that it doesn't create an unnecessary
   copy and the considerable disadvantage that it seems to be factor 3 or so
   slower than the old (already not blazingly fast) version for large Matrix x
   Matrix multiplication, (see timing results [1])).

Now fortunately, we made some significant progress on the performance issues.
My supervisor valiantly patched Numeric's multiarrayobject.c to use ATLAS for
{scalar,vector,matrix} * {scalar,vector,matrix} multiplications.  As a
consequence the multiplication of pair of 1000x1000 matrices is computed more
than _40_ times faster on my athlon machine (see [1], it seems to work fine
but both the timing results as well as the code should be treated with a bit
of caution). The BLAS-enabled dot takes almost exactly the same time as Matlab
doing the same thing.  In my case, this makes the difference between waiting
several minutes and a day, and between deciding to use python or something
faster but more painful.

Although to benefit from this patch, one has to install ATLAS this is
straight-forward (see [2] and [3] for pointers).  In addition it is easy to
install Lapack at the same time, as ATLAS provides significantly optimised
versions of some of the Lapack routines too. Using Lapack rather than the
lapack_lite that comes with Numeric also has the advantage that it is
extensively tested and works very reliably (e.g. the lapack_lite routine that
is called by Heigenvalues is pretty broken and searching the net suggests that
this has been the case for some time now).

This still leaves point 2), transposes, as a potential source for
improvement. As far as I understand, the Lapack routines operate on flattened
arrays, i.e. the only difference on calling some routine on a matrix and its
transpose is that one has to pass a different additional parameters in the
second case that specifies the structure of the matrix. Therefore if one were
to program C' * C in fortran, only C itself needed to exist in memory. Would
it therefore be possible that operations like transpose, slicing etc. be
carried out lazily (i.e. so that only _modifying_ the result of the transpose
or slice or passing it to some function for some reason needs a real copy
resulted in an actual copy operation being performed)? As far as I understand
the slicing in numarray is no longer simple aliasing (which is a good thing,
since it make arrays more compatible with lists), are there any plans for
implementing such an on-demand scheme?

Part b) (syntax)
________________

As I said,

  dot(dot(dot(M, dot(transpose(C), C)), transpose(v)), u)

is pretty obscure compared to

  M * (C' * C) * V' * u)

Although linear algebra is not the be all and end all of numerical
calculations, many numerical algorithms end up involving a good deal of linear
algebra. The awkwardness of python/Numeric's notation for common linear
algebra operation severely impairs, in my opinion, the utility of writing,
using and debugging algorithms in python/Numeric.  This is a particular shame
given python's generally elegant and expresssive syntax.

So far, I've thought of the following possible solutions:

0. Do nothing:
   Just live with the awkward syntax.

1. Comment:
   add a comment like the Matlab line above above complex expressions.

2. Wait for fix. Hope that either:

  2.1. numarray will overload '*' as matrix multiplication,
       rather than elementwise multiplication (highly unlikely, I suspect).
  2.2. the core python team accepts one of the proposed operator additions (or
       even, fond phantasy, allows users to add new operators)

3. Wrap: create a DotMatrix class that overloads '*' to be dot and maybe
   self.t to return the transpose -- this also means that all the numerical
   libraries I frequently use need to be wrapped.

4. Preprocess:
   Write some preprocessor that translates e.g. A ~* B in dot(A,B).

5. Hack: create own customized version of array that overloads e.g. '<<' to
   mean dot-product.

The possible downsides:

0. Do nothing:
   Seems unacceptable (unless as a temporary solution), because the code
   becomes much less readable (and consequently harder to understand and more
   error-prone), thus forgoing one of the main reasons to choose python in the
   first place.

1. Comment:
   Unsatisfactory, because there is no way to gurantee that comment and code
   are in synch, which very likely will lead to difficult to find bugs.

2. Wait for a fix: Either would be nice (I could live with having to write
   multiply(a,b) -- I suppose however some other people can't, because I can't
   think of another reason why Numeric didn't overload * for matrixproduct in
   the first place, moreover it would mean a significant interface change (and
   code breakage), so I guess it is rather unlikely). From what I gather from
   searching the net and looking at peps there is also not much of a chance to
   get a new operator anytime soon.

3. Wrap:
   Promises to be a maintenance nightmare (BTW: will the new array class be
   subclassable?), but otherwise looks feasible. Has anyone done this?

4. Preprocess:
   Would have the advantage that I could always get back to "standard"
   python/Numeric code, but would involve quite a bit of work and might also
   break tools that parse python.

5. Hack array:
   Seems evil, but in principle feasible, because AFAIK '<<' isn't used in
   Numeric itself and hopefully it wouldn't involve too much work.
   However, given a free choice '<<' is hardly the operator I would
   choose to represent the dot product.

I am not completely sure what the policy and rationale behind the
current division of labor between functions, methods and attributes in
numarray is but as far as the lack of a missing transposition postfix
operator is concerned, one reasonable approach to make the transpose
operation more readable (without the need to change anything in python
itself) would seem to me to provide arrays with an attribute .t or .T
so that:

  a.t == transpose(a)

While this would admittedly be a bit of a syntactic hack, the
operation is so commonplace and the gain in readability (in my eyes)
is so significant that it would seem to me to merit serious
consideration (maybe a.cc or so might also be an idea for complex
conjugate notation, but I'm less convinced about that?).

If the addition of a single operator for the exclusive benefit of Numeric
users is rejected by the core python team, maybe it's worthwhile lobbying for
some mechanism that allows users to define new operators (like
e.g. in Haskell)...

OK, that's all -- thanks for bearing with me through this long
email. Suggestions and comments about the patch [2] and possible solutions
to issues raised are greatly appreciated,

regards,

alexander schmolck

Footnotes: 
[1] timing results for the patched version of Numeric[2] comparing new and old
    'dot' performance:
    http://www.dcs.ex.ac.uk/~aschmolc/Numeric/TimingsForAtlasDot.txt

[2] A patch for Numeric 21.1b that speeds up Numeric's dot function by several
    factors for large matrices can be found here:
    http://www.dcs.ex.ac.uk/~aschmolc/Numeric/

[3] ATLAS (http://math-atlas.sourceforge.net/) is a project to provide a
    highly optimized version of BLAS (Basic Linear Algebra
    Subroutines, a standard and thouroughly tested
    implementation of basic linear algebra operations) and some LAPACK
    routines. The charm of ATLAS is that it is platform-independent and yet
    highly optimized, which it achieves by essentially fine tuning a number of
    parameters until optimum performance for the _particular_ machine on which
    it is built is reached. As a consequence complete builds can take some
    time, but binary versions for ATLAS for common processors are available
    from http://www.netlib.org/atlas/archives (moreover, even if one decides
    to build ATLAS oneself, the search space can be considerably cut down if
    one accepts the suggested "experience" values offered during the make
    process).

-- 
Alexander Schmolck     Postgraduate Research Student
                       Department of Computer Science
                       University of Exeter
A.Schmolck@gmx.net     http://www.dcs.ex.ac.uk/people/aschmolc/

[Numpy-discussion] numarray interface and performance issues (for dot product and transpose)

a.schmolck＠gmx.net