Efficient operator overloading
On 4/18/2007 7:33 AM, Anne Archibald wrote:
copying. And the scope of improvement would be very limited; an expression like A*B+C*D would be much more efficient, probably, if the whole expression were evaluated at once for each element (due to memory locality and temporary allocation). But it is impossible for numpy, sitting inside python as it does, to do that.
Most numerical array/matrix libraries that depend on operator overloading generate temporaries. That is why Fortran is usually perceived as superior to C++ for scientific programming. The Fortran compiler knows about arrays and can avoid allocating three temporary arrays to evaluate an expression like

    y = a * b + c * d

If this expression is evaluated by a Fortran 90/95 compiler, it will automatically generate code like

    do i = 1,n
       y(i) = a(i) * b(i) + c(i) * d(i)
    enddo

On the other hand, conventional use of overloaded operators would result in something like this:

    allocate(tmp1,n)
    do i = 1,n
       tmp1(i) = a(i) * b(i)
    enddo
    allocate(tmp2,n)
    do i = 1,n
       tmp2(i) = c(i) * d(i)
    enddo
    allocate(tmp3,n)
    do i = 1,n
       tmp3(i) = tmp1(i) + tmp2(i)
    enddo
    deallocate(tmp1)
    deallocate(tmp2)
    do i = 1,n
       y(i) = tmp3(i)
    enddo
    deallocate(tmp3)

Traversing memory is one of the most expensive things a CPU can do. This approach is therefore extremely inefficient compared with what a Fortran compiler can do. This does not mean that all use of operator overloading is inherently bad. Notably, there is a C++ numerical library called Blitz++ which can avoid these temporaries for small fixed-size arrays. As it depends on template metaprogramming, the size must be known at compile time. But if this is the case, it can be just as efficient as Fortran if the optimizer is smart enough to remove the redundant operations. Most modern C++ compilers are smart enough to do this. Note that it only works for fixed-size arrays. Fortran compilers can do this on a more general basis. It is therefore advisable to have array syntax built into the language itself, as in Fortran 90/95 and Ada 95. But if we implement the operator overloading a bit more intelligently, it should be possible to get rid of most of the temporary arrays. We could replace temporary arrays with an "unevaluated expression class" and let the library pretend it is a compiler.
Let us assume again we have an expression like

    y = a * b + c * d

where a, b, c and d are all arrays or matrices. In this case, the overloaded * and + operators would not return a temporary array but an unevaluated expression of class Expr. Thus we would get

    tmp1 = Expr('__mul__',a,b)       # symbolic representation of 'a * b'
    tmp2 = Expr('__mul__',c,d)       # symbolic representation of 'c * d'
    tmp3 = Expr('__add__',tmp1,tmp2) # symbolic "a * b + c * d"
    del tmp1
    del tmp2
    y = tmp3  # y becomes a reference to an unevaluated expression

Finally, we need a mechanism to 'flush' the unevaluated expression. Python does not allow the assignment operator to be overloaded, so one could not depend on that. But one could use a 'flush on read/write' mechanism, and let an Expr object exist in different finite states (e.g. unevaluated and evaluated). If anyone tries to read an element from y or change any of the objects it involves, the expression gets evaluated without temporaries. Before that, there is no need to evaluate the expression at all! We can just keep a symbolic representation of it. Procrastination is good! Thus later on...

    x = y[i]   # The expression 'a * b + c * d' gets evaluated. The
               # object referred to by y now holds an actual array.

or

    a[i] = 2.0 # The expression 'a * b + c * d' gets evaluated. The
               # object referred to by y now holds an actual array.
               # Finally, 2.0 is written to a[i].

or

    y[i] = 2.0 # The expression 'a * b + c * d' gets evaluated. The
               # object referred to by y now holds an actual array.
               # Finally, 2.0 is written to y[i].

When the expression 'a * b + c * d' is finally evaluated, we should through symbolic manipulation get something (at least close to) a single efficient loop:

    do i = 1,n
       y(i) = a(i) * b(i) + c(i) * d(i)
    enddo

I'd really like to see a Python extension library do this one day. It would be very cool and (almost) as efficient as plain Fortran - though not quite, we would still get some small temporary objects created.
But that is a sacrifice I am willing to make to use Python. We would gain some efficiency over Fortran by postponing indefinitely the evaluation of computations that turn out never to be needed, when this is not known at compile time. Any comments? Sturla Molden Ph.D.
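Sturla's Expr mechanism can be sketched in pure Python. This is only an illustration under assumptions not in the post: the LazyArray wrapper, the _leaf helper and the method layout are invented here, and plain Python lists stand in for numpy arrays. Indexing triggers the 'flush on read', and the whole tree collapses into a single per-element pass with no temporary arrays.

```python
import operator

_OPS = {'__mul__': operator.mul, '__add__': operator.add}

def _leaf(node):
    # Walk down to a concrete operand to discover the array length.
    while isinstance(node, Expr):
        node = node.left
    return node

class Expr:
    """Symbolic elementwise expression; flushed lazily on first read."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
        self._value = None                 # cache, filled on first access

    def __mul__(self, other):
        return Expr('__mul__', self, other)

    def __add__(self, other):
        return Expr('__add__', self, other)

    def _elem(self, node, i):
        # One recursive descent per element i: the whole tree is combined
        # in a single pass, with no temporary arrays.
        if isinstance(node, Expr):
            return _OPS[node.op](self._elem(node.left, i),
                                 self._elem(node.right, i))
        return node[i]                     # a plain operand

    def __getitem__(self, i):
        # 'Flush on read': indexing triggers the evaluation.
        if self._value is None:
            n = len(_leaf(self))
            self._value = [self._elem(self, k) for k in range(n)]
        return self._value[i]

class LazyArray(list):
    """Plain operand whose operators build Expr trees instead of looping."""
    def __mul__(self, other):
        return Expr('__mul__', self, other)

    def __add__(self, other):
        return Expr('__add__', self, other)
```

With this, `y = a * b + c * d` builds only a small symbolic tree, and `y[0]` performs the whole computation at once - the unevaluated/evaluated states Sturla describes.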
This does not mean that all use of operator overloading is inherently bad. Notably, there is a C++ numerical library called Blitz++ which can avoid these temporaries for small fixed-size arrays. As it depends on template metaprogramming, the size must be known at compile time. But if this is the case, it can be just as efficient as Fortran if the optimizer is smart enough to remove the redundant operations. Most modern C++ compilers are smart enough to do this. Note that it only works for fixed-size arrays. Fortran compilers can do this on a more general basis. It is therefore advisable to have array syntax built into the language itself, as in Fortran 90/95 and Ada 95.
Boost.ublas uses template expressions; I implemented them in my own matrix lib, and it's very simple to do. Even for big matrices, it can be worthwhile. If the size is static, bothering with fixed-size arrays is useless - the compilers do optimize away the temporaries; but using template expressions with iterators for general arrays allows one to approach fixed-size-array speed. Matthieu
On 4/18/07, Sturla Molden
On 4/18/2007 7:33 AM, Anne Archibald wrote:
copying. And the scope of improvement would be very limited; an expression like A*B+C*D would be much more efficient, probably, if the whole expression were evaluated at once for each element (due to memory locality and temporary allocation). But it is impossible for numpy, sitting inside python as it does, to do that.
Most numerical array/matrix libraries that depend on operator overloading generate temporaries. That is why Fortran is usually perceived as superior to C++ for scientific programming. The Fortran compiler knows about arrays and can avoid allocating three temporary arrays to evaluate an expression like
y = a * b + c * d [SNIP]
I'd really like to see a Python extension library do this one day. It would be very cool and (almost) as efficient as plain Fortran - though not quite, we would still get some small temporary objects created. But that is a sacrifice I am willing to make to use Python. We would gain some efficiency over Fortran by postponing indefinitely the evaluation of computations that turn out never to be needed, when this is not known at compile time.
Any comments?
I periodically wonder if something like this could be built on top of numexpr without too much pain. Simply build up the expression on operations and then evaluate when data is requested. You'd need some sort of Jekyll & Hyde matrix that spent the first part of its life unevaluated and then converted itself to something very much like an array when it evaluated itself. It might end up being tricky to make sure that everything got transparently evaluated at the right time, though. I've never had time to try it.
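Tim's "Jekyll & Hyde" idea can be sketched roughly as follows. Everything here is an assumption for illustration: the class name DeferredArray is invented, plain Python lists stand in for arrays, and a per-element eval() stands in for the single numexpr.evaluate(...) call a real implementation would make on the accumulated expression string.

```python
import itertools

class DeferredArray:
    """Starts life as a symbolic expression, turns into data when read."""
    _names = itertools.count()          # fresh operand names: v0, v1, ...

    def __init__(self, data=None, expr=None, operands=None):
        if data is not None:            # the "Jekyll" side: concrete data
            name = 'v%d' % next(DeferredArray._names)
            expr, operands = name, {name: data}
        self.expr, self.operands = expr, operands
        self._data = data

    def _combine(self, other, symbol):
        # Operators only grow the expression string; nothing is computed.
        ops = dict(self.operands)
        ops.update(other.operands)
        return DeferredArray(expr='(%s %s %s)' % (self.expr, symbol, other.expr),
                             operands=ops)

    def __mul__(self, other):
        return self._combine(other, '*')

    def __add__(self, other):
        return self._combine(other, '+')

    def _evaluate(self):
        # The "Hyde" side evaluates itself on first access.  With numexpr
        # this would be one call on the whole string; eval() per element
        # stands in for it here.
        if self._data is None:
            n = len(next(iter(self.operands.values())))
            self._data = [eval(self.expr, {},
                               {k: v[i] for k, v in self.operands.items()})
                          for i in range(n)]
        return self._data

    def __getitem__(self, i):
        return self._evaluate()[i]
```

As Tim notes, the hard part is not this bookkeeping but making sure every access path (slicing, asarray, printing, writes to operands) transparently triggers the evaluation.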
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
-- //=][=\\ tim.hochberg@ieee.org
Timothy Hochberg wrote:
On 4/18/07, *Sturla Molden*
<mailto:sturla@molden.no> wrote: On 4/18/2007 7:33 AM, Anne Archibald wrote:
>copying. And the scope of improvement would be very limited; an expression like A*B+C*D would be much more efficient, probably, if the whole expression were evaluated at once for each element (due to memory locality and temporary allocation). But it is impossible for numpy, sitting inside python as it does, to do that.
Most numerical array/matrix libraries that depend on operator overloading generate temporaries. That is why Fortran is usually perceived as superior to C++ for scientific programming. The Fortran compiler knows about arrays and can avoid allocating three temporary arrays to evaluate an expression like
y = a * b + c * d [SNIP]
I'd really like to see a Python extension library do this one day. It would be very cool and (almost) as efficient as plain Fortran - though not quite, we would still get some small temporary objects created. But that is a sacrifice I am willing to make to use Python. We would gain some efficiency over Fortran by postponing indefinitely the evaluation of computations that turn out never to be needed, when this is not known at compile time.
Any comments?
I periodically wonder if something like this could be built on top of numexpr without too much pain. Simply build up the expression on operations and then evaluate when data is requested. You'd need some sort of Jekyll & Hyde matrix that spent the first part of its life unevaluated and then converted itself to something very much like an array when it evaluated itself. It might end up being tricky to make sure that everything got transparently evaluated at the right time, though.
I've never had time to try it though.
Hi, Correct me if I'm wrong, but it is not possible to code something working like boost++ in Python, because we cannot overload the = operator. As a result you cannot implement the way boost++ works: the + * - / operators build a tree, and all the loops are done in the = operator code. I am not saying it is impossible to code an efficient way to deal with operators and matrices in Python - only that it looks impossible to match the boost++ way of thinking. Comments? Xavier
--
############################################
Xavier Gnata
CRAL - Observatoire de Lyon
9, avenue Charles André
69561 Saint Genis Laval cedex
Phone: +33 4 78 86 85 28
Fax: +33 4 78 86 83 86
E-mail: gnata@obs.univ-lyon1.fr
############################################
Timothy Hochberg wrote:
Correct me if I'm wrong, but it is not possible to code something working like boost++ in Python, because we cannot overload the = operator. As a result you cannot implement the way boost++ works: the + * - / operators build a tree, and all the loops are done in the = operator code.
I am not saying it is impossible to code an efficient way to deal with operators and matrices in Python - only that it looks impossible to match the boost++ way of thinking.
I have three comments:

1. Boost++ depends on template metaprogramming. For this to work, array bounds must be known in advance. If you don't know the dimensions in advance, the compiler cannot get rid of the temporary storage and redundant loops. But for small fixed-size arrays, Boost++ and Blitz++ can do miracles.

2. Template metaprogramming and symbolic maths are very different subjects. Evaluating symbols at run time is very different from evaluating templates at compile time. If you know the eval() function in Python or Lisp, it is probably easier to understand what I suggested; eval() and Lisp macros are not generally understood by the C++ community.

3. My suggestion did not require overloading the assignment operator (=). I suggested letting the "Jekyll & Hyde array" evaluate itself when its content is requested (e.g. using the indexing operator [] as one such signal). Actually, not overloading the assignment operator is important, as it allows us to procrastinate evaluation of expressions as long as we can. It may seem strange, but it is an important optimization technique. The things you don't do don't take any time.
Sturla Molden wrote:
Timothy Hochberg wrote:
Correct me if I'm wrong but it is not possible to code something working like boost++ in python. It is due to the fact we cannot overload the = operator in python. As a result you cannot implement the way boost++ works : + * - / operators build a tree and all the loops are done in the = operator code.
I do not say that it is impossible to code an efficient way to deal with operators and matrix in python but only that it looks not possible to match the boost++ way of thinking.
I have three comments:
1. Boost++ depends on template metaprogramming. For this to work, array bounds must be known in advance. If you don't know the dimensions in advance, the compiler cannot get rid of the temporary storage and redundant loops. But for small fixed-size arrays, Boost++ and Blitz++ can do miracles.

2. Template metaprogramming and symbolic maths are very different subjects. Evaluating symbols at run time is very different from evaluating templates at compile time. If you know the eval() function in Python or Lisp, it is probably easier to understand what I suggested; eval() and Lisp macros are not generally understood by the C++ community.

3. My suggestion did not require overloading the assignment operator (=). I suggested letting the "Jekyll & Hyde array" evaluate itself when its content is requested (e.g. using the indexing operator [] as one such signal). Actually, not overloading the assignment operator is important, as it allows us to procrastinate evaluation of expressions as long as we can. It may seem strange, but it is an important optimization technique. The things you don't do don't take any time.
Yes, yes and yes :). I don't know if I'm a Python or a C++ programmer, but I fully understand what eval() is. eval() is not evil in a Python/Lisp context - not at all! I do like having the capability to play with str and eval in Python (I also like template metaprogramming...). My point was only that the way to provide the user with these lazy operators can't be the same as in C++. Sorry if my message was not clear :(. What you want is symbolic maths with matrices and a way to trigger the evaluation. Well, as far as I have seen so far, it looks pretty hard to code that with a pythonic syntax and a way to report errors. It is also pretty hard to avoid nasty infinite recursions when playing with "to be evaluated" objects. I'm going to have a look at this scipy sandbox module. Xavier
--
############################################
Xavier Gnata
CRAL - Observatoire de Lyon
9, avenue Charles André
69561 Saint Genis Laval cedex
Phone: +33 4 78 86 85 28
Fax: +33 4 78 86 83 86
E-mail: gnata@obs.univ-lyon1.fr
############################################
On Apr 18, 2007, at 6:31 PM, Timothy Hochberg wrote:
On 4/18/07, Sturla Molden
wrote: I'd really like to see a Python extension library do this one day. It would be very cool and (almost) as efficient as plain Fortran - though not quite, we would still get some small temporary objects created. But that is a sacrifice I am willing to make to use Python. We would gain some efficiency over Fortran by postponing indefinitely the evaluation of computations that turn out never to be needed, when this is not known at compile time.
Any comments?
I periodically wonder if something like this could be built on top of numexpr without too much pain. Simply build up the expression on operations and then evaluate when data is requested. You'd need some sort of Jekyll & Hyde matrix that spent the first part of its life unevaluated and then converted itself to something very much like an array when it evaluated itself. It might end up being tricky to make sure that everything got transparently evaluated at the right time, though.
It seems that we have developed something like this for the fipy code (http://www.ctcms.nist.gov/fipy). We refer to it as "lazy evaluation". Essentially, objects called "Variable"s hold numpy arrays. They behave as ordinary numpy arrays, but are only evaluated when the information is requested. e.g.

    >>> from fipy import *
    >>> v0 = Variable(numerix.array((0,1,2)))
    >>> v1 = Variable(numerix.array((3,4,5)))
    >>> v2 = v0 * v1 + v0 * v1
    >>> v2
    ((Variable(value = array([0, 1, 2])) * Variable(value = array([3, 4, 5])))
     + (Variable(value = array([0, 1, 2])) * Variable(value = array([3, 4, 5]))))
    >>> print v2
    [ 0  8 20]

By passing an '--inline' flag at run time we can choose to evaluate this expression with weave. Variables have a _getCstring() method that's used to pass to weave when inlining, along with other information.

    >>> print v2._getCstring()
    '((var00(i) * var01(i)) + (var10(i) * var11(i)))'

Actually, we could and should have been using numexpr to do this. I think numexpr would substantially clean up the code we've developed. Anyway, you can browse variable.py at http://matdl-osi.org/fipy/browser/trunk/fipy/variables/variable.py. Cheers
--
Daniel Wheeler
In the SciPy sandbox there is a module that goes in this direction. It was mentioned in the other optimization thread. numexpr: http://projects.scipy.org/scipy/scipy/browser/trunk/Lib/sandbox/numexpr This module needs to be combined with a derived array class whose operators return unevaluated expressions (lazy evaluation). All access methods and asarray must call the evaluation function, like you proposed. The unevaluated expressions and a compiler for them are already in the numexpr module. A very basic version of this would probably not be much work. I would like to replace

    import numpy as N

by

    import lazynumpy as N

:-)

Regards, Eike.
On Apr 18, 2007, at 2:14 PM, Sturla Molden wrote: [...]
Let us assume again we have an expression like
y = a * b + c * d
where a, b, c and d are all arrays or matrices. In this case, the overloaded * and + operators would not return a temporary array but an unevaluated expression of class Expr. Thus we would get
    tmp1 = Expr('__mul__',a,b)       # symbolic representation of 'a * b'
    tmp2 = Expr('__mul__',c,d)       # symbolic representation of 'c * d'
    tmp3 = Expr('__add__',tmp1,tmp2) # symbolic "a * b + c * d"
    del tmp1
    del tmp2
    y = tmp3  # y becomes a reference to an unevaluated expression
Finally, we need a mechanism to 'flush' the unevaluated expression. Python does not allow the assignment operator to be overloaded, so one could not depend on that. But one could use a 'flush on read/write'
At the risk of being remembered only for dredging up hacks: there is something similar that would at least let someone try working along these lines. Nothing prevents you from overloading one of the augmented operators (e.g., |=) to trigger the evaluation. True, it uses up one of the operators that could be used for a numeric purpose, but it can at least be used to test the idea and see how workable it is. (I forget who originally suggested this, but I heard it suggested at the Long Beach Python Conference; it might have been Tim Peters.) Perry
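Perry's hack can be sketched in a few lines. The class names (Lazy, Result) and the thunk-based representation are invented for illustration; the point is only that `|=` on the target object serves as the explicit "evaluate now" signal, with all the loops running inside `__ior__`, Boost-style.

```python
class Lazy:
    """Wraps a thunk; arithmetic composes thunks instead of computing."""
    def __init__(self, thunk):
        self.thunk = thunk

    def __mul__(self, other):
        # Build a new deferred computation; nothing runs yet.
        return Lazy(lambda: [x * y for x, y in zip(self.thunk(), other.thunk())])

    def __add__(self, other):
        return Lazy(lambda: [x + y for x, y in zip(self.thunk(), other.thunk())])

class Result:
    """Target whose |= operator is hijacked as the evaluation trigger."""
    def __init__(self):
        self.data = None

    def __ior__(self, expr):
        self.data = expr.thunk()       # all the loops run here
        return self
```

Usage: `y = Result(); y |= a * b + c * d` evaluates the whole expression in one step, at the cost of giving `|=` a non-numeric meaning.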
On Wednesday 18 April 2007 20:14, Sturla Molden wrote:
    a[i] = 2.0 # The expression 'a * b + c * d' gets evaluated. The
               # object referred to by y now holds an actual array.
               # Finally, 2.0 is written to a[i].
This case will require some extra work. The array needs to remember that there is an unevaluated expression depending on it. Regards, Eike.
On Wednesday 18 April 2007 20:14, Sturla Molden wrote:
This case will require some extra work. The array needs to remember that there is an unevaluated expression depending on it.
This is certainly a complication that needs to be worked out. An array could store a set of dependent expressions, but I don't know if that is efficient enough. The most naïve solution would be to make arrays immutable (copy-on-write) like strings, but that is not very appealing. Actually, Matlab's matrices behave exactly like this - and it isn't that bad - but I much prefer mutable NumPy arrays.
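The "array stores a set of dependent expressions" idea Sturla and Eike are discussing can be sketched with weak references, so pending expressions don't keep dead results alive. Everything here is a hypothetical illustration (the names TrackedArray and PendingExpr are invented, and lists stand in for numpy arrays): a write to an operand flushes every still-pending expression before mutating, so each expression sees the values it was built from.

```python
import operator
import weakref

_OPS = {'*': operator.mul, '+': operator.add}

class TrackedArray:
    """Mutable operand that remembers which pending expressions read it."""
    def __init__(self, data):
        self.data = list(data)
        self._dependents = []          # weakrefs to unevaluated expressions

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def __setitem__(self, i, value):
        # Flush every still-alive dependent expression *before* mutating.
        for ref in self._dependents:
            expr = ref()
            if expr is not None:
                expr.flush()
        self._dependents = []
        self.data[i] = value

    def __mul__(self, other):
        return PendingExpr('*', self, other)

    def __add__(self, other):
        return PendingExpr('+', self, other)

class PendingExpr:
    """Unevaluated elementwise expression over TrackedArrays."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
        self.value = None
        # Register with every concrete array this tree depends on.
        for leaf in self._leaves():
            leaf._dependents.append(weakref.ref(self))

    def _leaves(self):
        for node in (self.left, self.right):
            if isinstance(node, PendingExpr):
                for leaf in node._leaves():
                    yield leaf
            else:
                yield node

    def __mul__(self, other):
        return PendingExpr('*', self, other)

    def __add__(self, other):
        return PendingExpr('+', self, other)

    def _elem(self, node, i):
        if isinstance(node, PendingExpr):
            return _OPS[node.op](self._elem(node.left, i),
                                 self._elem(node.right, i))
        return node[i]

    def flush(self):
        if self.value is None:
            n = len(next(self._leaves()))
            self.value = [self._elem(self, i) for i in range(n)]

    def __getitem__(self, i):
        self.flush()
        return self.value[i]
```

Whether scanning a dependent list on every write is cheap enough is exactly the open question raised above; copy-on-write avoids the bookkeeping at the cost of immutability.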
participants (7)
- Daniel Wheeler
- Eike Welk
- Matthieu Brucher
- Perry Greenfield
- Sturla Molden
- Timothy Hochberg
- Xavier Gnata