Numpy and PEP 343
<pie-in-the-sky>

An idea that has popped up from time to time is delaying evaluation of complicated expressions so that the result can be computed more efficiently. For instance, the matrix expression:

a = b*c + d*e

results in the creation of two potentially large temporary matrices, and also runs a couple more loops at the C level than the equivalent expression implemented directly in C would.

The general idea has been to construct some sort of pseudo-object when the numerical operations are indicated, then do the actual numerical operations at some later time. This would be very problematic if implemented for all arrays, since it would quickly become impossible to figure out what was going on, particularly with view semantics. However, it could result in large performance improvements without becoming incomprehensible if implemented in small enough chunks.

A "straightforward" approach would look something like:

numpy.begin_defer()   # Now all numpy operations (in this thread) are deferred
a = b*c + d*e         # 'a' is a special object that holds pointers to
                      # 'b', 'c', 'd' and 'e' and knows what ops to perform.
numpy.end_defer()     # 'a' performs the operations and now looks like an array

Since 'a' knows the whole series of operations in advance, it can perform them more efficiently than would be possible using the basic numpy machinery. Ideally, the space for 'a' could be allocated up front, and all of the operations could be done in a single loop. In practice the optimization might be somewhat less ambitious, depending on how much energy people put into this.

However, this approach has some problems. One is the syntax, which is clunky and a bit unsafe (a missing end_defer in a function could cause stuff to break very far away). The other is that I suspect this sort of deferred evaluation makes multiple views of an array even more likely to bite the unwary.

The syntax issue can be cleanly addressed now that PEP 343 (the 'with' statement) is going into Python 2.5. Thus the above would look like:

with numpy.deferral():
    a = b*c + d*e

Just removing the extra allocation of temporary variables can result in 30% speedup for this case[1], so the payoff would likely be large. On the down side, it could be quite a can of worms, and would likely require a lot of work to implement.

Food for thought anyway.

</pie-in-the-sky>

-tim

[1]
from timeit import Timer
print Timer('a = b*c + d*e',
            'from numpy import arange;b=c=d=e=arange(100000.)').timeit(10000)
print Timer('a = b*c; multiply(d,e,temp); a+=temp',
            'from numpy import arange, zeros, multiply;'
            'b=c=d=e=arange(100000.);temp=zeros([100000], dtype=float)').timeit(10000)
=> 94.8665989672
   62.6143562939
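For a concrete picture of what a numpy.deferral() object would have to be, here is a minimal sketch of the PEP 343 side of the idea only. Everything below is hypothetical and made up for illustration (deferral, is_deferred, and the per-thread flag are not numpy APIs); the hard part, having array operations consult the flag and build expression graphs, is not shown.

import threading

_state = threading.local()          # per-thread flag, matching "in this thread" above

class deferral(object):
    """Hypothetical context manager: defer array operations inside the block."""
    def __enter__(self):
        self.previous = getattr(_state, 'deferred', False)
        _state.deferred = True      # array ops would now record themselves lazily
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        _state.deferred = self.previous   # restored even if the block raises
        return False                      # do not swallow exceptions

def is_deferred():
    return getattr(_state, 'deferred', False)

Usage would then look exactly like the 'with' example above; because __exit__ always runs, the "missing end_defer" failure mode goes away.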
Tim Hochberg <tim.hochberg@cox.net> writes:
<pie-in-the-sky>
An idea that has popped up from time to time is delaying evaluation of complicated expressions so that the result can be computed more efficiently. For instance, the matrix expression:
a = b*c + d*e
results in the creation of two potentially large temporary matrices, and also runs a couple more loops at the C level than the equivalent expression implemented directly in C would.
The general idea has been to construct some sort of pseudo-object when the numerical operations are indicated, then do the actual numerical operations at some later time. This would be very problematic if implemented for all arrays, since it would quickly become impossible to figure out what was going on, particularly with view semantics. However, it could result in large performance improvements without becoming incomprehensible if implemented in small enough chunks.
A "straightforward" approach would look something like:
numpy.begin_defer()   # Now all numpy operations (in this thread) are deferred
a = b*c + d*e         # 'a' is a special object that holds pointers to
                      # 'b', 'c', 'd' and 'e' and knows what ops to perform.
numpy.end_defer()     # 'a' performs the operations and now looks like an array
Since 'a' knows the whole series of operations in advance, it can perform them more efficiently than would be possible using the basic numpy machinery. Ideally, the space for 'a' could be allocated up front, and all of the operations could be done in a single loop. In practice the optimization might be somewhat less ambitious, depending on how much energy people put into this. However, this approach has some problems. One is the syntax, which is clunky and a bit unsafe (a missing end_defer in a function could cause stuff to break very far away). The other is that I suspect this sort of deferred evaluation makes multiple views of an array even more likely to bite the unwary.
This is a good idea; probably a bit difficult. I don't like the global defer context though. That could get messy, especially if you start calling functions.
The syntax issue can be cleanly addressed now that PEP 343 (the 'with' statement) is going into Python 2.5. Thus the above would look like:
with numpy.deferral():
    a = b*c + d*e
Just removing the extra allocation of temporary variables can result in 30% speedup for this case[1], so the payoff would likely be large. On the down side, it could be quite a can of worms, and would likely require a lot of work to implement.
Alternatively, make some sort of expression type:

ex = VirtualExpression()
ex.a = ex.b * ex.c + ex.d * ex.e

then,

compute = ex.compile(a=(shape_of_a, dtype_of_a), etc.....)

This could return a function that would look like

def compute(b, c, d, e):
    a = empty(shape_of_a, dtype=dtype_of_a)
    multiply(b, c, a)
    # ok, I'm making this one up :-)
    fused_multiply_add(d, e, a)
    return a

a = compute(b, c, d, e)

Or, it could use some sort of numeric-specific bytecode that can be interpreted quickly in C. With some sort of optimizing compiler for that bytecode it could be really fun (it could use BLAS when appropriate, for instance!).

or ... use weave :-)

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
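To make the VirtualExpression idea a bit more tangible, here is a bare-bones sketch of the operator-overloading half of it. All the names are made up for illustration, and a real implementation would fuse the loops rather than recurse; only the "record now, evaluate later" structure is shown.

import numpy

class Expr(object):
    """Hypothetical symbolic node: records operations instead of performing them."""
    def __init__(self, op, args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Expr('add', (self, other))
    def __mul__(self, other):
        return Expr('mul', (self, other))
    def evaluate(self, namespace, out=None):
        if self.op == 'var':
            return namespace[self.args]
        left = self.args[0].evaluate(namespace)
        right = self.args[1].evaluate(namespace)
        func = numpy.add if self.op == 'add' else numpy.multiply
        return func(left, right, out)   # 'out' lets the caller preallocate the result

def var(name):
    return Expr('var', name)

# ex.a = ex.b*ex.c + ex.d*ex.e, spelled explicitly:
b, c, d, e = [var(n) for n in 'bcde']
tree = b*c + d*e
arrays = dict(b=numpy.arange(1e6), c=numpy.arange(1e6),
              d=numpy.arange(1e6), e=numpy.arange(1e6))
a = tree.evaluate(arrays, out=numpy.empty(1000000))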
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
<pie-in-the-sky>
An idea that has popped up from time to time is delaying evaluation of complicated expressions so that the result can be computed more efficiently. For instance, the matrix expression:
a = b*c + d*e
results in the creation of two potentially large temporary matrices, and also runs a couple more loops at the C level than the equivalent expression implemented directly in C would.
The general idea has been to construct some sort of pseudo-object when the numerical operations are indicated, then do the actual numerical operations at some later time. This would be very problematic if implemented for all arrays, since it would quickly become impossible to figure out what was going on, particularly with view semantics. However, it could result in large performance improvements without becoming incomprehensible if implemented in small enough chunks.
A "straightforward" approach would look something like:
numpy.begin_defer()   # Now all numpy operations (in this thread) are deferred
a = b*c + d*e         # 'a' is a special object that holds pointers to
                      # 'b', 'c', 'd' and 'e' and knows what ops to perform.
numpy.end_defer()     # 'a' performs the operations and now looks like an array
Since 'a' knows the whole series of operations in advance, it can perform them more efficiently than would be possible using the basic numpy machinery. Ideally, the space for 'a' could be allocated up front, and all of the operations could be done in a single loop. In practice the optimization might be somewhat less ambitious, depending on how much energy people put into this. However, this approach has some problems. One is the syntax, which is clunky and a bit unsafe (a missing end_defer in a function could cause stuff to break very far away). The other is that I suspect this sort of deferred evaluation makes multiple views of an array even more likely to bite the unwary.
This is a good idea; probably a bit difficult.
It's not original. I think this idea comes around periodically, but dies from a combination of it being nontrivial and the resulting syntax being too heavyweight.
I don't like the global defer context though. That could get messy, especially if you start calling functions.
I'm not crazy about it either. You could localize it with an appropriate (ab)use of sys._getframe, but that's another potential can of worms. Something like:

class deferral:
    frames = set()
    def __enter__(self):
        self.frame = sys._getframe(1)
        self.frames.add(self.frame)
    def __exit__(self, *args):
        self.frames.discard(self.frame)
        self.frame = None

def should_defer():
    return (sys._getframe(1) in deferral.frames)

Then:

with deferral():
    # stuff

Should be localized to just 'stuff', even if it calls other functions[1]. The details might be sticky though....
The syntax issue can be cleanly addressed now that PEP 343 (the 'with' statement) is going into Python 2.5. Thus the above would look like:
with numpy.deferral():
    a = b*c + d*e
Just removing the extra allocation of temporary variables can result in 30% speedup for this case[1], so the payoff would likely be large. On the down side, it could be quite a can of worms, and would likely require a lot of work to implement.
Alternatively, make some sort of expression type:
ex = VirtualExpression()
ex.a = ex.b * ex.c + ex.d * ex.e
then,
compute = ex.compile(a=(shape_of_a, dtype_of_a), etc.....)
This could return a function that would look like
def compute(b, c, d, e):
    a = empty(shape_of_a, dtype=dtype_of_a)
    multiply(b, c, a)
    # ok, I'm making this one up :-)
    fused_multiply_add(d, e, a)
    return a
a = compute(b, c, d, e)
The syntax seems too heavy to me. It would be significantly lighter if the explicit compile step were optional, allowing:

ex = VirtualExpression()
ex.a = ex.b * ex.c + ex.d * ex.e
a = ex(b=b, c=c, d=d, e=e)

'ex' could then figure out all of the sizes and types itself, create the function, and compute the result. The created function would be cached, and whenever the input parameters matched it would just be reused, so there shouldn't be too much more overhead than with the compiled version you suggest. The syntax is still heavy relative to the 'with' version though.
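As a rough illustration of the caching just described, the compiled function can simply be memoized on a signature built from the argument names, shapes and dtypes. The "compiler" below is a deliberately dumb stand-in (it just pre-compiles the Python expression and calls eval), and all names are made up; only the cache-and-reuse logic is the point.

import numpy

_cache = {}   # (expression, signature) -> callable

def _compile(expression, names):
    # Stand-in compiler: a real one would emit a fused loop over the arrays.
    code = compile(expression, '<expr-sketch>', 'eval')
    def func(*args):
        return eval(code, {}, dict(zip(names, args)))
    return func

def evaluate_cached(expression, **arrays):
    names = tuple(sorted(arrays))
    # Key on everything the generated code would depend on.
    signature = tuple((n, arrays[n].shape, arrays[n].dtype.str) for n in names)
    key = (expression, signature)
    func = _cache.get(key)
    if func is None:
        func = _compile(expression, names)
        _cache[key] = func
    return func(*[arrays[n] for n in names])

# a = evaluate_cached("b*c + d*e", b=b, c=c, d=d, e=e)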
Or, it could use some sort of numeric-specific bytecode that can be interpreted quickly in C. With some sort of optimizing compiler for that bytecode it could be really fun (it could use BLAS when appropriate, for instance!).
or ... use weave :-)
I'll have to look at weave again. Last time I looked at it (quite a while ago) it didn't work for me. I can't recall if it was a licensing issue or it didn't work with my compiler or what, but I decided I couldn't use it.

-tim

[1] Here's an example that fakes with and tests deferral:

import sys

class deferral:
    frames = set()
    def __enter__(self):
        self.frame = sys._getframe(1)
        self.frames.add(self.frame)
    def __exit__(self, *args):
        self.frames.discard(self.frame)
        self.frame = None

def should_defer():
    return (sys._getframe(1) in deferral.frames)

def f(n):
    if not n:
        return
    if n % 4:
        print "should_defer() =", should_defer(), "for n =", n
        f(n-1)
    else:
        # This is a rough translation of:
        #     with deferral():
        #         print "should_defer() =", should_defer(), "in f"
        #         g(n-1)
        d = deferral()
        d.__enter__()
        try:
            print "should_defer() =", should_defer(), "for n =", n
            f(n-1)
        finally:
            d.__exit__(None, None, None)

f(10)
Tim Hochberg wrote:
[SNIP]

Ugh. This last part got mangled somehow.
[1] Here's an example that fakes with and tests deferral:
import sys
class deferral:
    frames = set()
    def __enter__(self):
        self.frame = sys._getframe(1)
        self.frames.add(self.frame)
    def __exit__(self, *args):
        self.frames.discard(self.frame)
        self.frame = None

def should_defer():
    return (sys._getframe(1) in deferral.frames)
[This got all jammed together, sorry]
def f(n):
    if not n:
        return
    if n % 4:
        print "should_defer() =", should_defer(), "for n =", n
        f(n-1)
    else:
        # This is a rough translation of:
        #     with deferral():
        #         print "should_defer() =", should_defer(), "in f"
        #         g(n-1)
        d = deferral()
        d.__enter__()
        try:
            print "should_defer() =", should_defer(), "for n =", n
            f(n-1)
        finally:
            d.__exit__(None, None, None)
f(10)
You're reinventing C++ expression templates, although since Python is dynamically typed you don't need templates. The crucial feature in C++ that lets it all work is that you can override the action for assignment.

a = b*c + d*e

If we could realize we were at the "equals" sign we could evaluate the RHS and assign it to a. This is not possible in Python; to make it possible would require slowing down regular assignment, which is perhaps a definition of bad.

a[...] = RHS

could be overridden, but it is ugly and 'naive' users will forget.

a := RHS

could be added to the language with the semantics that it tries to do a.__assignment__(RHS), but Guido told me no long ago (:->. Also, you might forget the : in :=.

a.assign(RHS)

would also work, but then the original statement would produce a strange object with surprising results.

David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
<pie-in-the-sky>
An idea that has popped up from time to time is delaying evaluation of complicated expressions so that the result can be computed more efficiently. For instance, the matrix expression:
a = b*c + d*e
results in the creation of two potentially large temporary matrices, and also runs a couple more loops at the C level than the equivalent expression implemented in C would.
The general idea has been to construct some sort of pseudo-object when the numerical operations are indicated, then do the actual <snip>
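To make Paul's point concrete: a[...] = rhs goes through __setitem__, which a class can intercept and use as the evaluation trigger, whereas a plain a = rhs only rebinds the name and is invisible to the object. The LazyArray name below is made up purely for illustration.

import numpy

class LazyArray(object):
    """Hypothetical wrapper whose slice-assignment is the evaluation hook."""
    def __init__(self, shape, dtype=float):
        self.data = numpy.empty(shape, dtype)
    def __setitem__(self, index, rhs):
        # 'rhs' could be a deferred expression object; here we just force it.
        self.data[index] = rhs          # evaluation happens at the assignment
    def __getitem__(self, index):
        return self.data[index]

a = LazyArray((5,))
b = numpy.arange(5.0)
c = numpy.arange(5.0)
a[...] = b*c + 2*b    # __setitem__ runs: the object sees the whole RHS
a = b*c + 2*b         # plain rebinding: no hook fires; 'a' is now just an ndarray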
Lazy evaluation has been part of many array languages since the early days of APL (which makes this idea almost 50 years old). I was entertaining an idea of bringing lazy evaluation to python myself and concluded that there are two places where it might fit:

1. At the level of the python optimizer: a*x + y, for example, can be translated into a call to axpy if a, x and y are known to be arrays. This approach quickly brings you to the optional static typing idea. <http://www.artima.com/weblogs/viewpost.jsp?thread=85551>

2. Overload arithmetic operators for ufunc objects. This will allow some form of tacit programming <http://elliscave.com/APL_J/IversonAPL.htm> and you would be able to write

   f = multiply + multiply
   f(x, y, z, t)

   and have it evaluated without temporaries.

Both of these ideas are from the pie-in-the-sky variety.

On 2/28/06, Paul F. Dubois <paul@pfdubois.com> wrote:
You're reinventing C++ expression templates, although since Python is dynamically typed you don't need templates. The crucial feature in C++ that lets it all work is that you can override the action for assignment.
a = b*c + d*e
If we could realize we were at the "equals" sign we could evaluate the RHS, and assign it to a.
This is not possible in Python; to make it possible would require slowing down regular assignment, which is perhaps a definition of bad.
a[...] = RHS
could be overridden but it is ugly and 'naive' users will forget.
a := RHS
could be added to the language with the semantics that it tries to do a.__assignment__(RHS) but Guido told me no long ago. (:->. Also, you might forget the : in :=.
a.assign(RHS)
would also work but then the original statement would produce a strange object with surprising results.
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
<pie-in-the-sky>
An idea that has popped up from time to time is delaying evaluation of complicated expressions so that the result can be computed more efficiently. For instance, the matrix expression:
a = b*c + d*e
results in the creation of two potentially large temporary matrices, and also runs a couple more loops at the C level than the equivalent expression implemented in C would.
The general idea has been to construct some sort of pseudo-object when the numerical operations are indicated, then do the actual <snip>
On 2/28/06, Paul F. Dubois <paul@pfdubois.com> wrote:
You're reinventing C++ expression templates, although since Python is
Yes indeedy, and although they might work well enough, they produce the most godawful looking assembly code I have ever looked at. The boost ublas template library takes this approach and I regard it more as a meta-compiler research project written in a template language than as an array library.

I think that there are two main users of arrays: those who want quick and convenient (optimize programmer time) and those who want super-fast execution (optimize cpu time). Because a human can generally do a better job and knows more about the intent than the average compiler, I think that the best bet is to provide the tools needed to write efficient code if the programmer so desires, but otherwise concentrate on convenience. When absolute speed is essential it is worth budgeting programmer time to achieve it, but generally I don't think that is the case.

Chuck
Charles R Harris wrote:
Yes indeedy, and although they might work well enough they produce the most godawful looking assembly code I have ever looked at. The boost ublas template library takes this approach and I regard it more as a meta-compiler research project written in a template language than as an array library. I think that there are two main users of arrays: those who want quick and convenient (optimize programmer time) and those who want super-fast execution (optimize cpu time). Because a human can generally do a better job and knows more about the intent than the average compiler, I think that the best bet is to provide the tools needed to write efficient code if the programmer so desires, but otherwise concentrate on convenience. When absolute speed is essential it is worth budgeting programmer time to achieve it, but generally I don't think that is the case.
I think this is ultimately why nothing has been done except to make it easier and easier to write compiled code that gets called from Python. I'm sure most have heard that ctypes will be added to Python 2.5. This will make it very easy to write a C function to do what you want and just call it from Python. Weave can still help with the "auto-compilation" of the specific library for your type. Ultimately such code will be faster than NumPy can ever be.

-Travis
Travis Oliphant wrote:
Weave can still help with the "auto-compilation" of the specific library for your type. Ultimately such code will be faster than NumPy can ever be.
Yes. weave.blitz() can be used to do the equivalent of this lazy evaluation for you in many cases without much effort. For example:

import weave
from scipy import arange

a = arange(1e7)
b = arange(1e7)
c = 2.0*a + 3.0*b

# or with weave
weave.blitz("c=2.0*a+3.0*b")

As Paul D. mentioned, what Tim outlined is essentially template expressions in C++. blitz++ (http://www.oonumerics.org/blitz/) is a C++ template expression library for array operations, and weave.blitz translates a Numeric expression into C++ blitz code. For the example above, you get about a factor of 4 speed up on large arrays. (Notice, the first time you run the example it will be much slower because of compile time. Use timings from subsequent runs.)

C:\temp>weave_time.py
Expression: c=2.0*a+3.0*b
Numeric: 0.678311899322
Weave: 0.162177084984
Speed-up: 4.18253848494

This isn't as good as you can do with hand coded C, but it isn't so bad for the effort involved...

I have wished for time to write a weave.f90("c=2.0*a+3.0*b") function because it is very feasible. My guess from simple experimentation is that it would be about as fast as hand coded C for this sort of expression. This might give us another factor of two or three in execution speed, and the compile times would come down from tens of seconds to tenths of seconds.

Incidentally, the weave calling overhead is large enough to limit its benefit on small arrays. Pat Miller pointed out some ways to get rid of that overhead, and I even wrote some experimental fixes to weave that helped out a lot. Alas, they never were completed fully. Revisiting these would make weave.blitz useful for small arrays as well. Fixing these is probably more work than writing blitz.f90().

All this to say, I think weave basically accomplishes what Tim wants with a different mechanism (letting C++ compilers do the optimization instead of writing this optimization at the python level). It does require a compiler on client machines in its current form (even that can be fixed...), but I think it might prove faster than re-implementing a numeric expression compiler at the python level (though that sounds fun as well).

see ya,
eric

###############################
# weave_time.py
###############################
import timeit

array_size = 1e7
iterations = 10

setup = """\
import weave
from scipy import arange
a=arange(%f)
b=arange(%f)
c=arange(%f) # needed by weave test
""" % (array_size, array_size, array_size)

expr = "c=2.0*a+3.0*b"
print "Expression:", expr

numeric_timer = timeit.Timer(expr, setup)
numeric_time = numeric_timer.timeit(number=iterations)
print "Numeric:", numeric_time/iterations

weave_timer = timeit.Timer('weave.blitz("%s")' % expr, setup)
weave_time = weave_timer.timeit(number=iterations)
print "Weave:", weave_time/iterations

print "Speed-up:", numeric_time/weave_time
eric jones <eric@enthought.com> writes:
Travis Oliphant wrote:
Weave can still help with the "auto-compilation" of the specific library for your type. Ultimately such code will be faster than NumPy can ever be.
Yes. weave.blitz() can be used to do the equivalent of this lazy evaluation for you in many cases without much effort. For example:
import weave
from scipy import arange
a = arange(1e7)
b = arange(1e7)
c = 2.0*a + 3.0*b
# or with weave
weave.blitz("c=2.0*a+3.0*b")
As Paul D. mentioned, what Tim outlined is essentially template expressions in C++. blitz++ (http://www.oonumerics.org/blitz/) is a C++ template expression library for array operations, and weave.blitz translates a Numeric expression into C++ blitz code. For the example above, you get about a factor of 4 speed up on large arrays. (Notice, the first time you run the example it will be much slower because of compile time. Use timings from subsequent runs.)
C:\temp>weave_time.py
Expression: c=2.0*a+3.0*b
Numeric: 0.678311899322
Weave: 0.162177084984
Speed-up: 4.18253848494
All this to say, I think weave basically accomplishes what Tim wants with a different mechanism (letting C++ compilers do the optimization instead of writing this optimization at the python level). It does require a compiler on client machines in its current form (even that can be fixed...), but I think it might prove faster than re-implementing a numeric expression compiler at the python level (though that sounds fun as well).
I couldn't leave it at that :-), so I wrote my bytecode idea up.

You can grab it at http://arbutus.mcmaster.ca/dmc/software/numexpr-0.1.tar.gz

Here are the performance numbers (note I use 1e6 elements instead of 1e7):

cookedm@arbutus$ py -c 'import numexpr.timing; numexpr.timing.compare()'
Expression: b*c+d*e
numpy: 0.0934900999069
Weave: 0.0459051132202
Speed-up of weave over numpy: 2.03659447388
numexpr: 0.0467489004135
Speed-up of numexpr over numpy: 1.99983527056

Expression: 2*a+3*b
numpy: 0.0784019947052
Weave: 0.0292909860611
Speed-up of weave over numpy: 2.67665945222
numexpr: 0.0323888063431
Speed-up of numexpr over numpy: 2.42065094572

Wow. On par with weave.blitz, and no C++ compiler!!! :-)

You use it like this:

from numexpr import numexpr

func = numexpr("2*a+3*b")
a = arange(1e6)
b = arange(1e6)
c = func(a, b)

Alternatively, you can use it like weave.blitz, like this:

from numexpr import evaluate

a = arange(1e6)
b = arange(1e6)
c = evaluate("2*a+3*b")

How it works
============

Well, first of all, it only works with the basic operations (+, -, *, and /), and only on real constants and arrays of doubles. (These restrictions are mainly for lack of time, of course.)

Given an expression, it first compiles it to a program written in a small bytecode. Here's what it looks like:

In [1]: from numexpr import numexpr
In [2]: numexpr("2*a+3*b", precompiled=True)
# precompiled=True just returns the program before it's made into a
# bytecode object, so you can see what it looks like
Out[2]:
[('mul_c', Register(0), Register(1, a), Constant(0, 2)),
 ('mul_c', Temporary(3), Register(2, b), Constant(1, 3)),
 ('add', Register(0), Register(0), Temporary(3))]
In [3]: c = numexpr("2*a+3*b")
In [4]: c.program
Out[4]: '\x08\x00\x01\x00\x08\x03\x02\x01\x02\x00\x00\x03'
In [5]: c.constants
Out[5]: array([ 2.,  3.])
In [6]: c.n_temps
Out[6]: 1
In [7]: type(c)
Out[7]: <type 'numexpr.NumExpr'>

The bytecode program is a string of 4-character groups, where the first character of each group is the opcode, the second is the "register" to store in, and the third and fourth are the register numbers to use as arguments. Inputs and outputs are assigned to registers (so in the example above, the output is Register(0), and the two inputs are Register(1) and Register(2)), and temporaries are in the remaining registers. The group of registers is actually an array of double*.

The secret behind the speed is that the bytecode program is run blocked: it's run repeatedly, operating on 128 elements of each input array at a time (then 8 elements, then 1 element for the leftover). This is a tradeoff between cache misses (the arguments for each pass through the program will most likely fit in the cache each time) and the overhead of the branching from evaluating the program. It's like doing this in numpy:

c[0:128] = 2*a[0:128] + 3*b[0:128]
c[128:256] = 2*a[128:256] + 3*b[128:256]
etc.

I could see this getting a nice speedup from using a CPU's vector instructions. For instance, it should be pretty easy to use liboil for the intermediate vector evaluations.

If people think it's useful enough, I can check it into scipy's sandbox.

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
David M. Cooke wrote:
I couldn't leave it at that :-), so I wrote my bytecode idea up.
You can grab it at http://arbutus.mcmaster.ca/dmc/software/numexpr-0.1.tar.gz
Awesome. Please do check it in to the sandbox. -Travis
On 02/03/2006, at 7:12 PM, David M. Cooke wrote:
I couldn't leave it at that :-), so I wrote my bytecode idea up.
You can grab it at http://arbutus.mcmaster.ca/dmc/software/numexpr-0.1.tar.gz
I've had a brief look at the code, and I think it's great! I particularly like the power available with the "numexpr.evaluate(string_expression)" syntax. I'd fully support your adding it to the scipy sandbox. I'm sure we could fill it out with more operators and data types quite fast.

Great work!

-- Ed
David M. Cooke wrote:
eric jones <eric@enthought.com> writes:
Travis Oliphant wrote:
Weave can still help with the "auto-compilation" of the specific library for your type. Ultimately such code will be faster than NumPy can ever be.
Yes. weave.blitz() can be used to do the equivalent of this lazy evaluation for you in many cases without much effort. For example:
import weave
from scipy import arange
a = arange(1e7)
b = arange(1e7)
c = 2.0*a + 3.0*b
# or with weave
weave.blitz("c=2.0*a+3.0*b")
As Paul D. mentioned, what Tim outlined is essentially template expressions in C++. blitz++ (http://www.oonumerics.org/blitz/) is a C++ template expression library for array operations, and weave.blitz translates a Numeric expression into C++ blitz code. For the example above, you get about a factor of 4 speed up on large arrays. (Notice, the first time you run the example it will be much slower because of compile time. Use timings from subsequent runs.)
C:\temp>weave_time.py
Expression: c=2.0*a+3.0*b
Numeric: 0.678311899322
Weave: 0.162177084984
Speed-up: 4.18253848494
All this to say, I think weave basically accomplishes what Tim wants with a different mechanism (letting C++ compilers do the optimization instead of writing this optimization at the python level). It does require a compiler on client machines in its current form (even that can be fixed...), but I think it might prove faster than re-implementing a numeric expression compiler at the python level (though that sounds fun as well).
I couldn't leave it at that :-), so I wrote my bytecode idea up.
You can grab it at http://arbutus.mcmaster.ca/dmc/software/numexpr-0.1.tar.gz
Here are the performance numbers (note I use 1e6 elements instead of 1e7)
cookedm@arbutus$ py -c 'import numexpr.timing; numexpr.timing.compare()'
Expression: b*c+d*e
numpy: 0.0934900999069
Weave: 0.0459051132202
Speed-up of weave over numpy: 2.03659447388
numexpr: 0.0467489004135
Speed-up of numexpr over numpy: 1.99983527056
Expression: 2*a+3*b
numpy: 0.0784019947052
Weave: 0.0292909860611
Speed-up of weave over numpy: 2.67665945222
numexpr: 0.0323888063431
Speed-up of numexpr over numpy: 2.42065094572
Wow. On par with weave.blitz, and no C++ compiler!!! :-)
That's awesome! I was also tempted by this, but I never got beyond prototyping some stuff in Python.
You use it like this:
from numexpr import numexpr
func = numexpr("2*a+3*b")
a = arange(1e6)
b = arange(1e6)
c = func(a, b)
Does this just use the order that variables are initially used in the expression to determine the input order? I'm not sure I like that. It is very convenient for simple expressions though.
Alternatively, you can use it like weave.blitz, like this:
from numexpr import evaluate
a = arange(1e6)
b = arange(1e6)
c = evaluate("2*a+3*b")
That's pretty sweet. And I see that you cache the expressions, so it should be pretty fast if you need to loop. [snip details]
If people think it's useful enough, I can check it into scipy's sandbox.
Definitely check it in. It won't compile here with VC7, but I'll see if I can figure out why.

This is probably thinking too far ahead, but an interesting thing to try would be adding conditional expressions:

c = evaluate("(2*a + b) if (a > b) else (2*b + a)")

If that could be made to work, and work fast, it would save both memory and time in those cases where you have to vary the computation based on the value. At present I end up computing the full arrays for each case and then choosing which result to use based on a mask, so it takes three times as much space as doing it element by element.

-tim
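For reference, the full-array workaround described above looks something like the plain-numpy snippet below (array contents chosen arbitrarily); both branches, the mask, and the result are all materialized at full size, which is where the roughly threefold space cost comes from.

import numpy

a = numpy.arange(1e6)
b = numpy.arange(1e6)[::-1]

branch1 = 2*a + b                            # full-size temporary
branch2 = 2*b + a                            # full-size temporary
c = numpy.where(a > b, branch1, branch2)     # full-size mask and result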
Tim Hochberg <tim.hochberg@cox.net> writes:
David M. Cooke wrote:
Wow. On par with weave.blitz, and no C++ compiler!!! :-)
That's awesome! I was also tempted by this, but I never got beyond prototyping some stuff in Python.
You use it like this:
from numexpr import numexpr
func = numexpr("2*a+3*b")
a = arange(1e6)
b = arange(1e6)
c = func(a, b)
Does this just use the order that variables are initially used in the expression to determine the input order? I'm not sure I like that. It is very convenient for simple expressions though.
By default, it does lexical order (whatever .sort() on the list of variable names gives). You can specify the order with

func = numexpr("2*a+3*b", input_order=('a', 'b'))
Alternatively, you can use it like weave.blitz, like this:
from numexpr import evaluate
a = arange(1e6)
b = arange(1e6)
c = evaluate("2*a+3*b")
That's pretty sweet. And I see that you cache the expressions, so it should be pretty fast if you need to loop.
[snip details]
If people think it's useful enough, I can check it into scipy's sandbox.
Definitely check it in. It won't compile here with VC7, but I'll see if I can figure out why.
It's in. Check the setup.py -- I left some gcc'isms in there (-funroll-loops). That should be conditional on the compiler.
This is probably thinking too far ahead, but an interesting thing to try would be adding conditional expressions:
c = evaluate("(2*a + b) if (a > b) else (2*b + a)")
If that could be made to work, and work fast, it would save both memory and time in those cases where you have to vary the computation based on the value. At present I end up computing the full arrays for each case and then choosing which result to use based on a mask, so it takes three times as much space as doing it element by element.
It's on my list of things to think about, along with implementing reductions (sum, in particular). It'd probably look more like

c = evaluate("where(a > b, 2*a+b, 2*b+a)")

because of the vector nature of the code. Doing the "if" elementwise would mean the bytecode program would have to be run for each element, instead of on a vector of some fixed length.

I wrote up a bit on my blog at http://isobaric.blogspot.com/ (finally, something to blog about :-)

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
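Running the where() form over blocks is what keeps the memory cost down: the branch temporaries only ever need to be one block long. The sketch below is plain numpy, purely an illustration of the idea (block size and names are arbitrary), not numexpr's actual machinery.

import numpy

def blocked_where_expr(a, b, block=128):
    # Result of where(a > b, 2*a+b, 2*b+a), with block-sized temporaries only.
    out = numpy.empty_like(a)
    for start in range(0, len(a), block):
        s = slice(start, min(start + block, len(a)))
        out[s] = numpy.where(a[s] > b[s], 2*a[s] + b[s], 2*b[s] + a[s])
    return out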
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
David M. Cooke wrote:
It's in. Check the setup.py -- I left some gcc'isms in there (-funroll-loops). That should be conditional on the compiler.
It does whine about that, but it ignores it, so I did too ;). The misplaced declaration of dims I mentioned earlier was the real culprit.
This is probably thinking too far ahead, but an interesting thing to try would be adding conditional expressions:
c = evaluate("(2*a + b) if (a > b) else (2*b + a)")
If that could be made to work, and work fast, it would save both memory and time in those cases where you have to vary the computation based on the value. At present I end up computing the full arrays for each case and then choosing which result to use based on a mask, so it takes three times as much space as doing it element by element.
It's on my list of things to think about, along with implementing reductions (sum, in particular). It'd probably look more like
c = evaluate("where(a > b, 2*a+b, 2*b+a)")
because of the vector nature of the code. Doing the "if" elementwise would mean the bytecode program would have to be run for each element, instead of on a vector of some fixed length.
That makes sense. One thought I had with respect to the various numpy functions (sin, cos, pow, etc) was to just have the bytecodes:

call_unary_function, function_id, store_in, source
call_binary_function, function_id, store_in, source1, source2
call_trinary_function, function_id, store_in, source1, source2, source3

Then just store pointers to the functions in relevant tables. In its most straightforward form, you'd need 6-character chunks of bytecode instead of four. However, if that turns out to slow everything else down, I think it could be packed down to 4 again. The function_ids could probably be packed into the opcode (as long as we stay below 200 or so functions, which is probably safe); the other way to pack things down is to require that one of the sources for trinary functions is always a certain register (say register-0). That requires a bit more cleverness at the compiler level, but is probably feasible.

OT, stupid subversion question: where do I find the sandbox or wherever this got checked in?

-tim
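A toy model of this table-based dispatch, in Python rather than C and with made-up opcode names, might look like the following; only the shape of the interpreter loop is the point, not the details.

import numpy

FUNCTION_TABLE = [numpy.sin, numpy.cos, numpy.arctan2]   # indexed by function_id

def run(program, registers):
    # program: list of instruction tuples; registers: equal-length 1-d arrays.
    for instr in program:
        op = instr[0]
        if op == 'add':
            _, dest, s1, s2 = instr
            numpy.add(registers[s1], registers[s2], registers[dest])
        elif op == 'mul':
            _, dest, s1, s2 = instr
            numpy.multiply(registers[s1], registers[s2], registers[dest])
        elif op == 'call_unary':
            _, func_id, dest, src = instr
            FUNCTION_TABLE[func_id](registers[src], registers[dest])
        elif op == 'call_binary':
            _, func_id, dest, s1, s2 = instr
            FUNCTION_TABLE[func_id](registers[s1], registers[s2], registers[dest])

# Evaluate sin(a)*b + a with registers [out, a, b, temp]:
a = numpy.linspace(0.0, 1.0, 8)
b = numpy.linspace(1.0, 2.0, 8)
regs = [numpy.empty(8), a, b, numpy.empty(8)]
run([('call_unary', 0, 3, 1),    # temp = sin(a)   (function_id 0 -> sin)
     ('mul', 0, 3, 2),           # out  = temp * b
     ('add', 0, 0, 1)],          # out  = out + a
    regs)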
Tim Hochberg wrote:
OT, stupid subversion question: where do I find the sandbox or wherever this got checked in?
http://svn.scipy.org/svn/scipy/trunk/Lib/sandbox/numexpr/

Our sandbox is actually scipy.sandbox. Uncomment the appropriate line in Lib/sandbox/setup.py to build.

-- Robert Kern <robert.kern@gmail.com>
"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die." -- Richard Harter
Travis et al.,

Jeff Whitaker found that the imaged_masked.py example (with the colorbar line uncommented) in matplotlib 0.87 triggered a numpy bug--the script works normally with Numeric and numarray. He committed a workaround to mpl svn, but I suspect you may want to track it down and squash it in numpy before the next release. It is not clear to me whether it is really in ma, or merely revealed by ma. The attached script triggers the bug.

Thanks.

Eric
I would say it is an ma bug, but don't know how to fix it properly without changing numpy scalar arithmetic to call an enhanced version of __array__ and pass context in. The core problem can be demonstrated by the following session:
from numpy import *
x = ma.array([1], mask=[1])
int_(1)*x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File ".../numpy/core/ma.py", line 622, in __array__
    raise MAError, \
numpy.core.ma.MAError: Cannot automatically convert masked array to numeric because data is masked in one or more locations.
Note that x*int_(1) works as expected. This is so because python dispatches multiplication to int.__mul__ rather than ma.array.__mul__ if int_(1) is the first multiplicand. I've fixed a similar problem for the array*ma.array case, and array(1)*x works in the current version of numpy.

I will not have time to work on this before the weekend, so if someone is motivated enough to fix this bug before the upcoming release, please take a look at http://projects.scipy.org/scipy/numpy/wiki/MaskedArray ("Ufuncs and Masked Arrays"). It should be straightforward to generalize that approach for array scalars.

On 3/9/06, Eric Firing <efiring@hawaii.edu> wrote:
Travis et al.,
Jeff Whitaker found that the imaged_masked.py example (with the colorbar line uncommented) in matplotlib 0.87 triggered a numpy bug--the script works normally with Numeric and numarray. He committed a workaround to mpl svn, but I suspect you may want to track it down and squash it in numpy before the next release. It is not clear to me whether it is really in ma, or merely revealed by ma. The attached script triggers the bug.
Thanks.
Eric
Sasha wrote:
I would say it is an ma bug, but don't know how to fix it properly without changing numpy scalar arithmetic to call an enhanced version of __array__ and pass context in.
I think it was an array scalar bug. I've fixed it by always calling the ufunc (which handles conversions to and from other objects better than the array scalars were doing in the generic arithmetic code). The result is shorter code and symmetric handling of the <array_scalar>*<masked array> vs. <masked array>*<array_scalar> case.

I also changed the error to a warning for backward compatibility. Calling __array__() on a masked array should succeed and we will assume the user knows what they are doing.

-Travis
It was a design decision: if an automatic conversion of a masked array to a numeric one is attempted, and there actually are masked values, it is logical to either:

a. raise an exception, because there is no equivalent numeric array, or;
b. fill the masked value with the fill value

(b) suffers from a problem: the default fill value might not be appropriate for the use that is about to occur. My original conclusion was that it was better to choose (a) to force the user to explicitly do the conversion with x.fill(value) and pass THAT to whatever was going to consume it. I don't think I've changed my mind.

On 10 Mar 2006 02:32:19 -0800, Travis Oliphant <oliphant.travis@ieee.org> wrote:
Sasha wrote:
I would say it is an ma bug, but don't know how to fix it properly without changing numpy scalar arithmetic to call an enhanced version of __array__ and pass context in.
I think it was an array scalar bug. I've fixed it by always calling the ufunc (which handles conversions to and from other objects better than the array scalars were doing in the generic arithmetic code).
The result is shorter code and symmetric handling of the <array_scalar>*<masked array> vs. <masked array> * <array_scalar> case.
I also changed the error to a warning for backward compatibility. Calling __array__() on a masked array should succeed and we will assume the user knows what they are doing.
-Travis
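As a small aside on Paul's option (a): the explicit conversion he advocates is spelled filled() in numpy.ma; the values below are arbitrary and only for illustration.

import numpy

x = numpy.ma.array([1.0, 2.0, 3.0], mask=[True, False, False])

numpy.ma.filled(x, 0.0)   # -> array([0., 2., 3.]): masked slot gets the chosen value
x.filled(-999.0)          # same idea, with a fill value picked for this particular use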
Tim Hochberg <tim.hochberg@cox.net> writes:
That makes sense. One thought I had with respect to the various numpy functions (sin, cos, pow, etc) was to just have the bytecodes:
call_unary_function, function_id, store_in, source
call_binary_function, function_id, store_in, source1, source2
call_trinary_function, function_id, store_in, source1, source2, source3
Then just store pointers to the functions in relevant tables. In its most straightforward form, you'd need 6-character chunks of bytecode instead of four. However, if that turns out to slow everything else down, I think it could be packed down to 4 again. The function_ids could probably be packed into the opcode (as long as we stay below 200 or so functions, which is probably safe); the other way to pack things down is to require that one of the sources for trinary functions is always a certain register (say register-0). That requires a bit more cleverness at the compiler level, but is probably feasible.
That's along the lines I'm thinking of. It seems to me that if evaluating the function requires a function call (and not an inlined machine instruction like the basic ops), then we may as well dispatch like this (plus, it's easier :). This could also allow for user extensions.

Binary and trinary (how many of those do we have??) could maybe be handled by storing the extra arguments in a separate array.

I'm going to look at adding more smarts to the compiler, too. Got a couple books on them :-)

Different data types could be handled by separate input arrays, and a conversion opcode ('int2float', say).

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
That makes sense. One thought I had with respect to the various numpy functions (sin, cos, pow, etc) was to just have the bytecodes:
call_unary_function, function_id, store_in, source
call_binary_function, function_id, store_in, source1, source2
call_trinary_function, function_id, store_in, source1, source2, source3
Then just store pointers to the functions in relevant tables. In its most straightforward form, you'd need 6-character chunks of bytecode instead of four. However, if that turns out to slow everything else down, I think it could be packed down to 4 again. The function_ids could probably be packed into the opcode (as long as we stay below 200 or so functions, which is probably safe); the other way to pack things down is to require that one of the sources for trinary functions is always a certain register (say register-0). That requires a bit more cleverness at the compiler level, but is probably feasible.
That's along the lines I'm thinking of. It seems to me that if evaluating the function requires a function call (and not an inlined machine instruction like the basic ops), then we may as well dispatch like this (plus, it's easier :). This could also allow for user extensions.
Hmm. If we are going to allow user extensions, we should think about this now. I was thinking of just putting all of the functions in a big table and doing a lookup based on bytecode. I'm not sure how this should work if we are allowing user extensions though.
Binary and trinary (how many of those do we have??)
I'm sure there are more that I'm not thinking of, but the main function that I am concerned with is arctan2. For trinary, the main one is which.
could maybe be handled by storing the extra arguments in a separate array.
If there really are only a few functions, we could give them their own opcodes. That means binary functions still fit in 4 characters without a problem. That is:

OP_ATAN2 R0, R1, R2

would compute:

R0[:] = arctan2(R1, R2)

Trinary could be defined as using their input as one of the arguments. This would sometimes require an extra copy and some cleverness in the compiler, but I think it would be reasonable. Thus

OP_WHICH R0, R1, R2

would compute:

R0[:] = which(R0, R1, R2)

Since the first argument of 'which' is often thrown away anyway, for instance 'which(a < 1, b, c)', the extra copy could often be avoided if we're smart. This might be a bad idea if there turns out to be several more trinary functions that I'm not thinking of, or if there are quaternary functions we want to support. (I can't think of any though.)
I'm going to look at adding more smarts to the compiler, too. Got a couple books on them :-)
Do you have any recommendations?
Different data types could be handled by separate input arrays, and a conversion opcode ('int2float', say).
That'd be cool. At some point I want to handle complex, but I'm going to wait till the main issues with floats shake themselves out. -tim
Tim Hochberg wrote:
If there really are only a few functions, we could give them their own opcodes. That means binary functions still fit in 4 characters without a problem. That is:
OP_ATAN2 R0, R1, R2
would compute: R0[:] = arctan2(R1, R2)
Trinary could be defined as using their input as one of the arguments. This would sometimes require an extra copy and some cleverness in the compiler, but I think it would be reasonable. Thus
OP_WHICH R0, R1, R2
would compute: R0[:] = which(R0, R1, R2)
since the first argument of 'which' is often thrown away anyway, for instance 'which (a < 1, b, c)', the extra copy could often be avoided if we're smart. This might be a bad idea if there turns out to be several more trinary functions that I'm not thinking of or if there are quaternary functions we want to support. (I can't think of any though).
Then again, it might be better to just do a simple, general solution to this case and pass in the arguments in an array as you say. The above might be pointless overengineering. -tim
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
That makes sense. One thought I had with respect to the various numpy functions (sin, cos, pow, etc) was to just have the bytecodes:
call_unary_function, function_id, store_in, source
call_binary_function, function_id, store_in, source1, source2
call_trinary_function, function_id, store_in, source1, source2, source3
Then just store pointers to the functions in relevant tables. In its most straightforward form, you'd need 6-character chunks of bytecode instead of four. However, if that turns out to slow everything else down, I think it could be packed down to 4 again. The function_ids could probably be packed into the opcode (as long as we stay below 200 or so functions, which is probably safe); the other way to pack things down is to require that one of the sources for trinary functions is always a certain register (say register-0). That requires a bit more cleverness at the compiler level, but is probably feasible.
That's along the lines I'm thinking of. It seems to me that if evaluating the function requires a function call (and not an inlined machine instruction like the basic ops), then we may as well dispatch like this (plus, it's easier :). This could also allow for user extensions. Binary and trinary (how many of those do we have??) could maybe be handled by storing the extra arguments in a separate array.
We may want to reconsider this at least partially. I tried implementing a few 1-argument functions two ways. First as a table lookup and second using a dedicated opcode. The first gave me a speedup of almost 60%, but the latter gave me a speedup of 100%. The difference surprised me, but I suspect it's due to the fact that the x86 supports some functions directly, so the function call gets optimized away for sin and cos just as it does for +-*/. That implies that some functions should get their own opcodes, while others are not worthy. This little quote from wikipedia lists the functions we would want to give their own opcodes to:

    x86 (since the 80486DX processor) assembly language includes a stack based floating point unit which can perform addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions which can load or store a value from memory in any of the following formats: Binary coded decimal, 32-bit integer, 64-bit integer, 32-bit floating point, 64-bit floating point or 80-bit floating point (upon loading, the value is converted to the currently floating point mode). The x86 also includes a number of transcendental functions including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or e.

So, that's my new proposal: some functions get their own opcodes (sin, cos, ln, log10, etc), while others get shunted to table lookup (not sure what's in that list yet, but I'm sure there's lots).

-tim
I'm going to look at adding more smarts to the compiler, too. Got a couple books on them :-)
Different data types could be handled by separate input arrays, and a conversion opcode ('int2float', say).
Tim Hochberg wrote:
We may want to reconsider this at least partially. I tried implementing a few 1-argument functions two ways. First as a table lookup and second using a dedicated opcode. The first gave me a speedup of almost 60%, but the latter gave me a speedup of 100%. The difference surprised me, but I suspect it's due to the fact that the x86 supports some functions directly, so the function call gets optimized away for sin and cos just as it does for +-*/. That implies that some functions should get their own opcodes, while others are not worthy. This little quote from wikipedia lists the functions we would want to give their own opcodes to:
x86 (since the 80486DX processor) assembly language includes a stack based floating point unit which can perform addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions which can load or store a value from memory in any of the following formats: Binary coded decimal, 32-bit integer, 64-bit integer, 32-bit floating point, 64-bit floating point or 80-bit floating point (upon loading, the value is converted to the currently floating point mode). The x86 also includes a number of transcendental functions including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or e
So, that's my new proposal: some functions get their own opcodes (sin, cos, ln, log10, etc), while others get shunted to table lookup (not sure what's in that list yet, but I'm sure there's lots).
For the same reason, I think these same functions should get their own ufunc loops instead of using the default loop with function pointers. Thanks for providing this link. It's a useful list. -Travis
Tim Hochberg <tim.hochberg@cox.net> writes:
We may want to reconsider this at least partially. I tried implementing a few 1-argument functions two ways. First as a table lookup and second using a dedicated opcode. The first gave me a speedup of almost 60%, but the latter gave me a speedup of 100%. The difference surprised me, but I suspect it's due to the fact that the x86 supports some functions directly, so the function call gets optimized away for sin and cos just as it does for +-*/. That implies that some functions should get their own opcodes, while others are not worthy.
Yeah, I was going to stare at the assembly output from gcc before deciding.

Note that some functions that are supported directly by the CPU may still be a function call because they aren't good enough.

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca (David M. Cooke) writes:
Tim Hochberg <tim.hochberg@cox.net> writes:
We may want to reconsider this at least partially. I tried implementing a few 1-argument functions two ways. First as a table lookup and second using a dedicated opcode. The first gave me a speedup of almost 60%, but the latter gave me a speedup of 100%. The difference surprised me, but I suspect it's due to the fact that the x86 supports some functions directly, so the function call gets optimized away for sin and cos just as it does for +-*/. That implies that some functions should get their own opcodes, while others are not worthy.
Yeah, I was going to stare at the assembly output from gcc before deciding.
Note that some functions that are supported directly by the CPU may still be a function call because they aren't good enough.
... and a quick test shows me that with gcc 4 under linux, only sqrt is inlined (and it does a test for NaN (I think) and calls the C library sqrt on that). On PowerPC under OS X, even sqrt is a function call.

-- David M. Cooke <cookedm@physics.mcmaster.ca> http://arbutus.physics.mcmaster.ca/dmc/
David M. Cooke wrote:
cookedm@physics.mcmaster.ca (David M. Cooke) writes:
Yeah, I was going to stare at the assembly output from gcc before deciding.
Note that some functions that are supported directly by the CPU may still be a function call because they aren't good enough.
... and a quick test shows me that with gcc 4 under linux, only sqrt is inlined (and it does a test for NaN (I think) and calls the C library sqrt on that). On PowerPC under OS X, even sqrt is a function call.
There were some posts earlier showing that numarray had faster sin(x) functions on an Intel platform (with a particular compiler that I'm not sure was specified). I'm pretty sure the difference was the function call versus the inline sin computation that the compiler used. I bet it depends on the compiler.

But you bring up a good point that the on-chip functions might not be "math-lib" worthy according to different compilers. It would be nice if there were still some way to use them, letting the user decide if they were "good enough".

-Travis
FYI, in my local copy of numexpr I added a few unary functions just to see how it would be. Pretty easy really. I think binary functions will also be easy. Trinary functions (e.g., where) will be a bit harder unless we move to 5-byte opcodes.

At present I'm not doing anything fancy like looking up functions in a table; I'm just giving each its own opcode. If the code gets big enough that it starts falling out of the cache, we might consider that, but adding opcodes is simpler. I also fixed a couple of places where constant folding seemed to be not completely working. Once I figure out where the sandbox is I'll upload the changes....

Expression: 2*a+(cos(3)+5)*sin(b)
numpy: 0.224821311088
numexpr: 0.112924547673
Speed-up of numexpr over numpy: 1.99089848683

-tim
David M. Cooke wrote:
I couldn't leave it at that :-), so I wrote my bytecode idea up.
I like what you've done. It makes me wonder what could be done with just a Python-level solution (that might be easier to generalize to more data-types and operators). You've definitely given us all some great stuff to play with. -Travis
Travis Oliphant wrote:
David M. Cooke wrote:
I couldn't leave it at that :-), so I wrote my bytecode idea up.
I like what you've done. It makes me wonder what could be done with just a Python-level solution (that might be easier to generalize to more data-types and operators).
I tried hand coding stuff at the pure Python level to see how close I could come to David's results. I performed the operations in chunks (larger chunks than he did, so that Python overhead didn't kill me). For the second case he presents (2*a+3*b), I managed to get a 70% speedup. While not negligible, it's a far cry from the 180% speedup I saw with David's code for the same result. So while doing things at the Python level would certainly be easier, the payoff isn't likely to be nearly as big.
You've definitely given us all some great stuff to play with.
Indeed.
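The chunked pure-Python approach mentioned above presumably looked something along these lines; this is a reconstruction for illustration only (not the actual script), with the same shape as the c[0:128] = 2*a[0:128] + 3*b[0:128] loop David spelled out, just with blocks large enough that the per-iteration Python overhead is amortized.

import numpy

def chunked_2a_plus_3b(a, b, chunk=65536):
    c = numpy.empty_like(a)
    scratch = numpy.empty(chunk, dtype=a.dtype)
    for start in range(0, len(a), chunk):
        stop = min(start + chunk, len(a))
        t = scratch[:stop - start]
        numpy.multiply(a[start:stop], 2, c[start:stop])   # c_chunk  = 2*a_chunk
        numpy.multiply(b[start:stop], 3, t)               # scratch  = 3*b_chunk
        c[start:stop] += t                                # c_chunk += scratch
    return c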
Tim Hochberg wrote:
Travis Oliphant wrote:
David M. Cooke wrote:
I couldn't leave it at that :-), so I wrote my bytecode idea up.
I like what you've done. It makes me wonder what could be done with just a Python-level solution (that might be easier to generalize to more data-types and operators).
I tried hand coding stuff at the pure Python level to see how close I could come to David's results. I performed the operations in chunks (larger chunks than he did, so that Python overhead didn't kill me). For the second case he presents (2*a+3*b), I managed to get a 70% speedup. While not negligible, it's a far cry from the 180% speedup I saw with David's code for the same result. So while doing things at the Python level would certainly be easier, the payoff isn't likely to be nearly as big.
That's good information. Still, it might be useful for more general operators and data-types to have something that does essentially this in Python only. Then, if you have a simpler expression with a few select data-types, you could have the big speed-up with something more specific.

-Travis
eric jones wrote:
All this to say, I think weave basically accomplishes what Tim wants with a different mechanism (letting C++ compilers do the optimization instead of writing this optimization at the python level). It does require a compiler on client machines in its current form (even that can be fixed...)
Perhaps by teaching Psyco about nd-arrays?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT
(206) 526-6959 voice / (206) 526-6329 fax / (206) 526-6317 main reception
7600 Sand Point Way NE, Seattle, WA 98115
Chris.Barker@noaa.gov
Christopher Barker wrote:
eric jones wrote:
All this to say, I think weave basically accomplishes what Tim wants with a different mechanism (letting C++ compilers do the optimization instead of writing this optimization at the python level). It does require a compiler on client machines in its current form (even that can be fixed...)
Perhaps by teaching Psyco about nd-arrays?
Psyco is unlikely to ever be much good for floating point stuff. This is because it only knows about longs/pointers and arrays of longs/pointers. The floating point support it has at present involves breaking a double into two long-sized chunks; then whenever you need to manipulate the float you have to find the pieces and put them back together[1]. All of this means that float support is much poorer than integer support in Psyco.

In theory this could be fixed, but since Armin's no longer actively working on Psyco, just maintaining it, I don't see this changing. Perhaps when Psyco-like technology gets incorporated into PyPy it will include better support for floating point.

Regards,

-tim

[1] This has all gotten rather fuzzy to me now; it's been a while since I looked at it and my understanding was at best fragile anyway.
participants (13)

- Charles R Harris
- Christopher Barker
- cookedm@physics.mcmaster.ca
- Ed Schofield
- Eric Firing
- eric jones
- Paul Dubois
- Paul F. Dubois
- Robert Kern
- Sasha
- Tim Hochberg
- Travis Oliphant
- Travis Oliphant