1. David already mentioned opcodes to copy from an integer array into a float register. Another idea would be to have an opcode to copy from a strided array into a register. This would save a copy for noncontiguous matrices and open the door to broadcasting. I think we'd have to tweak the main interpreter loop a bit, but it's doable. Depending on how crunched we feel for opcode space, the casting opcode and this one could overlap (OP_CAST_AND_COPY, for instance).

2. Something I noticed while writing tests is that the rearrangement of operations for 'a/const' to '(1/const)*a' changes the results slightly (try const=3). I don't think I care -- I just thought I'd point it out.

3. It may simplify things to use copy_c to eliminate a bunch of the extra bytecodes for operating on functions of more than one argument. I need to check the timings on this -- I don't know how much of a speed hit using copy_c will cause.

4. At some point I plan to add complex operations. My original thought was to just duplicate the current, floating-point interpreter for complex numbers. However, if we have opcodes for casting, it might instead make sense to have one interpreter that works with both. This would be more efficient for mixed expressions, since we wouldn't have to cast everything to complex right away. It looks like there's enough opcode space, although I worry a bit about whether making that main switch loop huge is going to slow things down. Probably not, but it's something to keep an eye on.

5. Same thoughts as 4, but for integer operations. Potentially the interpreter could work with longs, doubles and cdoubles, downcasting as appropriate on the way out.

It's looking like numexpr could come pretty close to evaluating most numpy expressions if we put enough work into it. None of the required changes look like they should slow down the original, simple, fast case. However, we should keep an eye on that as we go along. It's all very cool; I'd like to thank David for providing such a fun toy.

-tim
Tim Hochberg <tim.hochberg@cox.net> writes:
1. David already mentioned opcodes to copy from an integer array into a float register. Another idea would be to have an opcode to copy from a strided array into a register. This would save a copy for noncontiguous matrices and open the door to broadcasting. I think we'd have to tweak the main interpreter loop a bit, but it's doable. Depending on how crunched we feel for opcode space, the casting opcode and this one could overlap (OP_CAST_AND_COPY, for instance).
Hadn't thought about doing strided arrays like that; it should work.
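A minimal sketch of what such a strided-copy opcode might do, in numpy-flavored Python rather than the interpreter's C. The 128-element block size and the register idea are taken from this thread; the function name and layout are hypothetical:

    import numpy as np

    BLOCK = 128  # the interpreter's vector-register length

    def op_copy_strided(reg, src, start):
        # Gather one block from a possibly non-contiguous array into a
        # contiguous register buffer; numpy does the strided walk for us.
        n = min(BLOCK, src.shape[0] - start)
        reg[:n] = src[start:start + n]
        return n

    a = np.arange(20.0).reshape(4, 5)[:, 1]   # non-contiguous column view
    reg = np.empty(BLOCK)
    n = op_copy_strided(reg, a, 0)
    print(reg[:n])                            # [ 1.  6. 11. 16.]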
2. Something I noticed while writing tests is that the rearrangement of operations for 'a/const' to '(1/const)*a' changes the results slightly (try const=3). I don't think I care -- I just thought I'd point it out.
I don't care either :-) Although it may be worthwile throwing in a compiler option to keep a/const.
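The effect is easy to see with numpy directly; this is a sketch of the rounding behavior, not numexpr code. Dividing rounds once per element, while the rewritten form rounds 1/const first and then rounds the product, so the two can differ in the last bit:

    import numpy as np

    a = np.arange(1.0, 100.0)
    exact = a / 3.0           # one rounding per element
    fast = a * (1.0 / 3.0)    # 1/3 is rounded once, the product rounded again
    diff = exact != fast
    print(a[diff][:5])                   # several inputs where the two differ
    print(np.max(np.abs(exact - fast)))  # the difference is a single ulp or so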
3. It may simplify things to use copy_c to eliminate a bunch of the extra bytecodes for operating on functions of more than one argument. I need to check the timings on this -- I don't know how much of a speed hit using copy_c will cause.
I'm implementing another solution: I'm getting rid of individual constant operators, like 'add_c'. Constants will be stored in the temps array, as 128 repetitions of the constant. That way 'add' (for instance) can operate on both vector + vector and vector + scalar: it would take the register numbers for the two vectors for the first one, and for the second case, the scalar argument would be the register number for the repeated constant. Timings on my iBook show this is actually slightly faster. I didn't expect that :-)
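A sketch of the idea in numpy-flavored Python; the register numbers and names are made up for illustration, and the real machine works on C buffers:

    import numpy as np

    BLOCK = 128  # vector-register length

    registers = [np.empty(BLOCK) for _ in range(4)]
    registers[2][:] = 3.0                 # 'constant register' holding 3.0

    def op_add(dest, r1, r2):
        # One generic add covers vector+vector and vector+scalar alike:
        # the scalar case just names the constant-filled register.
        np.add(registers[r1], registers[r2], out=registers[dest])

    registers[1][:] = np.arange(BLOCK)    # pretend this block came from 'a'
    op_add(0, 1, 2)                       # a + 3.0 via the generic opcode
    print(registers[0][:4])               # [3. 4. 5. 6.]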
4. At some point I plan to add complex operations. My original thought was to just duplicate the current, floating-point interpreter for complex numbers. However, if we have opcodes for casting, it might instead make sense to have one interpreter that works with both. This would be more efficient for mixed expressions, since we wouldn't have to cast everything to complex right away. It looks like there's enough opcode space, although I worry a bit about whether making that main switch loop huge is going to slow things down. Probably not, but it's something to keep an eye on.
The way I see it, there should be two machines: one for single precision, and one for double precision (maybe one for longdoubles, but I'm not worrying about that now). That's simple enough. Each machine should work with floats, and the complex form of that float. Potentially, cast opcodes could be used to convert single to double, for instance. I'm still thinking about how to fit that.
5. Same thoughts as 4., but for integer operations. Potentially the interpreter could work with longs, doubles and cdoubles, downcasting as appropriate on the way out.
I suppose this could be considered a "reduction" operation: instead of collapsing the array, we're collapsing the precision.
It's looking like numexpr could come pretty close to evaluating most numpy expressions if we put enough work into it. None of the required changes look like they should slow down the original, simple, fast case. However, we should keep an eye on that as we go along. It's all very cool; I'd like to thank David for providing such a fun toy.
Thanks. Right now, here are my thoughts on where to go:

1. I'm rewriting the compiler to be easier to play with. Top things on the list are better register allocation and common subexpression elimination.

2. Replace individual constants with "constant arrays": repetitions of the constant in the vector registers.

3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.

4. Add complex numbers. If it doesn't look really messy, we could add them to the current machine. Otherwise, we could make a machine that works on complex numbers only.

5. Currently, we use a big switch statement. There are better ways (taken from Forth): indirect and direct threading. Unfortunately, it looks like the easy way to do these uses GCC's ability to take the address of local labels. I'll add that if I can refactor the machine enough so that both variants can be produced. Have a look at http://www.complang.tuwien.ac.at/anton/vmgen/ which is the virtual machine generator used for gforth (but applicable to other things). I may use this.

6. Replace register numbers in the program with actual pointers to the correct register. That should remove a layer of indirection.

-- David M. Cooke <cookedm@physics.mcmaster.ca>
David M. Cooke wrote:
5. Currently, we use a big switch statement. There are better ways (taken from Forth): indirect and direct threading. Unfortunately, it looks like the easy way to do these uses GCC's ability to take the address of local labels. I'll add that if I can refactor the machine enough so that both variants can be produced. Have a look at http://www.complang.tuwien.ac.at/anton/vmgen/ which is the virtual machine generator used for gforth (but applicable to other things). I may use this.
Hmmm. If LLVM weren't so huge and such a pain to install, I might recommend looking at using it. It could make a fun experiment, though.

-- Robert Kern <robert.kern@gmail.com>
Robert Kern <robert.kern@gmail.com> writes:
David M. Cooke wrote:
5. Currently, we use a big switch statement. There are better ways (taken from Forth): indirect and direct threading. Unfortunately, it looks like the easy way to do these uses GCC's ability to take the address of local labels. I'll add that if I can refactor the machine enough so that both variants can be produced. Have a look at http://www.complang.tuwien.ac.at/anton/vmgen/ which is the virtual machine generator used for gforth (but applicable to other things). I may use this.
Hmmm. If LLVM weren't so huge and such a pain to install, I might recommend looking at using it. It could make a fun experiment, though.
Yeah, I had a look at that. PyPy is using it, so things could be stolen from that. Fortunately, our virtual machine can be simpler than most, because we don't have conditionals or jumps :-)

-- David M. Cooke <cookedm@physics.mcmaster.ca>
David M. Cooke wrote:
1. David already mentioned opcodes to copy from an integer array into a float register. Another idea would be to have an opcode to copy from a strided array into a register. This would save a copy for noncontiguous matrices and open the door to broadcasting. I think we'd have to tweak the main interpreter loop a bit, but it's doable. Depending on how crunched we feel for opcode space, the casting opcode and this one could overlap (OP_CAST_AND_COPY, for instance).
Hadn't thought about doing strided arrays like that; it should work.
This opens another interesting can of worms: two-stage compilation. If we aren't copying the arrays on the way in, it might make sense to have an abstract opcode OP_COPY_ANY that copies data from any allowable source. This would be produced during the compilation stage. When numexpr was called, before the loop was executed, OP_COPY_ANY would be replaced depending on the input value. For example:

    OP_COPY_ANY destreg input

would translate to:

    OP_COPY_INT16ARR destreg input   # input is an int array
    OP_COPY_FLT32SCL destreg input   # input is an FP scalar

etc. The data would then be copied from the array into a register with appropriate striding and casting. If 'input' was already a contiguous double array, it could translate into simply setting a pointer, as is done now (OP_COPY_SIMPLE perhaps).

[SNIP]
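A guess at what the second, call-time stage might look like, sketched in Python; OP_COPY_ANY and OP_COPY_SIMPLE come from the message above, while the other opcode names and the program layout are hypothetical stand-ins:

    import numpy as np

    OP_COPY_ANY, OP_COPY_SIMPLE, OP_COPY_STRIDED, OP_COPY_CAST = range(4)

    def specialize(program, inputs):
        # Rewrite each abstract copy based on the argument actually passed.
        out = []
        for op, dest, src in program:
            if op == OP_COPY_ANY:
                arg = inputs[src]
                if arg.dtype != np.float64:
                    op = OP_COPY_CAST      # cast while copying (e.g. int16)
                elif not arg.flags.c_contiguous:
                    op = OP_COPY_STRIDED   # gather with strides
                else:
                    op = OP_COPY_SIMPLE    # contiguous double: set a pointer
            out.append((op, dest, src))
        return out

    prog = [(OP_COPY_ANY, 1, 0)]
    print(specialize(prog, [np.arange(6, dtype=np.int16)]))  # -> CAST
    print(specialize(prog, [np.zeros((4, 5))[:, 1]]))        # -> STRIDED
    print(specialize(prog, [np.zeros(6)]))                   # -> SIMPLE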
Right now, here's my thoughts on where to go:
1. I'm rewriting the compiler to be easier to play with. Top things on the list are better register allocation, and common subexpression elimination.
Cool.
2. Replace individual constants with "constant arrays": repetitions of the constant in the vector registers.
I'm already using this trick to a certain extent to reduce the number of func_xy opcodes. I copy the constants into arrays using OP_COPY_C. This sounds like essentially the same thing as what you are doing.
3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.
Do you mean that sum() and friends could only occur as the outermost function? That is, "sum(a+3*b)" would work, but "where(a, sum(a+3*b), c)" would not? Or am I misunderstanding you here? I don't think I like that limitation, if that's the case. I don't think it should be necessary, either.
4. Add complex numbers. If it doesn't look really messy, we could add them to the current machine. Otherwise, we could make a machine that works on complex numbers only.
It doesn't seem like it should be bad at all. The slickest thing would be to use two adjacent float buffers for a complex buffer. That would make the compiler a bit more complex, but it would keep the interpreter simpler, as there would only be a single buffer type. All that needs to be supported are the basic operations (+, -, *, /, // and **); comparisons don't work for complex numbers anyway, and all of the functions can go through the function pointers since they're slow anyway. The one exception is where, which would mix complex and float and should be fast.

The other question is integers. There would be some advantages to supporting mixed integer and floating-point operations. This adds a larger crop of operators, though: the basic ones as above, plus comparisons, plus bitwise operators.
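A sketch of the two-adjacent-buffers idea for complex multiply, with plain numpy float arrays standing in for the interpreter's registers; the function and variable names are invented for illustration:

    import numpy as np

    BLOCK = 128

    def c_mul(out_re, out_im, a_re, a_im, b_re, b_im):
        # (a_re + i*a_im) * (b_re + i*b_im), written entirely against
        # float blocks, so the machine needs only one buffer type.
        np.copyto(out_re, a_re * b_re - a_im * b_im)
        np.copyto(out_im, a_re * b_im + a_im * b_re)

    a_re, a_im = np.full(BLOCK, 1.0), np.full(BLOCK, 2.0)    # 1+2j repeated
    b_re, b_im = np.full(BLOCK, 3.0), np.full(BLOCK, -1.0)   # 3-1j repeated
    out_re, out_im = np.empty(BLOCK), np.empty(BLOCK)
    c_mul(out_re, out_im, a_re, a_im, b_re, b_im)
    print(out_re[0], out_im[0])   # 5.0 5.0, i.e. (1+2j)*(3-1j)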
5. Currently, we use a big switch statement. There are better ways (taken from Forth): indirect and direct threading. Unfortunately, it looks like the easy way to do these uses GCC's ability to take the address of local labels. I'll add that if I can refactor the machine enough so that both variants can be produced. Have a look at http://www.complang.tuwien.ac.at/anton/vmgen/ which is the virtual machine generator used for gforth (but applicable to other things). I may use this.
Better as in lower overhead? Or better as in simpler to write? The current code is competitive with weave already which would seem to indicate that we're already dominated by the overhead of the FP math, not the big switch statement. I'd only rewrite things if it's going to be simpler to work with.
6. Replace register numbers in the program with actual pointers to the correct register. That should remove a layer of indirection.
This would be done at function calling time as well? So this is more two-stage compilation stuff? -tim
Tim Hochberg wrote:
3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.
Do you mean that sum() and friends could only occur as the outermost function? That is, "sum(a+3*b)" would work, but "where(a, sum(a+3*b), c)" would not? Or am I misunderstanding you here? I don't think I like that limitation, if that's the case. I don't think it should be necessary, either.
OK, I thought about this some more and I think I was mostly wrong. I suspect that reduction does need to happen as the last step. Still, it would be nice for "where(a, sum(a+3*b), c)" to work. This could be done by doing the following transformation:

    a = evaluate("where(a, sum(a+3*b), c)")
    =>
    temp = evaluate("sum(a+3*b)"); a = evaluate("where(a, temp, c)")

I suspect that this would be fairly easy to do automagically as long as it was at the Python level. That is, numexpr would return a Python object that would call the lower-level interpreter.numexpr appropriately. This would have some other advantages as well -- it would make it easy to deal with keyword arguments, for one. It would also make it easy to do the bytecode rewriting if we decide to go that route. It does add more call overhead, but if that turns out to be a problem we can move stuff into C later.

I'm still not excited about summing over the whole output buffer, though. That ends up allocating and scanning through a whole extra buffer, which may result in a significant speed and memory hit for large arrays. Since we're only doing this on the way out, there should be no problem just allocating a single double (or complex) to do the sum in. On the way in, this could be set to zero or one based on whether the last opcode is a sum or a product. Then the SUM opcode could simply do something like the sketch below.

BTW, the cleanup of the interpreter looks pretty slick. I'm going to look at timings for using COPY_C versus using add directly and see about reducing the number of opcodes. If this works out OK, the number of comparison opcodes could be reduced a lot.

Sorry about munging the line endings.

-tim
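The snippet this message refers to did not survive in the archive; here is a guess at the shape of it, written as Python pseudocode for the interpreter loop rather than the actual C, with evaluate_block as a stand-in for the compiled inner loop:

    import numpy as np

    BLOCK = 128

    def evaluate_block(inputs, start):
        # Stand-in for the compiled program: computes a + 3*b on one block.
        a, b = (x[start:start + BLOCK] for x in inputs)
        return a + 3.0 * b

    def run_sum(inputs):
        acc = 0.0                 # the single double for the reduction;
                                  # it would start at 1.0 for a product
        for start in range(0, len(inputs[0]), BLOCK):
            out = evaluate_block(inputs, start)
            acc += out.sum()      # the SUM step: fold one block into acc
        return acc

    a, b = np.arange(1000.0), np.ones(1000)
    print(run_sum([a, b]), np.sum(a + 3 * b))   # both 502500.0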
On Tue, Mar 07, 2006 at 11:33:45AM -0700, Tim Hochberg wrote:
Tim Hochberg wrote:
3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.
I'm still not excited about summing over the whole output buffer, though. That ends up allocating and scanning through a whole extra buffer, which may result in a significant speed and memory hit for large arrays. Since we're only doing this on the way out, there should be no problem just allocating a single double (or complex) to do the sum in. On the way in, this could be set to zero or one based on whether the last opcode is a sum or a product. Then the SUM opcode could simply do something like the sketch below.
No, no, we'd just sum over the 128-element output vector (mem[0]) and add the result to the cumulative sum. That vector should already be in cache, as the last op would put it there.
BTW, the cleanup of the interpreter looks pretty slick.
Not finished yet :-) Look for a checkin today (if I have time).

-- David M. Cooke <cookedm@physics.mcmaster.ca>
David M. Cooke wrote:
On Tue, Mar 07, 2006 at 11:33:45AM -0700, Tim Hochberg wrote:
Tim Hochberg wrote:
3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.
I'm still not excited about summing over the whole output buffer, though. That ends up allocating and scanning through a whole extra buffer, which may result in a significant speed and memory hit for large arrays. Since we're only doing this on the way out, there should be no problem just allocating a single double (or complex) to do the sum in. On the way in, this could be set to zero or one based on whether the last opcode is a sum or a product. Then the SUM opcode could simply do something like the sketch below.
No, no, we'd just sum over the 128-element output vector (mem[0]) and add the result to the cumulative sum. That vector should already be in cache, as the last op would put it there.
Ah! Ok then. That's what I was thinking of too. For some reason I thought you were proposing building the whole result vector then summing it. Here's another wrinkle: how do we deal with:
    >>> a = reshape(arange(9), (3,3))
    >>> sum(a)
    array([ 9, 12, 15])
Just forbid it? For the time being at least?
BTW, the cleanup of the interpreter looks pretty slick.
Not finished yet :-) Look for a checkin today (if I have time).
They didn't seem to have any speed advantage, so I ripped out all the compare-with-constant opcodes and used COPY_C instead. I'm probably going to rip out OP_WHERE_XXC and OP_WHERE_XCX depending on the timings there. Should I also kill OP_ADD_C and friends while I'm at it?

-tim
On Tue, Mar 07, 2006 at 12:08:00PM -0700, Tim Hochberg wrote:
David M. Cooke wrote:
On Tue, Mar 07, 2006 at 11:33:45AM -0700, Tim Hochberg wrote:
Tim Hochberg wrote:
3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.
I'm still not excited about summing over the whole output buffer, though. That ends up allocating and scanning through a whole extra buffer, which may result in a significant speed and memory hit for large arrays. Since we're only doing this on the way out, there should be no problem just allocating a single double (or complex) to do the sum in. On the way in, this could be set to zero or one based on whether the last opcode is a sum or a product. Then the SUM opcode could simply do something like the sketch below.
No, no, we'd just sum over the 128-element output vector (mem[0]) and add the result to the cumulative sum. That vector should already be in cache, as the last op would put it there.
Ah! Ok then. That's what I was thinking of too. For some reason I thought you were proposing building the whole result vector then summing it.
Here's another wrinkle: how do we deal with:
    >>> a = reshape(arange(9), (3,3))
    >>> sum(a)
    array([ 9, 12, 15])
Just forbid it? For the time being at least?
Well, I think for now, sum(a) = sum(a.ravel()). Doing it differently will mean working with dimensions, which is something else to figure out.
They didn't seem to have any speed advantage, so I ripped out all the compare-with-constant opcodes and used COPY_C instead. I'm probably going to rip out OP_WHERE_XXC and OP_WHERE_XCX depending on the timings there. Should I also kill OP_ADD_C and friends while I'm at it?
Don't bother; the version I'm going to check in already has them gone.

-- David M. Cooke <cookedm@physics.mcmaster.ca>
I made some more changes to numexpr, including adding the optimizations to pow that we earlier added to numpy. It would be nice to have some way to select the level of optimization. That way, we could work on very aggressive optimization without worrying about messing up someone later who can't tolerate the moderate accuracy loss.

The only idea I have thus far is to create a new Expression each time, instead of just using the precreated 'E'. This would allow us to pass a context into Expression, and this could be passed on to the Nodes that were subsequently created from it, so that it would be available where it's needed. That may not make any sense to anyone but David, but I invite you to look at numexpr/expression.py and numexpr/compiler.py. Then it'll be clear as,..., well, clear as something.

-tim
I just checked in some changes that do aggressive optimization on the pow operator in numexpr. Now all integral and half-integral powers in [-50, 50] are computed using multiplies and sqrt. (Empirically, 50 seemed to be the closest round number to the break-even point.)

I mention this primarily because I think it's cool. But also, it's the kind of optimization that I don't think would be feasible in numpy itself short of defining a whole pile of special cases, either separate ufuncs or separate loops within a single ufunc, one for each case that needed optimizing. Otherwise the bookkeeping overhead would overwhelm the savings of replacing pow with multiplies. Now all of the bookkeeping is done in Python, which makes it easy, and done once ahead of time and translated into bytecode, which makes it fast.

The actual code that does the optimization is included below for those of you interested enough to care, but not interested enough to check it out of the sandbox. It could be made simpler, but I jump through some hoops to avoid unnecessary multiplies. For instance, starting 'r' as OpNode('ones_like', [a]) would simplify things significantly, but at the cost of adding an extra multiply in most cases.

That brings up an interesting idea. If 'mul' were made smarter, so that it recognized OpNode('ones_like', [a]) and ConstantNode(1), then not only would that speed some 'mul' cases up, it would simplify the code for 'pow' as well. I'll have to look into that tomorrow.

Regards,

-tim

    if True:  # Aggressive
        RANGE = 50  # Approximate break-even point with pow(x,y)
        # Optimize all integral and half-integral powers in [-RANGE, RANGE].
        # Note: for complex numbers, RANGE would be larger.
        if (int(2*x) == 2*x) and (-RANGE <= abs(x) <= RANGE):
            n = int(abs(x))
            ishalfpower = int(abs(2*x)) % 2
            r = None
            p = a
            mask = 1
            while True:
                if (n & mask):
                    if r is None:
                        r = p
                    else:
                        r = OpNode('mul', [r, p])
                mask <<= 1
                if mask > n:
                    break
                p = OpNode('mul', [p, p])
            if ishalfpower:
                sqrta = OpNode('sqrt', [a])
                if r is None:
                    r = sqrta
                else:
                    r = OpNode('mul', [r, sqrta])
            if r is None:
                r = OpNode('ones_like', [a])
            if x < 0:
                r = OpNode('div', [ConstantNode(1), r])
            return r
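As a concrete check of the expansion the code above builds, here is what it produces for x = 3.5 (n = 3, binary 11, plus a half power), written in plain numpy rather than the OpNode machinery:

    import numpy as np

    a = np.linspace(0.5, 2.0, 5)
    expanded = (a * (a * a)) * np.sqrt(a)   # three multiplies and a sqrt
    print(np.allclose(expanded, a ** 3.5))  # True, to within rounding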
Tim Hochberg <tim.hochberg@cox.net> writes:
I just checked in some changes that do aggressive optimization on the pow operator in numexpr. Now all integral and half-integral powers in [-50, 50] are computed using multiplies and sqrt. (Empirically, 50 seemed to be the closest round number to the break-even point.)
I mention this primarily because I think it's cool. But also, it's the kind of optimization that I don't think would be feasible in numpy itself short of defining a whole pile of special cases, either separate ufuncs or separate loops within a single ufunc, one for each case that needed optimizing. Otherwise the bookkeeping overhead would overwhelm the savings of replacing pow with multiplies.
Now all of the bookkeeping is done in Python, which makes it easy, and done once ahead of time and translated into bytecode, which makes it fast. The actual code that does the optimization is included below for those of you interested enough to care, but not interested enough to check it out of the sandbox. It could be made simpler, but I jump through some hoops to avoid unnecessary multiplies. For instance, starting 'r' as OpNode('ones_like', [a]) would simplify things significantly, but at the cost of adding an extra multiply in most cases.
That brings up an interesting idea. If 'mul' were made smarter, so that it recognized OpNode('ones_like', [a]) and ConstantNode(1), then not only would that speed some 'mul' cases up, it would simplify the code for 'pow' as well. I'll have to look into that tomorrow.
Instead of using a separate ones_like opcode, why don't you just add a ConstantNode(1)?

-- David M. Cooke <cookedm@physics.mcmaster.ca>
David M. Cooke wrote:
Tim Hochberg <tim.hochberg@cox.net> writes:
I just checked in some changes that do aggressive optimization on the pow operator in numexpr. Now all integral and half-integral powers in [-50, 50] are computed using multiplies and sqrt. (Empirically, 50 seemed to be the closest round number to the break-even point.)
I mention this primarily because I think it's cool. But also, it's the kind of optimization that I don't think would be feasible in numpy itself short of defining a whole pile of special cases, either separate ufuncs or separate loops within a single ufunc, one for each case that needed optimizing. Otherwise the bookkeeping overhead would overwhelm the savings of replacing pow with multiplies.
Now all of the bookkeeping is done in Python, which makes it easy, and done once ahead of time and translated into bytecode, which makes it fast. The actual code that does the optimization is included below for those of you interested enough to care, but not interested enough to check it out of the sandbox. It could be made simpler, but I jump through some hoops to avoid unnecessary multiplies. For instance, starting 'r' as OpNode('ones_like', [a]) would simplify things significantly, but at the cost of adding an extra multiply in most cases.
That brings up an interesting idea. If 'mul' were made smarter, so that it recognized OpNode('ones_like', [a]) and ConstantNode(1), then not only would that speed some 'mul' cases up, it would simplify the code for 'pow' as well. I'll have to look into that tomorrow.
Instead of using a separate ones_like opcode, why don't you just add a ConstantNode(1)?
You think I can remember something like that a month or whatever later? You may be giving me too much credit ;-)

Hmm..... OK, I think I remember. IIRC, the reason is that ConstantNode(1) is slightly different from OpNode('ones_like', [a]) in some corner cases. Specifically, a**0 will produce an array of ones in one case, but a scalar 1 in the other.

-tim
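The corner case, in plain numpy terms (numexpr's node classes are not involved here): keeping ones_like preserves an array shaped like a, while folding to a constant would yield a bare scalar.

    import numpy as np

    a = np.arange(3.0)
    print(a ** 0)           # array([1., 1., 1.]) -- what ones_like preserves
    print(np.float64(1.0))  # 1.0 -- what a folded ConstantNode(1) would give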