Tim Hochberg
1. David already mentioned opcodes to copy from an integer array into a float register. Another idea would be to have an opcode to copy from a strided array into a register. This would save a copy for noncontiguous matrices and opens up the door to doing broadcasting. I think we'd have to tweak the main interpreter loop a bit, but I think it's doable. Depending on how crunched we feel for opcode space, the casting opcode and this one could overlap (OP_CAST_AND_COPY, for instance).
Hadn't thought about doing strided arrays like that; it should work.
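To make the idea concrete, here is a sketch of what a strided-copy opcode's inner loop might do: gather a non-contiguous operand directly into a contiguous block-sized register. The names (BLOCK_SIZE, copy_strided) and the element-stride convention are illustrative, not numexpr's actual API.

```python
# Sketch: gather a strided (non-contiguous) operand into a contiguous
# "register" buffer, as a hypothetical OP_COPY_STRIDED might do.
BLOCK_SIZE = 128  # the interpreter works on blocks of this size

def copy_strided(src, start, stride, n, register):
    """Copy n elements of src (stride counted in elements) into register."""
    for i in range(n):
        register[i] = src[start + i * stride]

# A 10x4 matrix stored flat, row-major; its first column has stride 4.
flat = [float(i) for i in range(40)]
reg = [0.0] * BLOCK_SIZE
copy_strided(flat, 0, 4, 10, reg)
print(reg[:10])   # the column: [0.0, 4.0, 8.0, ..., 36.0]
```

The same loop with a zero stride on one operand is essentially broadcasting a scalar, which is why this opens the door to broadcasting.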
2. Something I noticed while writing tests is that the rearrangement of operations for 'a/const' to '(1/const)*a' changes the results slightly (try const=3). I don't think I care -- I just thought I'd point it out.
I don't care either :-) Although it may be worthwhile throwing in a compiler option to keep a/const.
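The difference comes from two extra roundings: 1/const is itself rounded, and the multiply rounds again, so the result can be off by one ulp from the directly rounded division. A quick demonstration with const=3:

```python
# (1/const)*a rounds twice (once for 1/const, once for the multiply),
# so it can differ from a/const in the last bit for some inputs.
const = 3.0
inputs = [1.0, 2.0, 5.0, 7.0, 11.0]
direct = [a / const for a in inputs]
rearranged = [(1.0 / const) * a for a in inputs]
mismatches = [a for a, d, r in zip(inputs, direct, rearranged) if d != r]
print(mismatches)   # non-empty: some inputs round differently
```

The mismatches are all a single ulp, which is why a compiler option to disable the rewrite is probably only needed for the pickiest users.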
3. It may simplify things to use copy_c to eliminate a bunch of the extra bytecodes for operating on functions of more than one argument. I need to check the timings on this -- I don't know how much of a speed hit using copy_c will cause.
I'm implementing another solution: I'm getting rid of individual constant operators, like 'add_c'. Constants will be stored in the temps array, as 128 repetitions of the constant. That way 'add' (for instance) can operate on both vector + vector and vector + scalar: in the first case it takes the register numbers of the two vectors, and in the second case the scalar argument is just the register number of the repeated constant. Timings on my iBook show this is actually slightly faster. I didn't expect that :-)
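A small sketch of the "constant array" idea, with invented names (load_constant, op_add): the scalar is materialized as BLOCK_SIZE repetitions in an ordinary register, so a single 'add' handler serves both cases and only ever sees register numbers.

```python
# Sketch: a scalar constant becomes a register full of repetitions, so
# one 'add' opcode covers vector+vector and vector+scalar alike.
BLOCK_SIZE = 128

def load_constant(c):
    return [c] * BLOCK_SIZE

def op_add(registers, dest, r1, r2):
    registers[dest] = [x + y for x, y in zip(registers[r1], registers[r2])]

registers = {
    0: [float(i) for i in range(BLOCK_SIZE)],  # a vector operand
    1: load_constant(2.5),                     # the scalar, repeated
}
op_add(registers, 2, 0, 1)   # vector + "scalar", same opcode either way
print(registers[2][:3])      # [2.5, 3.5, 4.5]
```

The win is fewer opcodes and a smaller switch, at the cost of one block-sized fill per constant, which the timings suggest is cheap.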
4. At some point I plan to add complex operations. My original thought was to just duplicate the current, floating-point interpreter for complex numbers. However, if we have opcodes for casting, it might instead make sense to have one interpreter that works with both. This would be more efficient for mixed expressions, since we wouldn't have to cast everything to complex right away. It looks like there's enough opcode space, although I worry a bit whether making that main switch loop huge is going to slow things down. Probably not, but it's something to keep an eye on.
The way I see it, there should be two machines: one for single precision, and one for double precision (maybe one for longdoubles, but I'm not worrying about that now). That's simple enough. Each machine should work with floats, and the complex form of that float. Potentially, cast opcodes could be used to convert single to double, for instance. I'm still thinking about how to fit that.
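As a rough Python analogue of how a single machine could handle a float type and its complex form: a cast opcode promotes a real register to complex only at the point a mixed expression needs it. The handler names (op_cast_fc, op_add_c) and register layout are invented for illustration.

```python
# Sketch: one machine, two types. A cast opcode promotes a real register
# to complex in place, so mixed expressions cast lazily, not up front.
def op_cast_fc(registers, dest, src):
    registers[dest] = [complex(x, 0.0) for x in registers[src]]

def op_add_c(registers, dest, r1, r2):
    registers[dest] = [x + y for x, y in zip(registers[r1], registers[r2])]

registers = {0: [1.0, 2.0], 1: [1j, 2j]}   # real operand, complex operand
op_cast_fc(registers, 2, 0)                 # cast the real one...
op_add_c(registers, 3, 2, 1)                # ...then add as complex
print(registers[3])   # [(1+1j), (2+2j)]
```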
5. Same thoughts as 4., but for integer operations. Potentially the interpreter could work with longs, doubles and cdoubles, downcasting as appropriate on the way out.
I suppose this could be considered a "reduction" operation: instead of collapsing the array, we're collapsing the precision.
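Collapsing the precision on the way out might look like the following sketch: intermediates live as doubles, and the downcast happens only when the output block is written. The names are invented; int() here truncates toward zero, mimicking a C cast.

```python
# Sketch: compute in doubles internally, "collapse the precision" only
# when writing the output block. int(x) truncates toward zero like C.
def write_block(output, offset, register, out_kind):
    for i, x in enumerate(register):
        output[offset + i] = int(x) if out_kind == 'int' else x

reg = [1.9, -2.7, 4.0]      # double-precision intermediates
out = [0, 0, 0]
write_block(out, 0, reg, 'int')
print(out)   # [1, -2, 4]
```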
It's looking like numexpr could come pretty close to evaluating most numpy expressions if we put enough work into it. None of the required changes look like they should slow down the original, simple, fast case. However, we should keep an eye on that as we go along. It's all very cool; I'd like to thank David for providing such a fun toy.
thanks. Right now, here are my thoughts on where to go:

1. I'm rewriting the compiler to be easier to play with. Top things on the list are better register allocation and common subexpression elimination.

2. Replace individual constants with "constant arrays": repetitions of the constant in the vector registers.

3. Reduction. I figure this could be done at the end of the program in each loop: sum or multiply the output register. Downcasting the output could be done here too.

4. Add complex numbers. If it doesn't look really messy, we could add them to the current machine. Otherwise, we could make a machine that works on complex numbers only.

5. Currently, we use a big switch statement. There are ways (taken from Forth) that are better: indirect and direct threading. Unfortunately, it looks like the easy way to do these uses GCC's capability to take the address of local labels. I'll add that if I can refactor the machine enough so that both variants can be produced. Have a look at http://www.complang.tuwien.ac.at/anton/vmgen/ which is the virtual machine generator used for gforth (but applicable to other things). I may use this.

6. Replace register numbers in the program with actual pointers to the correct register. That should remove a layer of indirection.

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke              http://arbutus.physics.mcmaster.ca/dmc/
|cookedm@physics.mcmaster.ca
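P.S. A rough Python analogue of point 5 above, for anyone unfamiliar with the technique: the big switch is replaced by a table indexed by opcode, one lookup per instruction. Real direct threading needs C and GCC's computed-goto extension (goto *table[op]); the opcodes and handlers below are invented for illustration.

```python
# Dispatch-table analogue of the interpreter loop: one indexed lookup
# per instruction instead of a big switch statement.
def op_push(stack, arg): stack.append(arg)
def op_add(stack): stack.append(stack.pop() + stack.pop())
def op_mul(stack): stack.append(stack.pop() * stack.pop())

HANDLERS = {0: op_push, 1: op_add, 2: op_mul}

def run(program):
    stack = []
    for op, *args in program:
        HANDLERS[op](stack, *args)   # indexed dispatch, no switch
    return stack.pop()

# (2 + 3) * 4
result = run([(0, 2), (0, 3), (1,), (0, 4), (2,)])
print(result)   # 20
```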