[Cython] gsoc: array expressions
markflorisson88 at gmail.com
Sun May 20 16:03:09 CEST 2012
For my gsoc we already have some simple initial ideas, i.e.
elementwise vector expressions (a + b with a and b arrays with
arbitrary rank), I don't think these need any discussion. However,
there are a lot of things that haven't been formally discussed on the
mailing list, so here goes.
Frédéric, I am CCing you since you expressed interest on the numpy
mailing list, and I think your insights as a Theano developer can be
very helpful in this discussion.
Besides simple array expressions for dense arrays I would like a
mechanism for "custom ufuncs", although to a different extent to what
Numpy or Numba provide. There are several ways in which we could want
them, e.g. as typed functions (cdef, or external C) functions, as
lambas or Python functions in the same module, or as general objects
(e.g. functions Cython doesn't know about).
To achieve maximum efficiency it will likely be good to allow sharing
these functions in .pxd files. We have 'cdef inline' functions, but I
would prefer annotated def functions where the parameters are
specialized on demand, e.g.
def add(a, b): # elemental functions can have any number of arguments
and operate on any compatible dtype
return a + b
When calling cdef functions or elemental functions with memoryview
arguments, the arguments perform a (broadcasted) elementwise
operation. Alternatively, we can have a parallel.elementwise function
which maps the function elementwise, which would also work for object
callables. I prefer the former, since I think it will read much
Secondly, we can have a reduce function (and maybe a scan function),
that reduce (respectively scan) in a specified axis or number of axes.
parallel.reduce(add, a, b, axis=(0, 2))
where the default for axis is "all axes". As for the default value,
this could be perhaps optionally provided to the elemental decorator.
Otherwise, the reducer will have to get the default values from each
dimension that is reduced in, and then skip those values when
reducing. (Of course, the reducer function must be associate and
commutative). Also, a lambda could be passed in instead of an
elementwise or typed cdef function.
Finally, we would have a parallel.nditer/ndenumerate/nditerate
function, which would iterate over N memoryviews, and provide a
sensible memory access pattern (like numpy.nditer). I'm not sure if it
should provide only the indices, or also the values. e.g. an inplace
elementwise add would read as follows:
for i, j, k in parallel.nditerate(A, B):
A[i, j, k] += B[i, j, k]
Frédéric, feel free to correct me at any point here :)
As for the implementation, I think it will be a good idea to at least
reuse (optionally through command line flags) Theano's optimization
pipeline. I think it would be reasonably easy to build a Theano
expression graph (after fusing the expressions in Cython first), run
the Theano optimizations on that and map back to a Cython AST.
Optionally, we could store a pickled graph representation (or even
compiled theano function?), and provide it as an optional
specialization at runtime (but mapping back correctly to memoryviews
where needed, etc). As Numba matures, a numba runtime specialization
could optionally be provided.
As for the implementation of the C specializations, I currently think
we should implement our own, since theano heavily uses the numpy C
API, and since its easier to have an optional theano runtime
specialization anyway. I propose the following specializations, to be
selected at runtime
- vectorized contiguous, collapsed and aligned
- this function can be called by a strided, inner dimension
- tiled (ndenumerate/nditerate)
- tiled vectorized
- plain C loops
With 'aligned' it is not meant that the data itself should be aligned,
but that they are aligned at the same (unaligned) offset.
A runtime compiler could probably do much better here and allow for
shuffling in the vectorized code for a minimal subset of the operands.
Maybe it would be useful to provide a vectorized version where each
operand is shuffled and the shuffle arguments are created up front?
That might still be faster than non-vectorized... Anyway, the most
important part would be tiling and proper memory access patterns.
Which specialization is picked depends on a) which flags were passed
to Cython, b) the runtime memory layout and c) what macros were
defined when the Cython module was compiled.
Criticism and constructive discussion welcome :)
(heading out for lunch now)
More information about the cython-devel