[Cython] gsoc: array expressions

mark florisson markflorisson88 at gmail.com
Mon May 21 12:56:10 CEST 2012

On 21 May 2012 11:34, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/20/2012 04:03 PM, mark florisson wrote:
>> Hey,
>> For my gsoc we already have some simple initial ideas, i.e.
>> elementwise vector expressions (a + b with a and b arrays with
>> arbitrary rank), I don't think these need any discussion. However,
>> there are a lot of things that haven't been formally discussed on the
>> mailing list, so here goes.
>> Frédéric, I am CCing you since you expressed interest on the numpy
>> mailing list, and I think your insights as a Theano developer can be
>> very helpful in this discussion.
>> User Interface
>> ===========
>> Besides simple array expressions for dense arrays I would like a
>> mechanism for "custom ufuncs", although to a different extent to what
>> Numpy or Numba provide. There are several ways in which we could want
>> them, e.g. as typed functions (cdef, or external C) functions, as
>> lambas or Python functions in the same module, or as general objects
>> (e.g. functions Cython doesn't know about).
>> To achieve maximum efficiency it will likely be good to allow sharing
>> these functions in .pxd files. We have 'cdef inline' functions, but I
>> would prefer annotated def functions where the parameters are
>> specialized on demand, e.g.
>> @elemental
>> def add(a, b): # elemental functions can have any number of arguments
>> and operate on any compatible dtype
>>     return a + b
>> When calling cdef functions or elemental functions with memoryview
>> arguments, the arguments perform a (broadcasted) elementwise
>> operation. Alternatively, we can have a parallel.elementwise function
>> which maps the function elementwise, which would also work for object
>> callables. I prefer the former, since I think it will read much
>> easier.
>> Secondly, we can have a reduce function (and maybe a scan function),
>> that reduce (respectively scan) in a specified axis or number of axes.
>> E.g.
>>     parallel.reduce(add, a, b, axis=(0, 2))
>> where the default for axis is "all axes". As for the default value,
>> this could be perhaps optionally provided to the elemental decorator.
>> Otherwise, the reducer will have to get the default values from each
>> dimension that is reduced in, and then skip those values when
>> reducing. (Of course, the reducer function must be associate and
>> commutative). Also, a lambda could be passed in instead of an
> Only associative, right?
> Sounds good to me.

Ah, I guess, because we can reduce thead-local results manually in a
specified (elementwise) order (I was thinking of generating OpenMP
annotated loops, that can be enabled/disabled at the C level, with an
'if' clause with a sensible lower bound of iterations required).

>> elementwise or typed cdef function.
>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>> function, which would iterate over N memoryviews, and provide a
>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>> should provide only the indices, or also the values. e.g. an inplace
>> elementwise add would read as follows:
>>     for i, j, k in parallel.nditerate(A, B):
>>         A[i, j, k] += B[i, j, k]
> I think this sounds good; I guess don't see a particular reason for
> "ndenumerate", I think code like the above is clearer.
> It's perhaps worth at least thinking about how to support "for idx in ...",
> "A[idx[2], Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in
> first iteration though.

Yeah, definitely.

> Putting it in "parallel" is nice because prange already have out-of-order
> semantics.... But of course, there are performance benefits even within a
> single thread because of the out-of-order aspect. This should at least be a
> big NOTE box in the documentation.
>> Implementation
>> ===========
>> Frédéric, feel free to correct me at any point here :)
>> As for the implementation, I think it will be a good idea to at least
>> reuse (optionally through command line flags) Theano's optimization
>> pipeline. I think it would be reasonably easy to build a Theano
>> expression graph (after fusing the expressions in Cython first), run
>> the Theano optimizations on that and map back to a Cython AST.
>> Optionally, we could store a pickled graph representation (or even
>> compiled theano function?), and provide it as an optional
>> specialization at runtime (but mapping back correctly to memoryviews
>> where needed, etc). As Numba matures, a numba runtime specialization
>> could optionally be provided.
> Can you enlighten us a bit about what Theano's optimizations involve? You
> mention doing the iteration specializations yourself below, and also the
> tiling..
> Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and
> numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided
> by Theano?

Yes, it does those kind of things, and it also eliminates common
subexpressions, and it transforms certain expressions to BLAS/LAPACK
functionality. I'm not sure we want that specifically. I'm thinking it
might be more fruitful to start off with a theano-only specialization,
and implement low-level code generation in Theano, and use that from
Cython by either directly dumping in the code, or deferring that to
Theano. At this point I'm not entirely sure.

> If so, such optimizations should be done also for our scalar computations,
> not just vector, right?
> Or does Theano deal with the memory access patterns?

I think it does so for the CUDA backend, but not for the C++ backend.
I think we need to discuss this stuff on the theano mailing list.

>> As for the implementation of the C specializations, I currently think
>> we should implement our own, since theano heavily uses the numpy C
>> API, and since its easier to have an optional theano runtime
>> specialization anyway. I propose the following specializations, to be
>> selected at runtime
>>     - vectorized contiguous, collapsed and aligned
>>         - this function can be called by a strided, inner dimension
>> contiguous specialization
>>     - tiled (ndenumerate/nditerate)
>>     - tiled vectorized
>>     - plain C loops
>> With 'aligned' it is not meant that the data itself should be aligned,
>> but that they are aligned at the same (unaligned) offset.
> Sounds good.
> About implementing this in Cython, would you simply turn
> a[...] = b + c
> into
> for i, j, k in parallel.nditerate(a, b, c):
>    a[i, j, k] = b[i, j, k] + c[i, j, k]
> and then focus on optimizing nditerate? That seemed like the logical path to
> me, but perhaps that doesn't play nicely with using Theano to optimize the
> expression? (Again I need a clearer picture of what this involves.)
> (Yes, AMD and I guess probably Intel too seem to have moved towards making
> unaligned MOV as fast as aligned MOV I guess, so no need to worry about
> that.)
>> A runtime compiler could probably do much better here and allow for
>> shuffling in the vectorized code for a minimal subset of the operands.
>> Maybe it would be useful to provide a vectorized version where each
>> operand is shuffled and the shuffle arguments are created up front?
>> That might still be faster than non-vectorized... Anyway, the most
>> important part would be tiling and proper memory access patterns.
>> Which specialization is picked depends on a) which flags were passed
>> to Cython, b) the runtime memory layout and c) what macros were
>> defined when the Cython module was compiled.
>> Criticism and constructive discussion welcome :)
>> Cheers,
>> Mark
>> (heading out for lunch now)
> Dag
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel

More information about the cython-devel mailing list