[Cython] gsoc: array expressions

mark florisson markflorisson88 at gmail.com
Mon May 21 13:08:33 CEST 2012


On 21 May 2012 11:34, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/20/2012 04:03 PM, mark florisson wrote:
>>
>> Hey,
>>
>> For my gsoc we already have some simple initial ideas, i.e.
>> elementwise vector expressions (a + b with a and b arrays with
>> arbitrary rank), I don't think these need any discussion. However,
>> there are a lot of things that haven't been formally discussed on the
>> mailing list, so here goes.
>>
>> Frédéric, I am CCing you since you expressed interest on the numpy
>> mailing list, and I think your insights as a Theano developer can be
>> very helpful in this discussion.
>>
>> User Interface
>> ===========
>> Besides simple array expressions for dense arrays I would like a
>> mechanism for "custom ufuncs", although to a different extent to what
>> Numpy or Numba provide. There are several ways in which we could want
>> them, e.g. as typed functions (cdef, or external C) functions, as
>> lambas or Python functions in the same module, or as general objects
>> (e.g. functions Cython doesn't know about).
>> To achieve maximum efficiency it will likely be good to allow sharing
>> these functions in .pxd files. We have 'cdef inline' functions, but I
>> would prefer annotated def functions where the parameters are
>> specialized on demand, e.g.
>>
>> @elemental
>> def add(a, b): # elemental functions can have any number of arguments
>> and operate on any compatible dtype
>>     return a + b
>>
>> When calling cdef functions or elemental functions with memoryview
>> arguments, the arguments perform a (broadcasted) elementwise
>> operation. Alternatively, we can have a parallel.elementwise function
>> which maps the function elementwise, which would also work for object
>> callables. I prefer the former, since I think it will read much
>> easier.
>>
>> Secondly, we can have a reduce function (and maybe a scan function),
>> that reduce (respectively scan) in a specified axis or number of axes.
>> E.g.
>>
>>     parallel.reduce(add, a, b, axis=(0, 2))
>>
>> where the default for axis is "all axes". As for the default value,
>> this could be perhaps optionally provided to the elemental decorator.
>> Otherwise, the reducer will have to get the default values from each
>> dimension that is reduced in, and then skip those values when
>> reducing. (Of course, the reducer function must be associate and
>> commutative). Also, a lambda could be passed in instead of an
>
>
> Only associative, right?
>
> Sounds good to me.
>
>
>> elementwise or typed cdef function.
>>
>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>> function, which would iterate over N memoryviews, and provide a
>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>> should provide only the indices, or also the values. e.g. an inplace
>> elementwise add would read as follows:
>>
>>     for i, j, k in parallel.nditerate(A, B):
>>         A[i, j, k] += B[i, j, k]
>
>
>
> I think this sounds good; I guess don't see a particular reason for
> "ndenumerate", I think code like the above is clearer.
>
> It's perhaps worth at least thinking about how to support "for idx in ...",
> "A[idx[2], Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in
> first iteration though.
>
> Putting it in "parallel" is nice because prange already have out-of-order
> semantics.... But of course, there are performance benefits even within a
> single thread because of the out-of-order aspect. This should at least be a
> big NOTE box in the documentation.
>
>
>>
>> Implementation
>> ===========
>> Frédéric, feel free to correct me at any point here :)
>>
>> As for the implementation, I think it will be a good idea to at least
>> reuse (optionally through command line flags) Theano's optimization
>> pipeline. I think it would be reasonably easy to build a Theano
>> expression graph (after fusing the expressions in Cython first), run
>> the Theano optimizations on that and map back to a Cython AST.
>> Optionally, we could store a pickled graph representation (or even
>> compiled theano function?), and provide it as an optional
>> specialization at runtime (but mapping back correctly to memoryviews
>> where needed, etc). As Numba matures, a numba runtime specialization
>> could optionally be provided.
>
>
> Can you enlighten us a bit about what Theano's optimizations involve? You
> mention doing the iteration specializations yourself below, and also the
> tiling..
>
> Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and
> numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided
> by Theano?
>
> If so, such optimizations should be done also for our scalar computations,
> not just vector, right?

As for this, I don't think CSE is important for scalar computations,
since if they are objects, you have to go through a Python layer since
you have no idea what the code does, and if they are C objects, the C
compiler will readily optimize that out. You want this for vector
expressions since the computations may be  expensive and the C
compiler may not optimize them. E.g. consider the silly example of v1
* sum(A) + v2 * sum(A). It's more convenient to write than to
introduce a new variable manually.

> Or does Theano deal with the memory access patterns?
>
>
>>
>> As for the implementation of the C specializations, I currently think
>> we should implement our own, since theano heavily uses the numpy C
>> API, and since its easier to have an optional theano runtime
>> specialization anyway. I propose the following specializations, to be
>> selected at runtime
>>
>>     - vectorized contiguous, collapsed and aligned
>>         - this function can be called by a strided, inner dimension
>> contiguous specialization
>>     - tiled (ndenumerate/nditerate)
>>     - tiled vectorized
>>     - plain C loops
>>
>> With 'aligned' it is not meant that the data itself should be aligned,
>> but that they are aligned at the same (unaligned) offset.
>
>
> Sounds good.
>
> About implementing this in Cython, would you simply turn
>
> a[...] = b + c
>
> into
>
> for i, j, k in parallel.nditerate(a, b, c):
>    a[i, j, k] = b[i, j, k] + c[i, j, k]
>
> and then focus on optimizing nditerate? That seemed like the logical path to
> me, but perhaps that doesn't play nicely with using Theano to optimize the
> expression? (Again I need a clearer picture of what this involves.)

I don't think so, since I think we want to try explicit vectorization.
I think the iteration and tiling mechanism etc will be shared by both,
but we wouldn't provide that direct mapping (I think). Although maybe
we could insert a VectorAssignment with VectorScalars... let's see, in
any case both will be optimized in the same way, except that with
vector expressions you know you're always getting the best the
compiler can do.

> (Yes, AMD and I guess probably Intel too seem to have moved towards making
> unaligned MOV as fast as aligned MOV I guess, so no need to worry about
> that.)
>
>
>> A runtime compiler could probably do much better here and allow for
>> shuffling in the vectorized code for a minimal subset of the operands.
>> Maybe it would be useful to provide a vectorized version where each
>> operand is shuffled and the shuffle arguments are created up front?
>> That might still be faster than non-vectorized... Anyway, the most
>> important part would be tiling and proper memory access patterns.
>>
>> Which specialization is picked depends on a) which flags were passed
>> to Cython, b) the runtime memory layout and c) what macros were
>> defined when the Cython module was compiled.
>>
>> Criticism and constructive discussion welcome :)
>>
>> Cheers,
>>
>> Mark
>> (heading out for lunch now)
>
>
> Dag
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel


More information about the cython-devel mailing list