[Cython] gsoc: array expressions
markflorisson88 at gmail.com
Mon May 21 13:11:49 CEST 2012
On 21 May 2012 12:08, mark florisson <markflorisson88 at gmail.com> wrote:
> On 21 May 2012 11:34, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>> For my gsoc we already have some simple initial ideas, i.e.
>>> elementwise vector expressions (a + b with a and b arrays with
>>> arbitrary rank), I don't think these need any discussion. However,
>>> there are a lot of things that haven't been formally discussed on the
>>> mailing list, so here goes.
>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>> mailing list, and I think your insights as a Theano developer can be
>>> very helpful in this discussion.
>>> User Interface
>>> Besides simple array expressions for dense arrays I would like a
>>> mechanism for "custom ufuncs", although to a different extent to what
>>> Numpy or Numba provide. There are several ways in which we could want
>>> them, e.g. as typed functions (cdef, or external C) functions, as
>>> lambas or Python functions in the same module, or as general objects
>>> (e.g. functions Cython doesn't know about).
>>> To achieve maximum efficiency it will likely be good to allow sharing
>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>> would prefer annotated def functions where the parameters are
>>> specialized on demand, e.g.
>>> def add(a, b): # elemental functions can have any number of arguments
>>> and operate on any compatible dtype
>>> return a + b
>>> When calling cdef functions or elemental functions with memoryview
>>> arguments, the arguments perform a (broadcasted) elementwise
>>> operation. Alternatively, we can have a parallel.elementwise function
>>> which maps the function elementwise, which would also work for object
>>> callables. I prefer the former, since I think it will read much
>>> Secondly, we can have a reduce function (and maybe a scan function),
>>> that reduce (respectively scan) in a specified axis or number of axes.
>>> parallel.reduce(add, a, b, axis=(0, 2))
>>> where the default for axis is "all axes". As for the default value,
>>> this could be perhaps optionally provided to the elemental decorator.
>>> Otherwise, the reducer will have to get the default values from each
>>> dimension that is reduced in, and then skip those values when
>>> reducing. (Of course, the reducer function must be associate and
>>> commutative). Also, a lambda could be passed in instead of an
>> Only associative, right?
>> Sounds good to me.
>>> elementwise or typed cdef function.
>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>> function, which would iterate over N memoryviews, and provide a
>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>> should provide only the indices, or also the values. e.g. an inplace
>>> elementwise add would read as follows:
>>> for i, j, k in parallel.nditerate(A, B):
>>> A[i, j, k] += B[i, j, k]
>> I think this sounds good; I guess don't see a particular reason for
>> "ndenumerate", I think code like the above is clearer.
>> It's perhaps worth at least thinking about how to support "for idx in ...",
>> "A[idx, Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in
>> first iteration though.
>> Putting it in "parallel" is nice because prange already have out-of-order
>> semantics.... But of course, there are performance benefits even within a
>> single thread because of the out-of-order aspect. This should at least be a
>> big NOTE box in the documentation.
>>> Frédéric, feel free to correct me at any point here :)
>>> As for the implementation, I think it will be a good idea to at least
>>> reuse (optionally through command line flags) Theano's optimization
>>> pipeline. I think it would be reasonably easy to build a Theano
>>> expression graph (after fusing the expressions in Cython first), run
>>> the Theano optimizations on that and map back to a Cython AST.
>>> Optionally, we could store a pickled graph representation (or even
>>> compiled theano function?), and provide it as an optional
>>> specialization at runtime (but mapping back correctly to memoryviews
>>> where needed, etc). As Numba matures, a numba runtime specialization
>>> could optionally be provided.
>> Can you enlighten us a bit about what Theano's optimizations involve? You
>> mention doing the iteration specializations yourself below, and also the
>> Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and
>> numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided
>> by Theano?
>> If so, such optimizations should be done also for our scalar computations,
>> not just vector, right?
> As for this, I don't think CSE is important for scalar computations,
> since if they are objects, you have to go through a Python layer since
> you have no idea what the code does, and if they are C objects, the C
> compiler will readily optimize that out. You want this for vector
> expressions since the computations may be expensive and the C
> compiler may not optimize them. E.g. consider the silly example of v1
> * sum(A) + v2 * sum(A). It's more convenient to write than to
> introduce a new variable manually.
>> Or does Theano deal with the memory access patterns?
>>> As for the implementation of the C specializations, I currently think
>>> we should implement our own, since theano heavily uses the numpy C
>>> API, and since its easier to have an optional theano runtime
>>> specialization anyway. I propose the following specializations, to be
>>> selected at runtime
>>> - vectorized contiguous, collapsed and aligned
>>> - this function can be called by a strided, inner dimension
>>> contiguous specialization
>>> - tiled (ndenumerate/nditerate)
>>> - tiled vectorized
>>> - plain C loops
>>> With 'aligned' it is not meant that the data itself should be aligned,
>>> but that they are aligned at the same (unaligned) offset.
>> Sounds good.
>> About implementing this in Cython, would you simply turn
>> a[...] = b + c
>> for i, j, k in parallel.nditerate(a, b, c):
>> a[i, j, k] = b[i, j, k] + c[i, j, k]
>> and then focus on optimizing nditerate? That seemed like the logical path to
>> me, but perhaps that doesn't play nicely with using Theano to optimize the
>> expression? (Again I need a clearer picture of what this involves.)
> I don't think so, since I think we want to try explicit vectorization.
> I think the iteration and tiling mechanism etc will be shared by both,
> but we wouldn't provide that direct mapping (I think). Although maybe
> we could insert a VectorAssignment with VectorScalars... let's see, in
> any case both will be optimized in the same way, except that with
> vector expressions you know you're always getting the best the
> compiler can do.
(Where of course the nditerate loop striding pattern would be patched
>> (Yes, AMD and I guess probably Intel too seem to have moved towards making
>> unaligned MOV as fast as aligned MOV I guess, so no need to worry about
>>> A runtime compiler could probably do much better here and allow for
>>> shuffling in the vectorized code for a minimal subset of the operands.
>>> Maybe it would be useful to provide a vectorized version where each
>>> operand is shuffled and the shuffle arguments are created up front?
>>> That might still be faster than non-vectorized... Anyway, the most
>>> important part would be tiling and proper memory access patterns.
>>> Which specialization is picked depends on a) which flags were passed
>>> to Cython, b) the runtime memory layout and c) what macros were
>>> defined when the Cython module was compiled.
>>> Criticism and constructive discussion welcome :)
>>> (heading out for lunch now)
>> cython-devel mailing list
>> cython-devel at python.org
More information about the cython-devel