[Cython] gsoc: array expressions

Thu May 24 07:46:20 CEST 2012

On Tue, May 22, 2012 at 6:16 AM, mark florisson
<markflorisson88 at gmail.com> wrote:
> On 22 May 2012 07:48, Robert Bradshaw <robertwb at gmail.com> wrote:
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>>
>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>>>
>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>>
>>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> For my gsoc we already have some simple initial ideas, i.e.
>>>>>> elementwise vector expressions (a + b with a and b arrays with
>>>>>> arbitrary rank), I don't think these need any discussion. However,
>>>>>> there are a lot of things that haven't been formally discussed on the
>>>>>> mailing list, so here goes.
>>>>>>
>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>> mailing list, and I think your insights as a Theano developer can be
>>>>>> very helpful in this discussion.
>>>>>>
>>>>>> User Interface
>>>>>> ===========
>>>>>> Besides simple array expressions for dense arrays I would like a
>>>>>> mechanism for "custom ufuncs", although to a different extent to what
>>>>>> Numpy or Numba provide. There are several ways in which we could want
>>>>>> them, e.g. as typed functions (cdef, or external C) functions, as
>>>>>> lambas or Python functions in the same module, or as general objects
>>>>>> (e.g. functions Cython doesn't know about).
>>>>>> To achieve maximum efficiency it will likely be good to allow sharing
>>>>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>>>>> would prefer annotated def functions where the parameters are
>>>>>> specialized on demand, e.g.
>>>>>>
>>>>>> @elemental
>>>>>> def add(a, b): # elemental functions can have any number of arguments
>>>>>> and operate on any compatible dtype
>>>>>>     return a + b
>>>>>>
>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>> arguments, the arguments perform a (broadcasted) elementwise
>>>>>> operation. Alternatively, we can have a parallel.elementwise function
>>>>>> which maps the function elementwise, which would also work for object
>>>>>> callables. I prefer the former, since I think it will read much
>>>>>> easier.
>>>>>>
>>>>>> Secondly, we can have a reduce function (and maybe a scan function),
>>>>>> that reduce (respectively scan) in a specified axis or number of axes.
>>>>>> E.g.
>>>>>>
>>>>>>     parallel.reduce(add, a, b, axis=(0, 2))
>>>>>>
>>>>>> where the default for axis is "all axes". As for the default value,
>>>>>> this could be perhaps optionally provided to the elemental decorator.
>>>>>> Otherwise, the reducer will have to get the default values from each
>>>>>> dimension that is reduced in, and then skip those values when
>>>>>> reducing. (Of course, the reducer function must be associate and
>>>>>> commutative). Also, a lambda could be passed in instead of an
>>>>>
>>>>>
>>>>>
>>>>> Only associative, right?
>>>>>
>>>>> Sounds good to me.
>>>>>
>>>>>
>>>>>> elementwise or typed cdef function.
>>>>>>
>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>> function, which would iterate over N memoryviews, and provide a
>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>>>>> should provide only the indices, or also the values. e.g. an inplace
>>>>>> elementwise add would read as follows:
>>>>>>
>>>>>>     for i, j, k in parallel.nditerate(A, B):
>>>>>>         A[i, j, k] += B[i, j, k]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I think this sounds good; I guess don't see a particular reason for
>>>>> "ndenumerate", I think code like the above is clearer.
>>>>
>>>>
>>>> I'm assuming the index computations would not be re-done in this case
>>>> (i.e. there's more magic going on here than looks like at first
>>>> glance)? Otherwise there is an advantage to ndenumerate.
>>>
>>>
>>> Ideally, there is a lot more magic going on, though I don't know how far
>>> Mark wants to go.
>>>
>>> Imagine "nditerate(A, A.T)", in that case it would have to make many small
>>> tiles so that for each tile being processed, A has a tile in cache and A.T
>>> has another tile in cache (so that one doesn't waste cache line transfers).
>>>
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
>>
>> Yes, being clever about the order in which to iterate over the indices
>> is the hard problem to solve here. I was thinking more in terms of the
>> inner loop iterating over the innermost dimension only to do the
>> indexing (retrieval and assignment), similar to how the generic NumPy
>> iterator works.
>
> That's a valid point, but my experience has been that any worthy C
> compiler will do common subexpression elimination for the outer
> dimensions and not recompute the offset every time. It actually
> generated marginally faster code for scalar assignment than a
> "cascaded pointer assignment", i.e. faster than
>
> p0 = data;
> for (...) {
>    p1 = p0 + i * strides[0]
>    for (...) {
>        p2 = p1 + j * strides[1]
>        ...
>    }
> }
>
> (haven't tried manual strength reduction there though).

That's a good point, though "for(p2=p1; p2 < precomputed; p2 +=
stride1) {...}" is probably a better manual reduction. I concede that
compilers are really smart about this kind of stuff these days though
(though they might not be able to infer that, for example, strides
doesn't change).

- Robert