[Cython] gsoc: array expressions

mark florisson markflorisson88 at gmail.com
Tue May 22 15:16:44 CEST 2012


On 22 May 2012 07:48, Robert Bradshaw <robertwb at gmail.com> wrote:
> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>
>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>>
>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>
>>>>>
>>>>> Hey,
>>>>>
>>>>> For my gsoc we already have some simple initial ideas, i.e.
>>>>> elementwise vector expressions (a + b with a and b arrays of
>>>>> arbitrary rank); I don't think these need any discussion. However,
>>>>> there are a lot of things that haven't been formally discussed on the
>>>>> mailing list, so here goes.
>>>>>
>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>> mailing list, and I think your insights as a Theano developer can be
>>>>> very helpful in this discussion.
>>>>>
>>>>> User Interface
>>>>> ===========
>>>>> Besides simple array expressions for dense arrays I would like a
>>>>> mechanism for "custom ufuncs", although to a different extent than
>>>>> what Numpy or Numba provide. There are several ways in which we could
>>>>> want them, e.g. as typed (cdef or external C) functions, as
>>>>> lambdas or Python functions in the same module, or as general objects
>>>>> (e.g. functions Cython doesn't know about).
>>>>> To achieve maximum efficiency it will likely be good to allow sharing
>>>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>>>> would prefer annotated def functions where the parameters are
>>>>> specialized on demand, e.g.
>>>>>
>>>>> @elemental
>>>>> def add(a, b):
>>>>>     # elemental functions can have any number of arguments
>>>>>     # and operate on any compatible dtype
>>>>>     return a + b
>>>>>
>>>>> When calling cdef functions or elemental functions with memoryview
>>>>> arguments, the call performs a (broadcasting) elementwise
>>>>> operation over the arguments. Alternatively, we can have a
>>>>> parallel.elementwise function which maps the function elementwise,
>>>>> which would also work for object callables. I prefer the former,
>>>>> since I think it will read much more easily.
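>>>>>
>>>>> A rough sketch of both spellings, using the add function above (the
>>>>> exact parallel.elementwise calling convention here is an assumption,
>>>>> not settled API):
>>>>>
>>>>>     cdef double[:, :] x = ...
>>>>>     cdef double[:] y = ...
>>>>>     # direct call: broadcasts the 1D y against each row of the 2D x
>>>>>     result = add(x, y)
>>>>>     # alternative spelling, also usable with object callables
>>>>>     result = parallel.elementwise(lambda a, b: a + b)(x, y)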
>>>>>
>>>>> Secondly, we can have a reduce function (and maybe a scan function)
>>>>> that reduces (respectively scans) along a specified axis or set of
>>>>> axes. E.g.
>>>>>
>>>>>     parallel.reduce(add, a, b, axis=(0, 2))
>>>>>
>>>>> where the default for axis is "all axes". As for the default value,
>>>>> this could perhaps be optionally provided to the elemental decorator.
>>>>> Otherwise, the reducer will have to get the default values from each
>>>>> dimension that is reduced in, and then skip those values when
>>>>> reducing. (Of course, the reducer function must be associative and
>>>>> commutative). Also, a lambda could be passed in instead of an
>>>>
>>>>
>>>>
>>>> Only associative, right?
>>>>
>>>> Sounds good to me.
>>>>
>>>>
>>>>> elementwise or typed cdef function.
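>>>>>
>>>>> To illustrate the default-value idea, a hypothetical sketch (the
>>>>> default keyword on the decorator is an assumption, not settled API):
>>>>>
>>>>> @elemental(default=0)  # 0: identity used for reduced-out values
>>>>> def add(a, b):
>>>>>     return a + b
>>>>>
>>>>> # reduce along axes 0 and 2, as in the example above
>>>>> parallel.reduce(add, a, b, axis=(0, 2))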
>>>>>
>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>> function, which would iterate over N memoryviews and provide a
>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>>>> should provide only the indices, or also the values. E.g. an in-place
>>>>> elementwise add would read as follows:
>>>>>
>>>>>     for i, j, k in parallel.nditerate(A, B):
>>>>>         A[i, j, k] += B[i, j, k]
>>>>
>>>>
>>>>
>>>>
>>>> I think this sounds good; I guess I don't see a particular reason for
>>>> "ndenumerate", as I think code like the above is clearer.
>>>
>>>
>>> I'm assuming the index computations would not be re-done in this case
>>> (i.e. there's more magic going on here than appears at first
>>> glance)? Otherwise there is an advantage to ndenumerate.
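>>>
>>> E.g. a hypothetical ndenumerate spelling: the yielded value saves
>>> re-doing the read-side index computation, while the indices remain
>>> available for the write-back:
>>>
>>>     for (i, j, k), b in parallel.ndenumerate(B):
>>>         A[i, j, k] += b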
>>
>>
>> Ideally, there is a lot more magic going on, though I don't know how far
>> Mark wants to go.
>>
>> Imagine "nditerate(A, A.T)", in that case it would have to make many small
>> tiles so that for each tile being processed, A has a tile in cache and A.T
>> has another tile in cache (so that one doesn't waste cache line transfers).
>>
>> So those array lookups would potentially look up in different memory
>> buffers, with the strides known at compile time.
>
> Yes, being clever about the order in which to iterate over the indices
> is the hard problem to solve here. I was thinking more in terms of the
> inner loop iterating over the innermost dimension only to do the
> indexing (retrieval and assignment), similar to how the generic NumPy
> iterator works.
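>
> For concreteness, a rough 2D sketch of that inner-loop strategy (plain
> pseudocode, not a proposed API):
>
>     for row in range(A.shape[0]):
>         a_row = A[row]    # 1D slice; outer offset computed once per row
>         b_row = B[row]
>         for k in range(a_row.shape[0]):
>             a_row[k] += b_row[k]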

That's a valid point, but my experience has been that any worthy C
compiler will do common subexpression elimination for the outer
dimensions and not recompute the offset every time. Plain scalar
indexing actually generated marginally faster code than a "cascaded
pointer assignment", i.e. faster than

p0 = data;
for (...) {
    p1 = p0 + i * strides[0];  /* hoisted once per outer iteration */
    for (...) {
        p2 = p1 + j * strides[1];
        ...
    }
}

(I haven't tried manual strength reduction there, though.)

>> Which raises the question: what about this body?
>>
>> if i < 100:
>>     continue
>> else:
>>     A[i, j, k] += B[i - 100, j, k]
>>
>> I guess we just fall back to a non-tiled version? One could of course
>> do some shifting of which tiles of B to grab etc., but there's a limit
>> to how smart one should try to be; one could emit a warning and say
>> that the memoryviews should be sliced and diced into shape before they
>> are passed to nditerate.
>
> Linear transformations of the index variables could probably be
> handled, but that's certainly not v1 (and not too difficult for the
> user to express manually).
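>
> For instance, Dag's shifted-index body could be written manually along
> these lines (hypothetical, assuming i runs over axis 0 and both arrays
> share that extent):
>
>     A_tail = A[100:]
>     B_head = B[:B.shape[0] - 100]
>     for i, j, k in parallel.nditerate(A_tail, B_head):
>         A_tail[i, j, k] += B_head[i, j, k]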
>
> - Robert

