[Cython] gsoc: array expressions

Tue May 22 09:13:17 CEST 2012

On 05/22/2012 09:06 AM, Robert Bradshaw wrote:
> On Mon, May 21, 2012 at 11:57 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 05/22/2012 08:48 AM, Robert Bradshaw wrote:
>>>
>>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>
>>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>>>
>>>>>
>>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>>> <d.s.seljebotn at astro.uio.no>      wrote:
>>>>>>
>>>>>>
>>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> For my gsoc we already have some simple initial ideas, i.e.
>>>>>>> elementwise vector expressions (a + b with a and b arrays with
>>>>>>> arbitrary rank), I don't think these need any discussion. However,
>>>>>>> there are a lot of things that haven't been formally discussed on the
>>>>>>> mailing list, so here goes.
>>>>>>>
>>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>>> mailing list, and I think your insights as a Theano developer can be
>>>>>>> very helpful in this discussion.
>>>>>>>
>>>>>>> User Interface
>>>>>>> ===========
>>>>>>> Besides simple array expressions for dense arrays I would like a
>>>>>>> mechanism for "custom ufuncs", although to a different extent to what
>>>>>>> Numpy or Numba provide. There are several ways in which we could want
>>>>>>> them, e.g. as typed functions (cdef, or external C) functions, as
>>>>>>> lambas or Python functions in the same module, or as general objects
>>>>>>> (e.g. functions Cython doesn't know about).
>>>>>>> To achieve maximum efficiency it will likely be good to allow sharing
>>>>>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>>>>>> would prefer annotated def functions where the parameters are
>>>>>>> specialized on demand, e.g.
>>>>>>>
>>>>>>> @elemental
>>>>>>> def add(a, b): # elemental functions can have any number of arguments
>>>>>>> and operate on any compatible dtype
>>>>>>>      return a + b
>>>>>>>
>>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>>> arguments, the arguments perform a (broadcasted) elementwise
>>>>>>> operation. Alternatively, we can have a parallel.elementwise function
>>>>>>> which maps the function elementwise, which would also work for object
>>>>>>> callables. I prefer the former, since I think it will read much
>>>>>>> easier.
>>>>>>>
>>>>>>> Secondly, we can have a reduce function (and maybe a scan function),
>>>>>>> that reduce (respectively scan) in a specified axis or number of axes.
>>>>>>> E.g.
>>>>>>>
>>>>>>>      parallel.reduce(add, a, b, axis=(0, 2))
>>>>>>>
>>>>>>> where the default for axis is "all axes". As for the default value,
>>>>>>> this could be perhaps optionally provided to the elemental decorator.
>>>>>>> Otherwise, the reducer will have to get the default values from each
>>>>>>> dimension that is reduced in, and then skip those values when
>>>>>>> reducing. (Of course, the reducer function must be associate and
>>>>>>> commutative). Also, a lambda could be passed in instead of an
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Only associative, right?
>>>>>>
>>>>>> Sounds good to me.
>>>>>>
>>>>>>
>>>>>>> elementwise or typed cdef function.
>>>>>>>
>>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>>> function, which would iterate over N memoryviews, and provide a
>>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>>>>>>> should provide only the indices, or also the values. e.g. an inplace
>>>>>>> elementwise add would read as follows:
>>>>>>>
>>>>>>>      for i, j, k in parallel.nditerate(A, B):
>>>>>>>          A[i, j, k] += B[i, j, k]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think this sounds good; I guess don't see a particular reason for
>>>>>> "ndenumerate", I think code like the above is clearer.
>>>>>
>>>>>
>>>>>
>>>>> I'm assuming the index computations would not be re-done in this case
>>>>> (i.e. there's more magic going on here than looks like at first
>>>>> glance)? Otherwise there is an advantage to ndenumerate.
>>>>
>>>>
>>>>
>>>> Ideally, there is a lot more magic going on, though I don't know how far
>>>> Mark wants to go.
>>>>
>>>> Imagine "nditerate(A, A.T)", in that case it would have to make many
>>>> small
>>>> tiles so that for each tile being processed, A has a tile in cache and
>>>> A.T
>>>> has another tile in cache (so that one doesn't waste cache line
>>>> transfers).
>>>>
>>>> So those array lookups would potentially look up in different memory
>>>> buffers, with the strides known at compile time.
>>>
>>>
>>> Yes, being clever about the order in which to iterate over the indices
>>> is the hard problem to solve here. I was thinking more in terms of the
>>> inner loop iterating over the innermost dimension only to do the
>>> indexing (retrieval and assignment), similar to how the generic NumPy
>>> iterator works.
>>
>>
>> The point isn't only being clever about the *order*...you need "copy-in,
>> copy-out".
>>
>> The point is that the NumPy iterator is not good enough (for out-of-cache
>> situations). Since you grab a cache line (64 bytes) each time from main
>> memory, a plain NumPy broadcasted iterator throws away a lot of memory for
>> "A + A.T", since for ndim>1 there's NO iteration order which isn't bad (for
>> instance, you could iterate in the order of A, and the result would be that
>> for each element of A.T you fetch there is 64 bytes transferred).
>>
>> So the solution is to copy A.T block-wise to a temporary scratch space in
>> cache so that you use all the elements in the cache line before throwing it
>> out of cache.
>>
>> In C, I've seen a simple blocking transpose operation be over four times
>> faster than the brute-force transpose for this reason.
>
> Yes, I understand this. Truly element-wise arithmetic with arrays of
> the same memory layout (or even size) is not that uncommon though, and
> should be optimized for as well. Fortunately, I feel pretty
> comfortable sitting back and watching 'cause you've both thought about
> these issues far more than I and I don't see you both getting it wrong
> :).

Sorry for being an such an annoying know-it-all, it just seemed from 
your comment like you didn't know :-)

Dag