[Cython] gsoc: array expressions

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue May 22 08:57:51 CEST 2012

On 05/22/2012 08:57 AM, Dag Sverre Seljebotn wrote:
> On 05/22/2012 08:48 AM, Robert Bradshaw wrote:
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>> On 05/20/2012 04:03 PM, mark florisson wrote:
>>>>>> Hey,
>>>>>> For my gsoc we already have some simple initial ideas, i.e.
>>>>>> elementwise vector expressions (a + b with a and b arrays with
>>>>>> arbitrary rank), I don't think these need any discussion. However,
>>>>>> there are a lot of things that haven't been formally discussed on the
>>>>>> mailing list, so here goes.
>>>>>> Frédéric, I am CCing you since you expressed interest on the numpy
>>>>>> mailing list, and I think your insights as a Theano developer can be
>>>>>> very helpful in this discussion.
>>>>>> User Interface
>>>>>> ===========
>>>>>> Besides simple array expressions for dense arrays I would like a
>>>>>> mechanism for "custom ufuncs", although to a different extent to what
>>>>>> Numpy or Numba provide. There are several ways in which we could want
>>>>>> them, e.g. as typed functions (cdef, or external C) functions, as
>>>>>> lambas or Python functions in the same module, or as general objects
>>>>>> (e.g. functions Cython doesn't know about).
>>>>>> To achieve maximum efficiency it will likely be good to allow sharing
>>>>>> these functions in .pxd files. We have 'cdef inline' functions, but I
>>>>>> would prefer annotated def functions where the parameters are
>>>>>> specialized on demand, e.g.
>>>>>> @elemental
>>>>>> def add(a, b): # elemental functions can have any number of arguments
>>>>>> and operate on any compatible dtype
>>>>>> return a + b
>>>>>> When calling cdef functions or elemental functions with memoryview
>>>>>> arguments, the arguments perform a (broadcasted) elementwise
>>>>>> operation. Alternatively, we can have a parallel.elementwise function
>>>>>> which maps the function elementwise, which would also work for object
>>>>>> callables. I prefer the former, since I think it will read much
>>>>>> easier.
>>>>>> Secondly, we can have a reduce function (and maybe a scan function),
>>>>>> that reduce (respectively scan) in a specified axis or number of
>>>>>> axes.
>>>>>> E.g.
>>>>>> parallel.reduce(add, a, b, axis=(0, 2))
>>>>>> where the default for axis is "all axes". As for the default value,
>>>>>> this could be perhaps optionally provided to the elemental decorator.
>>>>>> Otherwise, the reducer will have to get the default values from each
>>>>>> dimension that is reduced in, and then skip those values when
>>>>>> reducing. (Of course, the reducer function must be associate and
>>>>>> commutative). Also, a lambda could be passed in instead of an
>>>>> Only associative, right?
>>>>> Sounds good to me.
>>>>>> elementwise or typed cdef function.
>>>>>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>>>>>> function, which would iterate over N memoryviews, and provide a
>>>>>> sensible memory access pattern (like numpy.nditer). I'm not sure
>>>>>> if it
>>>>>> should provide only the indices, or also the values. e.g. an inplace
>>>>>> elementwise add would read as follows:
>>>>>> for i, j, k in parallel.nditerate(A, B):
>>>>>> A[i, j, k] += B[i, j, k]
>>>>> I think this sounds good; I guess don't see a particular reason for
>>>>> "ndenumerate", I think code like the above is clearer.
>>>> I'm assuming the index computations would not be re-done in this case
>>>> (i.e. there's more magic going on here than looks like at first
>>>> glance)? Otherwise there is an advantage to ndenumerate.
>>> Ideally, there is a lot more magic going on, though I don't know how far
>>> Mark wants to go.
>>> Imagine "nditerate(A, A.T)", in that case it would have to make many
>>> small
>>> tiles so that for each tile being processed, A has a tile in cache
>>> and A.T
>>> has another tile in cache (so that one doesn't waste cache line
>>> transfers).
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
>> Yes, being clever about the order in which to iterate over the indices
>> is the hard problem to solve here. I was thinking more in terms of the
>> inner loop iterating over the innermost dimension only to do the
>> indexing (retrieval and assignment), similar to how the generic NumPy
>> iterator works.
> The point isn't only being clever about the *order*...you need "copy-in,
> copy-out".
> The point is that the NumPy iterator is not good enough (for
> out-of-cache situations). Since you grab a cache line (64 bytes) each
> time from main memory, a plain NumPy broadcasted iterator throws away a
> lot of memory for "A + A.T", since for ndim>1 there's NO iteration order
> which isn't bad (for instance, you could iterate in the order of A, and
> the result would be that for each element of A.T you fetch there is 64
> bytes transferred).

I meant, "throws away a lot of memory *bandwidth*".


> So the solution is to copy A.T block-wise to a temporary scratch space
> in cache so that you use all the elements in the cache line before
> throwing it out of cache.
> In C, I've seen a simple blocking transpose operation be over four times
> faster than the brute-force transpose for this reason.
> Dag
>>> Which begs the question: What about this body?
>>> if i< 100:
>>> continue
>>> else:
>>> A[i, j, k] += B[i - 100, j, k]
>>> I guess just fall back to a non-tiled version? One could of course do
>>> some
>>> shifting of which tiles of B to grab etc., but there's a limit to how
>>> smart
>>> one should try to be; one could emit a warning and say that one
>>> should slice
>>> and dice the memoryviews into shape before they are passed to nditerate.
>> Linear transformations of the index variables could probably be
>> handled, but that's certainly not v1 (and not too difficult for the
>> user to express manually).
>> - Robert
>> _______________________________________________
>> cython-devel mailing list
>> cython-devel at python.org
>> http://mail.python.org/mailman/listinfo/cython-devel

More information about the cython-devel mailing list