[Numpy-discussion] @ operator

Charles R Harris charlesr.harris at gmail.com
Thu Sep 11 22:03:03 EDT 2014


On Thu, Sep 11, 2014 at 7:47 PM, Charles R Harris <charlesr.harris at gmail.com
> wrote:

>
>
> On Thu, Sep 11, 2014 at 10:10 AM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>> On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> On Wed, Sep 10, 2014 at 4:53 PM, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>> >
>>> > On Wed, Sep 10, 2014 at 10:52 AM, Pauli Virtanen <pav at iki.fi> wrote:
>>> >>
>>> >> 09.09.2014, 22:52, Charles R Harris wrote:
>>> >> >   1. Should the operator accept array_like for one of the arguments?
>>> >> >   2. Does it need to handle __numpy_ufunc__, or will
>>> >> >   __array_priority__ serve?
>>> >>
>>> >> I think the __matmul__ operator implementation should follow that of
>>> >> __mul__.
>>> >>
>>> >> [clip]
>>> >> >    3. Do we want PyArray_Matmul in the numpy API?
>>> >> >    4. Should a matmul function be supplied by the multiarray module?
>>> >> >
>>> >> > If 3 and 4 are wanted, should they use the __numpy_ufunc__
>>> >> > machinery, or will __array_priority__ serve?
>>> >>
>>> >> The dot() function deals with __numpy_ufunc__, and the matmul()
>>> >> function should behave similarly.
>>> >>
>>> >> It seems dot() uses __array_priority__ to select the subclass of the
>>> >> return value, so matmul() probably needs to do the same thing.
>>> >>
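A minimal sketch of the behavior described here: dot() consults
__array_priority__ when choosing the subclass of its return value. MyArray
is purely illustrative:

import numpy as np

class MyArray(np.ndarray):
    # Higher than ndarray's default priority of 0.0.
    __array_priority__ = 10.0

a = np.ones((2, 2))
m = np.ones((2, 2)).view(MyArray)

# dot() returns the subclass of the higher-priority operand.
print(type(np.dot(a, m)))   # <class '__main__.MyArray'>
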
>>> >> > Note that the numeric operators, __add__ and such, currently use
>>> >> > __numpy_ufunc__ in combination with __array_priority__, in addition
>>> >> > to the fact that they are by default implemented with ufuncs that do
>>> >> > the same. I'd rather that the __*__ operators simply rely on
>>> >> > __array_priority__.
>>> >>
>>> >> The whole business of __array_priority__ and __numpy_ufunc__ in the
>>> >> binary ops is solely about when __<op>__ should yield the execution to
>>> >> __r<op>__ of the other object.
>>> >>
>>> >> The rule of operation currently is: "__rmul__ before __numpy_ufunc__"
>>> >>
>>> >> If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__
>>> >> before __numpy_ufunc__, except if __array_priority__ happens to be
>>> >> smaller than that of the other class and your class is not an ndarray
>>> >> subclass".
>>> >>
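A minimal sketch of the deferral rule above: a non-ndarray class that
advertises a higher __array_priority__ and defines __rmul__ gets the binop
handed to it instead of the ufunc. Duck is purely illustrative:

import numpy as np

class Duck(object):
    # Higher than ndarray's default priority (0.0), so ndarray.__mul__
    # returns NotImplemented and Python falls back to Duck.__rmul__.
    __array_priority__ = 1000.0

    def __rmul__(self, other):
        return "Duck.__rmul__ got control"

a = np.arange(3)
print(a * Duck())   # -> Duck.__rmul__ got control
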
>>> >> The following binops also do not, IIRC, respect __array_priority__
>>> >> when preferring the right-hand operand:
>>> >>
>>> >> - in-place operations
>>> >> - comparisons
>>> >>
>>> >> One question here is whether it's possible to change the behavior of
>>> >> __array_priority__ here at all, or whether changes are possible only
>>> >> in the context of adding new attributes telling Numpy what to do.
>>> >
>>> > I was tempted to make it a generalized ufunc, which would take care of
>>> > a lot of things, but there is a lot of overhead in those functions.
>>> > Sounds like the easiest thing is to make it similar to dot, although
>>> > having an in-place version complicates the type selection a bit.
>>>
>>> Can we please fix the overhead instead of adding more half-complete
>>> implementations of the same concepts? I feel like this usually ends up
>>> slowing things down in the end, as optimization efforts get divided...
>>>
>>> My vote is:
>>>
>>> __matmul__/__rmatmul__ do the standard dispatch stuff that all __op__
>>> methods do (so I guess check __array_priority__ or whatever it is we
>>> always do). I'd also be okay with ignoring __array_priority__ on the
>>> grounds that __numpy_ufunc__ is better, and there's no existing code
>>> relying on __array_priority__ support in __matmul__.
>>>
>>> Having decided that we are actually going to run, they dispatch
>>> unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the
>>> in-place version), similarly to how e.g. __add__ dispatches to np.add.
>>>
>>> newdot acts like a standard gufunc with all the standard niceties,
>>> including __numpy_ufunc__ dispatch.
>>>
>>> ("newdot" here is intended as a placeholder name, maybe it should be
>>> np.linalg.matmul or something else to be bikeshed later. I also vote
>>> that eventually 'dot' become an alias for this function, but whether
>>> to do that is an orthogonal discussion for later.)
>>>
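A rough sketch of the dispatch being proposed, keeping Nathaniel's
placeholder name newdot (aliased to np.dot here only so the sketch runs;
_should_defer is a simplified stand-in for the real deferral checks, not
NumPy's actual logic):

import numpy as np

newdot = np.dot   # stand-in for the proposed gufunc-style function

def _should_defer(self, other):
    # Simplified placeholder for the usual binop deferral logic
    # (__array_priority__ / __numpy_ufunc__ checks).
    return (not isinstance(other, np.ndarray)
            and hasattr(other, '__rmatmul__')
            and getattr(other, '__array_priority__', 0.0) > 0.0)

class Arr(np.ndarray):
    def __matmul__(self, other):
        if _should_defer(self, other):
            return NotImplemented
        return newdot(self, other)

    def __rmatmul__(self, other):
        return newdot(other, self)

    def __imatmul__(self, other):
        # In-place version; a real implementation would need to worry
        # about memory overlap between the inputs and out.
        return newdot(self, other, out=self)
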
>> If we went the ufunc route, I think we would want three of them:
>> matxvec, vecxmat, and matxmat, because the best inner loops would be
>> different in the three cases. They couldn't be straight ufuncs themselves,
>> as we don't need the other options, `reduce`, etc., but they also can't be
>> exactly like the linalg machinery, because we do want subclasses to be
>> able to override. Hmm...
>>
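For reference, a sketch of the three signatures in the generalized-ufunc
notation, with einsum used as a stand-in for the three inner loops (the
names are just the ones from the paragraph above):

import numpy as np

# matxmat: (m,n),(n,p)->(m,p)
# matxvec: (m,n),(n)->(m)
# vecxmat: (n),(n,p)->(p)

A = np.ones((3, 4))
B = np.ones((4, 5))
v = np.ones(4)

matxmat = np.einsum('...mn,...np->...mp', A, B)   # shape (3, 5)
matxvec = np.einsum('...mn,...n->...m', A, v)     # shape (3,)
vecxmat = np.einsum('...n,...np->...p', v, B)     # shape (5,)
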
>> The ufunc machinery has some funky aspects. For instance, there are
>> hardwired checks for `__radd__` and other such operators in
>> PyUFunc_GenericFunction that allow subclasses to override the ufunc. Those
>> checks should really be part of the PyUFuncObject.
>>
>
> Here are some timing results for the inner product of an integer array,
> comparing the inner, ufunc (inner1d), and einsum implementations. The
> np.long type was chosen because it is implemented for inner1d and because
> it avoids the cblas path.
>
> In [43]: a = ones(100000, dtype=np.long)
>
> In [44]: timeit inner(a,a)
> 10000 loops, best of 3: 55.5 µs per loop
>
> In [45]: timeit inner1d(a,a)
> 10000 loops, best of 3: 56.2 µs per loop
>
> In [46]: timeit einsum('...i,...i',a,a)
> 10000 loops, best of 3: 43.8 µs per loop
>
> In [47]: a = ones(1, dtype=np.long)
>
> In [48]: timeit inner(a,a)
> 1000000 loops, best of 3: 484 ns per loop
>
> In [49]: timeit inner1d(a,a)
> 1000000 loops, best of 3: 741 ns per loop
>
> In [50]: timeit einsum('...i,...i',a,a)
> 1000000 loops, best of 3: 811 ns per loop
>
> For big arrays, einsum has a better inner loop. For small arrays, einsum
> and inner1d suffer from call overhead. Note that the einsum overhead could
> be reduced by special-casing, and that the various loops could be
> consolidated into a best-of-breed implementation.
>
> The ufunc implementation is easy to do for all cases.
>
>
Same thing for doubles. The speedup due to cblas can be seen, and the
iterator overhead becomes more visible.

In [51]: a = ones(100000, dtype=np.double)

In [52]: timeit inner(a,a)
10000 loops, best of 3: 32.1 µs per loop

In [53]: timeit inner1d(a,a)
10000 loops, best of 3: 134 µs per loop

In [54]: timeit einsum('...i,...i',a,a)
10000 loops, best of 3: 42 µs per loop

In [55]: a = ones(1, dtype=np.double)

In [56]: timeit inner(a,a)
1000000 loops, best of 3: 336 ns per loop

In [57]: timeit inner1d(a,a)
1000000 loops, best of 3: 742 ns per loop

In [58]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 809 ns per loop

But stacked vectors make a difference:

In [59]: a = ones((2, 1), dtype=np.double)

In [60]: timeit for i in range(2): inner(a[i],a[i])
1000000 loops, best of 3: 1.39 µs per loop

In [61]: timeit inner1d(a,a)
1000000 loops, best of 3: 749 ns per loop

In [62]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 888 ns per loop
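
For anyone who wants to replay these numbers outside IPython, a
self-contained version of the stacked-vector comparison (inner1d lives in
numpy.core.umath_tests in current NumPy; absolute timings will of course
vary by machine):

import timeit

setup = ("import numpy as np\n"
         "from numpy.core.umath_tests import inner1d\n"
         "a = np.ones((2, 1), dtype=np.double)\n")

for stmt in ("for i in range(2): np.inner(a[i], a[i])",
             "inner1d(a, a)",
             "np.einsum('...i,...i', a, a)"):
    best = min(timeit.repeat(stmt, setup=setup, number=100000, repeat=3))
    print("%-40s %.3f us per loop" % (stmt, best / 100000 * 1e6))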

Chuck