[Numpy-discussion] [help needed] associativity and precedence of '@'

Nathaniel Smith njs at pobox.com
Mon Mar 17 20:54:13 EDT 2014


On Sat, Mar 15, 2014 at 6:28 PM, Nathaniel Smith <njs at pobox.com> wrote:
> Mathematica: instead of having an associativity, a @ b @ c gets
> converted into mdot([a, b, c])

So, I've been thinking about this (thanks to @rfateman for pointing it
out), and wondering if Mathematica's approach is worth following up
more. (It would need to make it past python-dev, of course, but worst
case is just that they say no and we're back where we are now, so we
might as well think it through.)

Here's how it would work:

Currently Python has 3 different kinds of ops: left-associative (most
of them), right-associative (**), and "chaining". Chaining is used for
comparison ops. Example:

   a < b < c

gets parsed to something like

   do_comparison(args=[a, b, c], ops=[lt, lt])

Notice this is very different from either of

  (a < b) < c
  a < (b < c)

Which means that comparisons aren't left- OR right-associative,
they're this other thing, "chaining".
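
For concreteness, here's a rough executable sketch of that
do_comparison pseudo-call (it captures the short-circuiting of the
comparisons, though real chaining also delays evaluating the later
operands, which a plain function can't do):

import operator

def do_comparison(args, ops):
    # a < b < c behaves like "a < b and b < c", with b evaluated once.
    for left, op, right in zip(args, ops, args[1:]):
        if not op(left, right):
            return False
    return True

do_comparison([1, 2, 3], [operator.lt, operator.lt])  # True, like 1 < 2 < 3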

So we could propose adding a 4th kind of op, called "grouping", which
would apply only to @. And the idea is that

  a @ b @ c

would be equivalent to

  operator.matmul((a, b, c))

which eventually (see below) becomes a call to

  a.__matmul__((a, b, c))

We'd use exactly the same parsing rules as the chaining ops, so you
can still control evaluation order with parentheses if you want:

  a @ (b @ c) -> matmul((a, matmul((b, c))))
  (a @ b) @ c -> matmul((matmul((a, b)), c))

...but if you don't specify, then each contiguous group of @ operators
gets collected up and handed to __matmul__ together, and the
__matmul__ implementation gets to decide which evaluation strategy to
use.

It's trivially fast for the computer to figure out the best evaluation
order for matrix multiplication, so in practice I think this would
mean that you could just stop worrying about parentheses for multiple
contiguous calls to matmul. Fancier versions of __matmul__ defined on
more specialized non-ndarray classes might even take into account
their specialized knowledge of how expensive different evaluation
orders are for their specific type -- I'm not sure if this actually
happens in practice, but it might. (Like maybe the best way to
evaluate a @ b @ c depends on the sparsity pattern in the various
matrices, or maybe it depends on which matrices are currently on the
GPU and which are in memory? Anyone have any real examples of this?)
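
To back up the "trivially fast" claim: picking the best order is the
classic matrix chain multiplication problem, solvable by a small
dynamic program in O(n^3) time for n matrices, which is negligible
next to the multiplications themselves. A minimal illustrative sketch
(it computes just the optimal flop count; recording the split points
to recover the actual order is the standard extension):

def matmul_cost(dims):
    # A chain of n matrices is described by n+1 integers: matrix i
    # has shape (dims[i], dims[i+1]).
    n = len(dims) - 1
    # cost[i][j] = cheapest flop count for multiplying matrices i..j
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j))
    return cost[0][n - 1]

matmul_cost([10, 100, 5, 50])  # 7500: (a @ b) @ c beats a @ (b @ c)'s 75000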

(Of course, this same evaluation-order problem arises for *all*
expressions using numpy; being able to optimize whole expressions at a
time is exactly what numexpr/numba/theano/etc. are useful for. So one
could argue that "baking it in" to @ is pointless, if anyone gets
tired of writing parentheses they should just use one of these
libraries. Or maybe evaluation order problems arise so rarely for @
that no-one cares. But OTOH it would be so nice for once to just have
a single best solution -- "always use @ and be happy, it just works"
-- instead of all the caveats we normally do -- "@ is good in some
cases, but in other cases mdot is better, or if you know you can just
use @ with the right parentheses...".)

Of course, we still have to say something about what value a @ b @ c
actually computes. In the PEP semantics, it isn't always associative
-- specifically not if we do Mat @ vec @ Mat. So in this approach, we
still need to decide what
  matmul((Mat, vec, Mat))
should return.

But, this is actually a feature! Because obviously what *should* be
returned in this case is *not* (Mat @ vec) @ Mat, *or* Mat @ (vec @
Mat). Both of those answers are terrible; it's just, if you have an
ordinary left-/right-associative operator, those are your only
options. What *should* be returned is an error. And in this scheme we
get to see the whole @ expression at once, so we actually can raise an
error for such things.
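
Just to illustrate how that could look (this is not a proposed spec,
and check_matmul_group is a hypothetical helper):

def check_matmul_group(args):
    # A 1-d operand in the *interior* of a group is ambiguous, since
    # (Mat @ vec) @ Mat and Mat @ (vec @ Mat) compute different (and
    # equally unintended) things.
    for i, operand in enumerate(args):
        if getattr(operand, "ndim", 2) == 1 and 0 < i < len(args) - 1:
            raise TypeError(
                "ambiguous @ group: 1-d operand at position %d" % i)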

So, this approach possibly has nicer performance characteristics, and
also possibly nicer semantics.

Now, how would this look in terms of the language definition?

As far as the parser and AST go, this would use exactly the same rules
as the chaining ops, so that's easy.

Having parsed, we must evaluate. Likely the most contentious part of
this approach is that we now have an n-arg operator, so the standard
__X__/__rX__ dichotomy won't work; we need something like multiple
dispatch. I haven't followed the debate on this issue in detail, but
what I'd propose for this limited context is not to do anything like
"real" multiple dispatch, but just to directly generalize the familiar
__rX__ rule to n arguments. The __rX__ rule is how Python's existing
binary operators work: to evaluate a # b, you usually try a.__foo__
first and then b.__rfoo__, EXCEPT that if type(b) is a proper subclass
of type(a), you try b first. Generalized to >2 arguments, this looks
like:

def matmul(args):  # stands in for operator.matmul(args)
    candidates = list(args)
    while candidates:
        candidate = pop_next_candidate(candidates)
        if hasattr(candidate, "__matmul__"):
            # Every candidate sees the full argument tuple, not just
            # its neighbors.
            result = candidate.__matmul__(args)
            if result is not NotImplemented:
                return result
    raise TypeError("unsupported operand type(s) for @")

def pop_next_candidate(candidates):
    classes = [c.__class__ for c in candidates]
    # We'll try the left-most remaining candidate...
    for i in range(len(candidates)):
        # ...unless there's a later, untried candidate that's a proper subclass.
        if not has_proper_subclass(classes[i], classes):
            return candidates.pop(i)
    assert False

def has_proper_subclass(class_, other_classes):
    for other_class in other_classes:
        if (issubclass(other_class, class_)
            and not issubclass(class_, other_class)):
            return True
    return False

...which, it turns out, is exactly the lookup rule that
__numpy_ufunc__ will use, so at least it isn't too weird from our
point of view:
    http://docs.scipy.org/doc/numpy-dev/reference/arrays.classes.html#numpy.class.__numpy_ufunc__
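
Using the matmul() sketch above, here's how the generalized rule plays
out when a subclass instance appears later in the group:

class Base(object):
    def __matmul__(self, args):
        return "Base handled %d operands" % len(args)

class Special(Base):
    def __matmul__(self, args):
        return "Special handled %d operands" % len(args)

matmul((Base(), Special(), Base()))
# -> "Special handled 3 operands": the proper-subclass instance gets
#    first crack, even though it isn't the leftmost operand.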

There are still plenty of details to think about, e.g.:

What does the in-place operator do? I think
  a @= b @ c
would have to be the same as
  a = a @ (b @ c)
and NOT
  a = a @ b @ c
because otherwise what do you do with
  a @= b @ c + d
where the right-hand side isn't even a bare @-group?

I'm not sure how this would interact with implementing np.dot (which
we'd still need for its out= argument, and will I guess do dispatch
through the __numpy_ufunc__ mechanism?). We should probably work
through this in detail.

We'd still need to pick a precedence for @: grouping is different from
left-associativity, so same-group is out, and we'd have to pick either
tight-group or weak-group. My gut feeling is that tight-group makes
more sense than weak-group, because if @ is a magical thing that
collects up a group of items, then it is helpful if there's a simple
visual mapping where the group starts just to the left of the first @
and extends just to the right of the last @.
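
For concreteness (just illustrating the two options, not proposing
either): under tight-group,

  a * b @ c @ d * e  ->  a * matmul((b, c, d)) * e

while under weak-group,

  a * b @ c @ d * e  ->  matmul((a * b, c, d * e))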

I'm still not at all sure this rigmarole is worth it -- I still think
we need some data on how often people chain together multiple @ or
np.dot calls.

Still, I thought I'd throw this out here and see what people think of it.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


