# [Numpy-discussion] Bringing order to higher dimensional operations

Nathaniel Smith njs at pobox.com
Thu Jul 18 08:52:00 EDT 2013

```Hi all,

I hadn't realized until Pauli just pointed it out that np.dot and the
new gufuncs actually have different rules for how they handle extra
axes:
https://github.com/numpy/numpy/pull/3524#issuecomment-21117601
This is turning into a big mess, and it may take some hard decisions to fix it.

Before I explain the problem, a bit of terminology: a "vectorized
operation" is built by taking an "intrinsic operation" and putting a
loop around it, to apply it many times. So for np.add for example, the
intrinsic operation is scalar addition, and if you put a loop around

The question then is, given some input arrays: which parts do you loop
over, and which things you pass to the intrinsic operation?

In the case of scalar operations (like classic ufuncs), this is pretty
straightforward: we broadcast the input arrays together, and loop over
them in parallel. So
np.add(ones((2, 3, 4)), ones((3, 4))).shape == (2, 3, 4)
The other obvious option is, instead of looping over the two arrays in
parallel, instead find all combinations. This is what the .outer
method on ufuncs does, so
np.add.outer(ones((2, 3)), ones((4, 5))).shape == (2, 3, 4, 5)

Now, that's for vectorized versions scalar operations. We also have a
bunch of vectorized operations whose "intrinsic operation" is itself a
function over multidimensional arrays. For example, reduction
operations like 'sum' intrinsically take a 1-dim array and return a
0-dim (scalar), but they can be vectorized to apply to a single axis
of a >1-dim array. Or matrix multiplication intrinsically takes two
2-dim arrays and returns a 2-dim array; it can be vectorized to apply
to >2 dim inputs.

(As shorthand I'll write "takes two 2-dim arrays and returns a 2-dim
array" as "2,2->2"; so 'sum' is 1->0 and 'cumsum' is 1->1.)

-----

Okay, now I can explain the problem. For vectorized multidimensional
operations, we have four (!) different conventions deployed:

Convention 1: Ufunc .reduce (1->0), .accumulate (1->1), .reduceat
(1,1->1): These pick the 0th axis for the intrinsic axis and loop over
the rest. (By default; can be modified by specifying axis.)
np.add.reduce(np.ones((2, 3, 4))).shape == (3, 4)
np.add.accumulate(np.ones((2, 3, 4))).shape == (2, 3, 4)

Convention 2: Reduction (1->0) and accumulation (1->1) operations
defined as top-level functions and ndarray methods (np.sum,
ndarray.sum, np.mean, np.cumprod, etc.): These flatten the array and
use the whole thing as the intrinsic axis. (By default; can be
modified by specifying axis=.)
np.sum(np.ones((2, 3, 4))).shape == ()
np.cumsum(np.ones((2, 3, 4))).shape == (24,)

Convention 3: gufuncs (any->any): These take the *last* k axes for the
intrinsic axes, and then broadcast and parallel-loop over the rest.
Cannot currently be modified.
gu_dot = np.linalg._umath_linalg.matrix_multiply # requires current master
gu_dot(np.ones((2, 3, 4, 5)), np.ones((1, 3, 5, 6))).shape == (2, 3, 4, 6)

(So what's happened here is that the gufunc pulled off the last two
axes of each array, so the intrinsic operation is always going to be a
matrix multiply of a (4, 5) array by a (5, 6) array, producing a (4,
6) array. Then it broadcast the remaining axes together: (2, 3) and
(1, 3) broadcast to (2, 3), and did a parallel iteration over them:
output[0, 0, :, :] is the result of dot(input1[0, 0, :, :], input2[0,
0, :, :]).)

Convention 4: np.dot (2->2): this one is bizarre:
np.dot(np.ones((1, 2, 10, 11)), np.ones((101, 102, 11, 12)).shape
== (1, 2, 10, 101, 102, 12)

So what it's done is picked the last two axes to be the intrinsic
axes, just like the gufunc -- so it always does a bunch of matrix
multiplies of a (10, 11) array with an (11, 12) array. But then it
didn't do a ufunc-style parallel loop. Instead it did a
ufunc.outer-style outer loop, in which it found all n^2 ways of
matching up a matrix in the first input with a matrix in the second
input, and took the dot of each. And then it packed these up into an
array with a rather weird shape: first all the looping axes from the
first input, then the first axis of the output matrix, then all the
looping axes from the second input, and then finally the second axis
of the output matrix.

-----

There are plenty of general reasons to want to simplify this -- it'd
make numpy easier to explain and understand, simplify the code, etc.
-- but also a few more specific reasons that make it urgent:

- gufuncs haven't really been used much yet, so maybe it'd be easy to
change how they work now. But in the next release, everything in
np.linalg will become a gufunc, so it'll become much harder to change
after that.

- we'd really like np.dot to become a gufunc -- in particular, in
combination with Blake's work on ufunc overrides, this would allow
np.dot() to work on scipy.sparse matrices.

- pretty soon we'll want to turn things like 'mean' into gufuncs too,
for the same reason.

-----

Okay, what to do?

The gufunc convention actually seems like the right one to me. This is
unfortunate, because it's also the only one we could easily change
:-(. But we obviously want our vectorized operations to do
broadcasting by default, both for consistency with scalar ufuncs, and
because it just seems to be the most useful convention. So that rules
out the np.dot convention. Then given that we're broadcasting, the two
options are to pick intrinsic axes from the right like gufuncs do, or
to follow the ufunc.reduce convention and pick intrinsic axes from the
left. But picking from the left seems confusing to me, because
broadcasting is itself a right-to-left operation. This doesn't matter
for .reduce and such because they only take one input, but for
something like 'dot', it means you can have 1's inserted in the
"middle" of the array, and then broadcast up to a higher dimension.
Compare:
gu_dot_leftwards(ones((10, 11, 4)), ones((11, 12, 3, 4))) -> (10, 12, 3, 4)
versus
gu_dot_rightwards(ones((4, 10, 11)), ones((3, 4, 11, 12))) -> (3, 4, 10, 12)
To me, it's easier to figure out which axes end up where in the second
case. Working from the right, we take two axes to be the intrinsic
axes, then we match up the next axis (4 matches 4), then we append a 1
and match up the last axis (1 broadcasts to match 3).

So:

QUESTION 1: does that sound right: that in a perfect world, the
current gufunc convention would be the only one, and that's what we
should work towards, at least in the cases where that's possible?

QUESTION 2: Assuming that's right, it would be *really nice* if we
could at least get np.dot onto our new convention, for consistency
with the rest of np.linalg, and to allow it to be overridden. I'm sort
of terrified to touch np.dot's API, but the only cases where it would
act differently is when *both* arguments have *3 or more dimensions*,
and I guess there are very very few users who fall into that category.
So maybe we could start raising some big FutureWarnings for this case
in the next release, and eventually switch?

(I'm even more terrified of trying to mess with np.sum or
np.add.reduce, so I'll leave that alone for now -- maybe we're just
stuck with them. And at least they do already go through the ufunc
machinery.)

-n

```