[Numpy-discussion] miniNEP1: where= argument for ufuncs

Nathaniel Smith njs at pobox.com
Wed Jul 6 15:26:24 EDT 2011


Here's the master copy:
  https://gist.github.com/1068056

But for your commenting convenience, I'll include the current text here:

############################################
A mini-NEP for the where= argument to ufuncs
############################################

To try and make more progress on the whole missing values/masked
arrays/... debate, it seems useful to have a more technical discussion
of the pieces which we *can* agree on. This is the first, which
attempts to nail down the details of the new ``where=`` argument to
ufuncs.

*********
Rationale
*********

It is often useful to apply operations to a subset of your data, and
numpy provides a rich interface for accomplishing this, by combining
indexing operations with ufunc operations, e.g.::

  a[10, mymask] += b
  np.sum(a[which_indices], axis=0)

But any kind of complex indexing necessarily requires making a
temporary copy of (parts of) the underlying array, which can be quite
expensive, and this copying could be avoided by teaching the ufunc
loop to "index as it goes".

There are strong arguments against doing this. There are tons of cases
like this where one can save some memory by avoiding temporaries, and
we can't build them all into the core -- especially since we also have
more general solutions like numexpr or writing optimized routines in
C/Fortran/Cython. Furthermore, this case is a clear violation of
orthogonality -- we already have indexing and ufuncs as separate
things, so adding a second, somewhat crippled implementation of
indexing to ufuncs themselves is a bit ugly. (It would be better if we
could make sure that anything that could be passed to
ndarray.__getitem__ could also be passed to ufuncs with the same
semantics, but this would require substantial refactoring and seems
unlikely to be implemented any time soon.)

However, the simple boolean-mask case is common enough, and the cost
of the temporary copies high enough, that a limited ``where=``
argument still seems worth supporting directly.

***
API
***

A new optional, keyword argument named ``where=`` will be added to all ufuncs.

--------------
Error checking
--------------

If given, this argument must be a boolean array. If ``f`` is a ufunc,
then given a function call like::

  f(a, b, where=mymask)

the following occurs. First, ``mymask`` is coerced to an array if
necessary, but no type conversion is performed. (I.e., we do
``np.asarray(mymask)``.) Next, we check whether ``mymask`` is a
boolean array. If it is not, then we raise an exception. (In the
future it would be nice to support other forms of indexing as well,
such as lists of slices or arrays of integer indices. In order to
preserve this option, we do not want to coerce integers into
booleans.) Next, ``a`` and ``b`` are broadcast against each other,
just as now; this determines the shape of the output array. Then
``mymask`` is broadcast to match this output array shape. (The shape
of the output array cannot be changed by this process -- for example,
having ``a.shape == (10, 1, 1)``, ``b.shape == (1, 10, 1)``,
``mymask.shape == (1, 1, 10)`` will raise an error rather than
returning a new array with shape ``(10, 10, 10)``.)
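
The checks above can be sketched in plain NumPy. This is a hypothetical helper, not part of the proposal itself; the exception types and the use of ``np.broadcast``/``np.broadcast_to`` to express the shape rule are my assumptions::

```python
import numpy as np

def check_where_arg(mymask, *inputs):
    # Coerce to an array without type conversion (np.asarray), then
    # require a boolean dtype -- integers are rejected to keep the
    # door open for integer fancy indexing later.
    mymask = np.asarray(mymask)
    if mymask.dtype != np.bool_:
        raise TypeError("where= must be a boolean array")
    # The inputs alone determine the output shape...
    out_shape = np.broadcast(*inputs).shape
    # ...and the mask must broadcast to that shape without enlarging it.
    if np.broadcast(mymask, np.empty(out_shape)).shape != out_shape:
        raise ValueError("where= cannot change the output shape")
    return np.broadcast_to(mymask, out_shape)

a = np.zeros((10, 1, 1))
b = np.zeros((1, 10, 1))
m = np.zeros((1, 1, 10), dtype=bool)
# a and b alone give output shape (10, 10, 1); the (1, 1, 10) mask
# would enlarge it to (10, 10, 10), so check_where_arg(m, a, b) raises.
```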

-----------------------------
Semantics: ufunc ``__call__``
-----------------------------

When simply calling a ufunc with an output argument, e.g.::

  f(a, b, out=c, where=mymask)

then the result is equivalent to::

  c[mymask] = f(a[mymask], b[mymask])

On the other hand, if no output argument is given::

  f(a, b, where=mymask)

then an output array is instantiated as if by calling
``np.empty(shape, dtype=dtype)``, and then treated as above::

  c = np.empty(shape_for(a, b), dtype=dtype_for(f, a, b))
  f(a, b, out=c, where=mymask)
  return c

Note that this means that the output will, in general, contain
uninitialized values.
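
These ``__call__`` semantics can be emulated with a small wrapper. ``masked_call`` is a hypothetical name, and ``np.result_type`` only approximates real ufunc dtype resolution; the point is just to make the ``c[mymask] = f(a[mymask], b[mymask])`` equivalence concrete::

```python
import numpy as np

def masked_call(f, a, b, out=None, *, where):
    # Hypothetical emulation of the proposed f(a, b, out=..., where=...).
    a, b = np.asarray(a), np.asarray(b)
    mask = np.broadcast_to(np.asarray(where), np.broadcast(a, b).shape)
    if out is None:
        # Fresh np.empty output: entries where the mask is False stay
        # uninitialized, as the proposal specifies.
        out = np.empty(mask.shape, dtype=np.result_type(a, b))
    A = np.broadcast_to(a, mask.shape)
    B = np.broadcast_to(b, mask.shape)
    out[mask] = f(A[mask], B[mask])
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = np.zeros(3)
masked_call(np.add, a, b, out=c, where=np.array([True, False, True]))
# c is now [11.0, 0.0, 33.0] -- the masked-out entry keeps its old value
```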

----------------------------
Semantics: ufunc ``.reduce``
----------------------------

Take an expression like::

  f.reduce(a, axis=0, where=mymask)

This performs the given reduction operation along each column of
``a``, but simply skips any elements where the corresponding entry in
``mymask`` is False. (For ufuncs which have an identity, this is
equivalent to treating the given elements as if they were the
identity.) For example, if ``a`` is a 2-dimensional array and skipping
over the details of broadcasting, dtype selection, etc., the above
operation produces the same result as::

  out = np.empty(a.shape[1])
  for i in xrange(a.shape[1]):
    out[i] = f.reduce(a[mymask[:, i], i])
  return out
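
To make the identity equivalence concrete: for a ufunc with an identity (0 for ``np.add``), filling the masked-out positions with that identity and doing an ordinary reduction gives the same answer as the loop above. A sketch, with made-up data::

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
mymask = np.array([[True, False],
                   [False, True],
                   [True, True]])

# Loop form, mirroring the pseudocode above:
out = np.empty(a.shape[1])
for i in range(a.shape[1]):
    out[i] = np.add.reduce(a[mymask[:, i], i])

# Identity-fill form: replace masked-out entries with the ufunc's
# identity, then do an ordinary axis-0 reduction.
filled = np.where(mymask, a, np.add.identity)
assert np.array_equal(out, np.add.reduce(filled, axis=0))
# out == [6.0, 10.0]
```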

--------------------------------
Semantics: ufunc ``.accumulate``
--------------------------------

Accumulation is similar to reduction, except that ``.accumulate``
saves the intermediate values generated during the reduction loop.
Therefore we use the same semantics as for ``.reduce`` above. If ``a``
is 2-d etc., then this expression::

  f.accumulate(a, axis=0, where=mymask)

is equivalent to::

  out = np.empty(a.shape)
  for i in xrange(a.shape[1]):
    out[mymask[:, i], i] = f.accumulate(a[mymask[:, i], i])
  return out

Notice that once again, elements of ``out`` which correspond to False
entries in the mask are left uninitialized.
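
A concrete run of these ``.accumulate`` semantics, using a ``-1`` sentinel fill instead of ``np.empty`` purely so the "uninitialized" slots are visible (the proposal itself leaves them arbitrary)::

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
mymask = np.array([[True, False],
                   [False, True],
                   [True, True]])

# Sentinel fill: -1.0 marks entries the proposal leaves uninitialized.
out = np.full(a.shape, -1.0)
for i in range(a.shape[1]):
    # Accumulate only the selected elements of each column, and
    # scatter the partial sums back to the selected positions.
    out[mymask[:, i], i] = np.add.accumulate(a[mymask[:, i], i])
# out == [[ 1., -1.],
#         [-1.,  4.],
#         [ 6., 10.]]
```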

------------------------------
Semantics: ufunc ``.reduceat``
------------------------------

I've never used ``.reduceat``, and 30 seconds staring at the
documentation was not sufficient to wrap my head around it. It's not
obvious to me that a ``where=`` argument even makes sense here; if
not, we could simply say that it is not supported.

---------------------------
Semantics: ufunc ``.outer``
---------------------------

This is more complicated -- ``.outer`` takes two arrays, which do not
necessarily have the same shape. Therefore we need to also accept two
``where=`` arguments, which I tentatively and uncreatively propose be
called ``where1=`` and ``where2=``. So we have::

  f.outer(a, b, where1=mymask_for_a, where2=mymask_for_b)

This produces an output array with shape ``a.shape + b.shape``. An
output entry is computed by ``f`` only when the corresponding entry
in ``mymask_for_a`` *and* the corresponding entry in ``mymask_for_b``
are both True; all other entries are left uninitialized.
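
For 1-d inputs, this behaviour can be sketched by compressing each input with its mask and scattering the result back with ``np.ix_``. ``masked_outer`` is a hypothetical helper using the ``where1``/``where2`` names proposed above, and the ``-1`` sentinel fill is illustrative only::

```python
import numpy as np

def masked_outer(f, a, b, where1, where2):
    # Hypothetical emulation of f.outer(a, b, where1=..., where2=...),
    # for 1-d a and b only.
    out = np.full(a.shape + b.shape, -1.0)  # sentinel instead of np.empty
    # Only positions where BOTH masks are True get computed; np.ix_
    # turns the two 1-d masks into a cross-product index.
    out[np.ix_(where1, where2)] = f.outer(a[where1], b[where2])
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])
out = masked_outer(np.multiply, a, b,
                   where1=np.array([True, False, True]),
                   where2=np.array([True, True]))
# out == [[10., 20.],
#         [-1., -1.],
#         [30., 60.]]
```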

********************
Unresolved questions
********************

Does it make sense to support ``where=`` in ``.reduceat`` operations?

Is there any less stupid-looking name than ``where1=`` and ``where2=``
for the ``.outer`` operation? (For that matter, can ``.outer`` be
applied to more than 2 arrays? The docs say it can't, but it's
perfectly well-defined for an arbitrary number of arrays too, so
maybe we want an interface that allows for 3-way, 4-way etc.
``.outer`` operations in the future?)

How does this interact with iterator support (e.g., the new
``nditer``)? A section should be added, but I'm not the one to write
it.

-- Nathaniel


