[Numpy-discussion] in the NA discussion, what can we agree on?

Fri Nov 4 02:07:48 EDT 2011

On Thu, Nov 3, 2011 at 7:54 PM, Gary Strangman
<strang at nmr.mgh.harvard.edu> wrote:
> For the non-destructive+propagating case, do I understand correctly that
> this would mean I (as a user) could temporarily decide to IGNORE certain
> portions of my data, perform a series of computation on that data, and the
> IGNORED flag (or however it is implemented) would be propagated from
> computation to computation? If that's the case, I suspect I'd use it all
> the time ... to effectively perform data subsetting without generating
> (partial) copies of large datasets. But maybe I misunderstand the
> intended notion of propagation ...

I *think* it's more subtle than that, but I admit I'm somewhat
confused about how exactly people would want IGNORED to work in
various corner cases. (This is another part of why figuring out our
audience/use-cases seems like an important first step to me...
fortunately the semantics for MISSING are, I think, much more clear.)

Say we have
  >>> a = np.array([1, IGNORED(2), 3])
  >>> b = np.array([10, 20, 30])
(Here I'm using IGNORED(2) to mean a value that is currently
ignored, but if you unmasked it, it would have the value 2.)

Then we have:

# non-propagating *or* propagating, doesn't matter:
>>> a + 2
[3, IGNORED(2), 5]

# non-propagating:
>>> a + b
One of these, I don't know which:
  [11, IGNORED(2), 33]  # numpy.ma chooses this
  [11, 20, 33]
  "Error: shape mismatch"

(An error is maybe the most *consistent* option; the suggestion in the
alterNEP was that masks had to match on all axes that were *not*
broadcast, so a + 2 and a + a are okay, but a + b is an error. I
assume the numpy.ma approach is also useful, but note that it has the
surprising effect that addition is not commutative: IGNORED(x) +
IGNORED(y) = IGNORED(x). Try it:
   import numpy as np
   masked1 = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
   masked2 = np.ma.masked_array([10, 20, 30], mask=[False, True, False])
   np.asarray(masked1 + masked2)  # [11, 2, 33]: hidden value taken from masked1
   np.asarray(masked2 + masked1)  # [11, 20, 33]: hidden value taken from masked2
I don't really know what people would prefer.)
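
For comparison, here is a minimal sketch of the "masks must match" rule
described above, built on top of numpy.ma. The helper name strict_add is
hypothetical, and the sketch is deliberately simplified: it only handles
a 0-d operand (broadcast along every axis) or two operands of identical
shape, so it is an illustration of the idea rather than the alterNEP's
actual semantics.

```python
import numpy as np


def strict_add(x, y):
    """Hypothetical helper: add two operands, refusing mismatched masks."""
    x = np.ma.asarray(x)
    y = np.ma.asarray(y)
    # A 0-d operand is broadcast along every axis, so its mask is unconstrained.
    if x.ndim == 0 or y.ndim == 0:
        return x + y
    # Simplification (an assumption of this sketch): partial broadcasting
    # between arrays of different shapes is not handled.
    if x.shape != y.shape:
        raise NotImplementedError("partial broadcasting is outside this sketch")
    # Same shape, so no axis is broadcast and the masks must agree everywhere.
    if not np.array_equal(np.ma.getmaskarray(x), np.ma.getmaskarray(y)):
        raise ValueError("mask mismatch")
    return x + y


a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
b = np.array([10, 20, 30])  # plain array, i.e. nothing is masked

r1 = strict_add(a, 2)  # scalar is broadcast: allowed
r2 = strict_add(a, a)  # identical masks: allowed
try:
    strict_add(a, b)   # masks differ at index 1: rejected
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Under these assumptions, a + 2 and a + a go through while a + b raises,
which matches the three cases listed above.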

# propagating:
>>> a + b
One of these, I don't know which:
  [11, IGNORED(2), 33] # same as numpy.ma, again
  [11, IGNORED(22), 33]

# non-propagating:
>>> np.sum(a)
4

# propagating:
>>> np.sum(a)
One of these, I don't know which:
  IGNORED(4)
  IGNORED(6)

So from your description, I wouldn't say that you necessarily want
non-destructive+propagating -- it really depends on exactly what
computations you want to perform, and how you expect them to work. The
main difference is how reduction operations are treated. I kind of
feel like the non-propagating version makes more sense overall, but I
don't know if there's any consensus on that.
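
To make the reduction numbers above concrete, here is a small sketch
using numpy.ma as a stand-in for IGNORED. It only shows where the 4 and
the 6 come from; it does not implement a propagating mode, and the
variable names are just for illustration.

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

# numpy.ma reductions leave masked entries out, which corresponds to the
# non-propagating result: 1 + 3.
skipping_sum = a.sum()

# Summing the stored values with the mask disregarded gives the payload
# that IGNORED(6) would carry: 1 + 2 + 3.
payload_sum = a.data.sum()
```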

(You also have the option of just using the new where= argument to
your ufuncs, which avoids some of this confusion because it gives a
single mask that would apply to the whole operation. The ambiguities
here arise because it's not clear what to do when applying a binary
operation to two arrays that have different masks.)
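
A minimal sketch of the where= approach, assuming a NumPy recent enough
that both elementwise ufunc calls and ufunc reductions accept where=
(reductions gained it later than elementwise calls did). The name keep
is a hypothetical selection mask, with True meaning "use this element".

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
keep = np.array([True, False, True])  # hypothetical: ignore the middle element

# Elementwise: positions where keep is False are not computed, so `out`
# keeps whatever it already held there (0 in this sketch).
out = np.zeros_like(a)
np.add(a, b, out=out, where=keep)

# Reduction: only the selected elements contribute to the sum.
total = np.add.reduce(a, where=keep)
```

Note that without an explicit out= array, the skipped positions of an
elementwise result are left uninitialized, so passing out= is the safer
habit.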

Maybe you could give some examples of the kinds of computations you're
thinking of?

-- Nathaniel