[Numpy-discussion] in the NA discussion, what can we agree on?

T J tjhnson at gmail.com
Fri Nov 4 19:14:54 EDT 2011


On Fri, Nov 4, 2011 at 3:38 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Fri, Nov 4, 2011 at 3:08 PM, T J <tjhnson at gmail.com> wrote:
> > On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> Continuing my theme of looking for consensus first... there are
> >> obviously a ton of ugly corners in here. But my impression is that at
> >> least for some simple cases, it's clear what users want:
> >>
> >> >>> a = [1, IGNORED(2), 3]
> >> # array-with-ignored-values + unignored scalar only affects unignored
> >> values
> >> >>> a + 2
> >> [3, IGNORED(2), 5]
> >> # reduction operations skip ignored values
> >> >>> np.sum(a)
> >> 4
> >>
> >> For example, Gary mentioned the common idiom of wanting to take an
> >> array and subtract off its mean, and he wants to do that while leaving
> >> the masked-out/ignored values unchanged. As long as the above cases
> >> work the way I wrote, we will have
> >>
> >> >>> np.mean(a)
> >> 2
> >> >>> a -= np.mean(a)
> >> >>> a
> >> [-1, IGNORED(2), 1]
> >>
> >> Which I'm pretty sure is the result that he wants. (Gary, is that
> >> right?) Also numpy.ma follows these rules, so that's some additional
> >> evidence that they're reasonable. (And I think part of the confusion
> >> between Lluís and me was that these are the rules that I meant when I
> >> said "non-propagating", but he understood that to mean something
> >> else.)
> >>
> >> So before we start exploring the whole vast space of possible ways to
> >> handle masked-out data, does anyone see any reason to consider rules
> >> that don't have, as a subset, the ones above? Do other rules have any
> >> use cases or user demand? (I *love* playing with clever mathematics
> >> and making things consistent, but there's not much point unless the
> >> end result is something that people will use :-).)
> >
> > I guess I'm just confused on how one, in principle, would distinguish the
> > various forms of propagation that you are suggesting (ie for reductions).
>
> Well, numpy.ma does work this way, so certainly it's possible to do.
> At the code level, np.add() and np.add.reduce() are different entry
> points and can behave differently.
>

I see your point, but that seems like just an API difference with a
misleading name: reduce() is conceptually just calling add() a bunch of
times, so it should behave as add() does.  That we can create different
behaviors with various assignment rules (like Pauli's 'm' for mark-ignored)
only makes it more confusing to me.

    a = 1
    a += 2
    a += IGNORE

    b = 1 + 2 + IGNORE

I think having a == b is essential.  If they can be different, that will
only lead to confusion.  On this point alone, does anyone think it is
acceptable to have a != b?
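
For concreteness, numpy.ma answers "yes" today: its reductions skip masked
values, while folding + by hand propagates them, so a != b there.  A quick
check (this is current numpy.ma behavior, as I understand it):

>>> import numpy as np
>>> m = np.ma.array([1, 2, 3], mask=[False, False, True])
>>> m.sum()               # the reduction skips the masked value
3
>>> m[0] + m[1] + m[2]    # chaining + propagates it instead
masked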


>
> OTOH, it might be that it's impossible to do *while still maintaining
> other things we care about*... but in that case we should just shake
> our fists at the mathematics and then give up, instead of coming up
> with an elegant system that isn't actually useful. So that's why I
> think we should figure out what's useful first.
>

Agreed.  I'm on the same page.


>
> > I also don't think it is good that we lack commutativity.  If we disallow
> > unignoring, then yes, I agree that what you wrote above is what people
> > want.  But if we are allowed to unignore, then I do not.
>
> I *think* that for the no-unignoring (also known as "MISSING") case,
> we have a pretty clear consensus that we want something like:
>
> >>> a + 2
> [3, MISSING, 5]
> >>> np.sum(a)
> MISSING
> >>> np.sum(a, skip_MISSING=True)
> 4
>
> (Please say if you disagree, but I really hope you don't!) This case
> is also easier, because we don't even have to allow a skip_MISSING
> flag in cases where it doesn't make sense (e.g., unary or binary
> operations) -- it's a convenience feature, so no-one will care if it
> only works when it's useful ;-).
>
>
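For what it's worth, plain NaN already gives exactly this propagating
behavior today, with np.nansum as a rough stand-in for skip_MISSING=True
(an analogy only, not the proposed API):

>>> import numpy as np
>>> a = np.array([1.0, np.nan, 3.0])
>>> a + 2
array([  3.,  nan,   5.])
>>> np.sum(a)
nan
>>> np.nansum(a)
4.0
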
Yes, in agreement.  I was talking specifically about the IGNORE case.   And
my point is that if we allow people to remove the IGNORE flag and see the
original data (and if the payloads are computed), then we should care about
commutativity:

>>> x = [1, IGNORE(2), 3]
>>> x2 = x.copy()
>>> y = [10, 11, IGNORE(12)]
>>> z = x + y
>>> a = z.sum()
>>> x += y
>>> b = x.sum()
>>> y += x2
>>> c = y.sum()

So, we should have:  a == b == c.
Additionally, if we allow users to unignore data, then we should have:

>>> x = [1, IGNORE(2), 3]
>>> x2 = x.copy()
>>> y = [10, 11, IGNORE(12)]
>>> x += y
>>> aa = unignore(x).sum()
>>> y += x2
>>> bb = unignore(y).sum()
>>> aa == bb
True

Is there agreement on this?
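
As a data point, numpy.ma seems to already satisfy the non-in-place version
of this: it computes the payloads and ORs the masks, so operand order doesn't
matter.  A quick check, using .data as the stand-in for unignore() (someone
please correct me if numpy.ma fills rather than computes the payload here):

>>> import numpy as np
>>> x = np.ma.array([1, 2, 3], mask=[False, True, False])
>>> y = np.ma.array([10, 11, 12], mask=[False, False, True])
>>> (x + y).sum() == (y + x).sum()    # masked sums agree
True
>>> (x + y).data                      # payloads computed, not filled
array([11, 13, 15])
>>> (y + x).data
array([11, 13, 15])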


> > Also, how does something like this get handled?
> >
> >>>> a = [1, 2, IGNORED(3), NaN]
> >
> > If I were to say, "What is the mean of 'a'?", then I think most of the
> > time people would want 1.5.
>
>
> I would want NaN! But that's because the only way I get NaN's is when
> I do dumb things like compute log(0), and again, I want my code to
> tell me that I was dumb instead of just quietly making up a
> meaningless answer.
>
>
That's definitely field-specific then.  In probability, 0 log(0) = 0 is a
common idiom (it comes up whenever you compute entropies).  In NumPy,
0 * log(0) gives NaN, so you'd want to ignore those terms when summing.
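
For concreteness, here's the idiom I have in mind, with np.nansum standing
in for the "skip the ignorable terms" step (just a sketch of the convention,
not a proposal for the API):

>>> import numpy as np
>>> p = np.array([0.5, 0.5, 0.0])
>>> with np.errstate(divide='ignore', invalid='ignore'):
...     terms = p * np.log(p)    # the p == 0 term comes out as nan
...
>>> -np.nansum(terms)            # apply 0 log(0) = 0 by skipping the nan
0.6931471805599453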