<br><div class="gmail_quote">On Fri, Nov 4, 2011 at 3:38 PM, Nathaniel Smith <span dir="ltr"><<a href="mailto:njs@pobox.com">njs@pobox.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">On Fri, Nov 4, 2011 at 3:08 PM, T J <<a href="mailto:tjhnson@gmail.com">tjhnson@gmail.com</a>> wrote:<br>

> On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>> wrote:<br>

</div><div><div></div><div class="h5">>> Continuing my theme of looking for consensus first... there are<br>

>> obviously a ton of ugly corners in here. But my impression is that at<br>

>> least for some simple cases, it's clear what users want:<br>

>><br>

>> >>> a = [1, IGNORED(2), 3]<br>

>> # array-with-ignored-values + unignored scalar only affects unignored<br>

>> values<br>

>> >>> a + 2<br>

>> [3, IGNORED(2), 5]<br>

>> # reduction operations skip ignored values<br>

>> >>> np.sum(a)<br>

>> 4<br>

>><br>

>> For example, Gary mentioned the common idiom of wanting to take an<br>

>> array and subtract off its mean, and he wants to do that while leaving<br>

>> the masked-out/ignored values unchanged. As long as the above cases<br>

>> work the way I wrote, we will have<br>

>><br>

>> >>> np.mean(a)<br>

>> 2<br>

>> >>> a -= np.mean(a)<br>

>> >>> a<br>

>> [-1, IGNORED(2), 1]<br>

>><br>

>> Which I'm pretty sure is the result that he wants. (Gary, is that<br>

>> right?) Also <a href="http://numpy.ma" target="_blank">numpy.ma</a> follows these rules, so that's some additional<br>

>> evidence that they're reasonable. (And I think part of the confusion<br>

>> between Lluís and me was that these are the rules that I meant when I<br>

>> said "non-propagating", but he understood that to mean something<br>

>> else.)<br>

>><br>

>> So before we start exploring the whole vast space of possible ways to<br>

>> handle masked-out data, does anyone see any reason to consider rules<br>

>> that don't have, as a subset, the ones above? Do other rules have any<br>

>> use cases or user demand? (I *love* playing with clever mathematics<br>

>> and making things consistent, but there's not much point unless the<br>

>> end result is something that people will use :-).)<br>

><br>

> I guess I'm just confused on how one, in principle, would distinguish the<br>

> various forms of propagation that you are suggesting (ie for reductions).<br>

<br>

</div></div>Well, <a href="http://numpy.ma" target="_blank">numpy.ma</a> does work this way, so certainly it's possible to do.<br>

At the code level, np.add() and np.add.reduce() are different entry<br>

points and can behave differently.<br></blockquote><div><br>I see your point, but that seems like just an API difference with a bad name.  reduce() is just calling add() a bunch of times, so it seems like it should behave as add() does.  The fact that we can get different behaviors from different assignment rules (like Pauli's 'm' for mark-ignored) only makes it more confusing to me.<br>

<br>    a = 1<br>    a += 2<br>    a += IGNORE<br><br>    b = 1 + 2 + IGNORE<br><br>I think having a == b is essential.  If they can be different, that will only lead to confusion.  On this point alone, does anyone think it is acceptable to have a != b?<br>
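<br>For concreteness, here is a minimal sketch of how the two can differ in numpy.ma today. It assumes np.ma.masked as a stand-in for IGNORE; the IGNORE semantics under discussion may well end up different, and the variable names are only illustrative.<br>

```python
import numpy as np

# Chained binary adds: the masked constant propagates through each step.
chained = 1 + 2 + np.ma.masked

# Reduction over the same three values, with the last one masked:
# numpy.ma skips the masked element instead of propagating it.
arr = np.ma.array([1, 2, 0], mask=[False, False, True])
reduced = arr.sum()

print(np.ma.is_masked(chained))  # True: the chained result is masked
print(reduced)                   # 3: the reduction ignored the masked slot
```

<br>So under numpy.ma's rules the element-by-element result and the reduction already disagree, which is the a != b situation I am asking about.<br>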

  <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

OTOH, it might be that it's impossible to do *while still maintaining<br>

other things we care about*... but in that case we should just shake<br>

our fists at the mathematics and then give up, instead of coming up<br>

with an elegant system that isn't actually useful. So that's why I<br>

think we should figure out what's useful first.<br></blockquote><div><br>Agreed.  I'm on the same page.<br> <br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


<div class="im"><br>

> I also don't think it is good that we lack commutativity.  If we disallow<br>

> unignoring, then yes, I agree that what you wrote above is what people<br>

> want.  But if we are allowed to unignore, then I do not.<br>

<br>

</div>I *think* that for the no-unignoring (also known as "MISSING") case,<br>

we have a pretty clear consensus that we want something like:<br>

<br>

>>> a + 2<br>

[3, MISSING, 5]<br>

>>> np.sum(a)<br>

MISSING<br>

>>> np.sum(a, skip_MISSING=True)<br>

4<br>

<br>

(Please say if you disagree, but I really hope you don't!) This case<br>

is also easier, because we don't even have to allow a skip_MISSING<br>

flag in cases where it doesn't make sense (e.g., unary or binary<br>

operations) -- it's a convenience feature, so no-one will care if it<br>

only works when it's useful ;-).<br>

<br></blockquote><div><br>Yes, in agreement.  I was talking specifically about the IGNORE case.   And my point is that if we allow people to remove the IGNORE flag and see the original data (and if the payloads are computed), then we should care about commutativity:<br>

<br>>>> x = [1, IGNORE(2), 3]<br>>>> x2 = x.copy()<br>>>> y = [10, 11, IGNORE(12)]<br>>>> z = x + y<br>>>> a = z.sum()<br>>>> x += y<br>>>> b = x.sum()<br>>>> y += x2<br>

>>> c = y.sum()<br><br>So, we should have:  a == b == c.<br>Additionally, if we allow users to unignore data, then we should have:<br><br>>>> x = [1, IGNORE(2), 3]<br>

>>> x2 = x.copy()<br>

>>> y = [10, 11, IGNORE(12)]<br>

>>> x += y<br>

>>> aa = unignore(x).sum()<br>

>>> y += x2<br>

>>> bb = unignore(y).sum()<br>

>>> aa == bb<br>True<br><br>Is there agreement on this?<br><br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">


> Also, how does something like this get handled?<br>

><br>

>>>> a = [1, 2, IGNORED(3), NaN]<br>

><br>

> If I were to say, "What is the mean of 'a'?", then I think most of the time<br>

> people would want 1.5.<br>

<br>

</div><br>I would want NaN! But that's because the only way I get NaN's is when<br>

I do dumb things like compute log(0), and again, I want my code to<br>

tell me that I was dumb instead of just quietly making up a<br>

meaningless answer.<br>

<br></blockquote><div><br>That's definitely field-specific then.  In probability, 0 log(0) = 0 is a common convention.  In NumPy, 0 * log(0) gives NaN, so you'd want to ignore those terms when summing.<br></div></div><br>
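<br>A minimal sketch of the entropy case, assuming a made-up distribution p with one zero-probability outcome. np.nansum is used here as one way to "ignore the NaN terms when summing"; it is not the only option.<br>

```python
import numpy as np

# Hypothetical distribution with a zero-probability outcome.
p = np.array([0.5, 0.5, 0.0])

# 0 * log(0) evaluates to 0 * -inf, which is NaN; silence the warnings.
with np.errstate(divide="ignore", invalid="ignore"):
    terms = p * np.log(p)

plain = np.sum(terms)        # NaN: the single bad term poisons the sum
entropy = -np.nansum(terms)  # treats the NaN term as contributing 0

print(plain)
print(entropy, np.log(2))
```

<br>With the NaN term skipped, the result matches the usual convention 0 log(0) = 0 and gives an entropy of log(2) for this p.<br>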