On Fri, Oct 28, 2011 at 12:39 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,

On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root@ou.edu> wrote:
>
>
> On Thursday, October 27, 2011, Charles R Harris <charlesr.harris@gmail.com>
> wrote:
>>
>>
>> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <oliphant@enthought.com>
>> wrote:
>>>
>>> That is a pretty good explanation.   I find myself convinced by Matthew's
>>> arguments.    I think that being able to separate ABSENT from IGNORED is a
>>> good idea.   I also like being able to control SKIP and PROPAGATE (but I
>>> think the current implementation allows this already).
>>>
>>> What is the counter-argument to this proposal?
>>>
>>
>> What exactly do you find convincing? The current masks propagate by
>> default:
>>
>> In [1]: a = ones(5, maskna=1)
>>
>> In [2]: a[2] = NA
>>
>> In [3]: a
>> Out[3]: array([ 1.,  1.,  NA,  1.,  1.])
>>
>> In [4]: a + 1
>> Out[4]: array([ 2.,  2.,  NA,  2.,  2.])
>>
>> In [5]: a[2] = 10
>>
>> In [6]: a
>> Out[6]: array([  1.,   1.,  10.,   1.,   1.], maskna=True)
>>
>>
>> I don't see an essential difference between the implementation using masks
>> and one using bit patterns. The mask, when attached to the original array,
>> just adds a bit pattern by extending all the types by one byte, an approach
>> that easily extends to all existing and future types, which is why Mark went
>> that way for the first implementation given the time available. The masks
>> are hidden because folks wanted something that behaved more like R, and also
>> because of the desire to combine missing, ignore, and later possibly bit
>> patterns in a unified manner. Note that the pseudo assignment was also meant
>> to look like R. Adding true bit patterns to numpy isn't trivial, and I
>> believe Mark was thinking of parametrized types for that.
>>
>> The main problems I see with masks are unified storage and possibly memory
>> use. The rest is just behavior and the desired API, and that can be adjusted
>> within the current implementation. There is nothing essentially masky about
>> masks.
>>
>> Chuck
>>
>>
>
> I think Chuck sums it up quite nicely.  The implementation detail of
> using masks versus bit patterns can still be discussed and addressed.
> Personally, I just don't see how parameterized dtypes would be easier to use
> than the pseudo assignment.
>
> The elegance of Mark's solution was to consider the treatment of missing
> data in a unified manner.  This puts missing data in a more prominent spot
> for extension builders, which should greatly improve support throughout the
> ecosystem.

Are extension builders then required to use the numpy C API to get
their data?  Speaking as an extension builder, I would rather you gave
me the mask and the bitpattern information and let me do that myself.


Forgive me, I wasn't clear.  What I am speaking of is more about a typical human failing: if the programmer of a module never encounters masked arrays, then when they code up a function to operate on numpy data, it is quite likely that they will never take them into consideration.  Notice the prolific use of "np.asarray()" even within the numpy codebase, which silently destroys masked arrays.
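The np.asarray() hazard is easy to demonstrate with today's numpy.ma (a sketch against the existing numpy.ma API, not either NEP):

```python
import numpy as np

# A masked array standing in for data with an "ignored" element.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.asarray returns a plain ndarray: the mask is silently discarded
# and the underlying (supposedly ignored) value leaks through.
plain = np.asarray(m)
print(type(plain).__name__)   # ndarray
print(plain)                  # [1. 2. 3.]

# np.asanyarray preserves ndarray subclasses, so the mask survives.
safe = np.asanyarray(m)
print(isinstance(safe, np.ma.MaskedArray))   # True
```

A library author who reaches for asarray() by habit gets no error, just wrong answers on masked input, which is exactly the failure mode described above.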

However, by making missing data support integral to the core of numpy, it becomes far more likely that a programmer will take it into consideration when designing their algorithm, or at least explicitly document that their module does not support missing data.  Both NEPs do this by making missing data front-and-center.  However, my belief is that Mark's approach is easier to comprehend and is cleaner.  Cleaner features mean that they are more likely to be used.

 
> By letting there be a single missing data framework (instead of
> two) all that users need to figure out is when they want nan-like behavior
> (propagate) or to be more like masks (skip).  Numpy takes care of the rest.
>  One reason I like using masked arrays is that I don't have to
> use nansum in my library functions to guard against the possibility of
> receiving nans.  Duck-typing is a good thing.
>
> My argument against separating IGNORE and PROPAGATE is that it becomes too
> tempting to want to mix these in an array, but the desired behavior would
> likely become ambiguous.
>
> There is one other problem that I just thought of that I don't think has
> been outlined in either NEP.  What if I perform an operation between an
> array set up with propagate NAs and an array with skip NAs?

These are explicitly covered in the alterNEP:

https://gist.github.com/1056379/


Sort of.  You speak of reduction operations for a single array with a mix of NAs and IGNOREs.  I guess in that case it wouldn't make a difference for element-wise operations between two arrays (plus the rule that NAs propagate harder).  Although, what if skipna=True?  I would feel better seeing explicit examples for different combinations of settings (and how would one set those for the math operators?).  In this mixed situation, I have a problem with the proposal: I would think that IGNORE + NA = IGNORE, because if you are skipping an element, then it is skipped regardless of the other side of the operator (precedent: a masked array summed against an array of NaNs).
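That precedent can be checked in current numpy, using numpy.ma for IGNORE and NaN as a stand-in for a propagating NA (an analogy only, not an implementation of either NEP):

```python
import numpy as np

ignore = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
nas = np.array([10.0, np.nan, 30.0])  # NaN standing in for a propagating NA

r = ignore + nas
# Where the mask is set, the result stays masked even though the other
# operand is NaN: effectively IGNORE + NA = IGNORE at that position.
print(r)        # [11.0 -- 33.0]
print(r.sum())  # 44.0; the masked NaN never reaches the reduction
```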

Looking back over Mark's NEP, I see he does cover the issue I am talking about: "The design of this NEP does not distinguish between NAs that come from an NA mask or NAs that come from an NA dtype. Both of these get treated equivalently in computations, with masks dominating over NA dtypes."  However, he goes on to discuss the possibility of multi-NA support controlling these effects more directly.
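For concreteness, the propagate/skip distinction the NEPs argue over already exists in released numpy, split between NaN arithmetic and numpy.ma (illustrative only; the NEP's skipna= flag is not in released numpy):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

# PROPAGATE: one missing value poisons the whole reduction.
print(a.sum())        # nan

# SKIP: explicitly exclude the missing value with a special function.
print(np.nansum(a))   # 4.0

# numpy.ma gives skip behavior through the ordinary method,
# the duck-typing convenience mentioned earlier in the thread.
m = np.ma.masked_invalid(a)
print(m.sum())        # 4.0
```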

Cheers,
Ben Root