[Numpy-discussion] alterNEP - was: missing data discussion round 2

Thu Jun 30 10:26:31 EDT 2011

On 06/30/2011 04:17 PM, Charles R Harris wrote:
>
>
> On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett
> <matthew.brett at gmail.com> wrote:
>
>     Hi,
>
>     On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith
>     <njs at pobox.com> wrote:
>      > Anyway, it's pretty clear that in this particular case, there are two
>      > distinct features that different people want: the missing data
>      > feature, and the masked array feature. The more I think about it, the
>      > less I see how they can be combined into one dessert topping + floor
>      > wax solution. Here are three particular points where they seem to
>      > contradict each other:
>     ...
>     [some proposals]
>
>     In the interest of making the discussion as concrete as possible, here
>     is my draft of an alternative proposal for NAs and masking, based on
>     Nathaniel's comments.  Writing it, it seemed to me that Nathaniel is
>     right, that the ideas become much clearer when the NA idea and the
>     MASK idea are separate.   Please do pitch in for things I may have
>     missed or misunderstood:
>
>     ################################################
>     An alternative-NEP on masking and missing values
>     ################################################
>
>     The principle of this aNEP is to separate the APIs for masking and
>     for missing values, according to
>
>     * The current implementation of masked arrays
>     * Nathaniel Smith's proposal.
>
>     This discussion is only of the API, and not of the implementation.
>
>     **************
>     Initialization
>     **************
>
>     First, missing values can be set and be displayed as ``np.NA, NA``::
>
>      >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>         array([1., 2., NA, 7.], dtype='NA[<f8]')
>
>     As the initialization is not ambiguous, this can be written without
>     the NA dtype::
>
>      >>> np.array([1.0, 2.0, np.NA, 7.0])
>         array([1., 2., NA, 7.], dtype='NA[<f8]')
>
>     Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>
>      >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
>         array([1., 2., MASKED, 7.], masked=True)
>
>     As the initialization is not ambiguous, this can be written without
>     ``masked=True``::
>
>      >>> np.array([1.0, 2.0, np.MASKED, 7.0])
>         array([1., 2., MASKED, 7.], masked=True)
>
>     ******
>     Ufuncs
>     ******
>
>     By default, NA values propagate::
>
>      >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>      >>> np.sum(na_arr)
>         NA('float64')
>
>     unless the ``skipna`` flag is set::
>
>      >>> np.sum(na_arr, skipna=True)
>         10.0
>
>     By default, masking does not propagate::
>
>      >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>      >>> np.sum(masked_arr)
>         10.0
>
>     unless the ``propmsk`` flag is set::
>
>      >>> np.sum(masked_arr, propmsk=True)
>         MASKED
>
>     An array can be masked, and contain NA values::
>
>      >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])
>
>     In the default case, the behavior is obvious::
>
>      >>> np.sum(both_arr)
>         NA('float64')
>
>     It's also obvious what to do with ``skipna=True``::
>
>      >>> np.sum(both_arr, skipna=True)
>         10.0
>      >>> np.sum(both_arr, skipna=True, propmsk=True)
>         MASKED
>
>     To break the tie between NA and MASKED, NAs propagate harder::
>
>      >>> np.sum(both_arr, propmsk=True)
>         NA('float64')
>
>     **********
>     Assignment
>     **********
>
>     is obvious in the NA case::
>
>      >>> arr = np.array([1.0, 2.0, 7.0])
>      >>> arr[2] = np.NA
>         TypeError('dtype does not support NA')
>      >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>      >>> na_arr[2] = np.NA
>      >>> na_arr
>         array([1., 2., NA], dtype='NA[<f8]')
>
>     Direct assignment in the masked case is magic and confusing, and so
>     happens only via the mask::
>
>      >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>      >>> masked_arr[2] = np.NA
>         TypeError('dtype does not support NA')
>      >>> masked_arr[2] = np.MASKED
>         TypeError('float() argument must be a string or a number')
>      >>> masked_arr.visible[2] = False
>      >>> masked_arr
>         array([1., 2., MASKED], masked=True)
>
>     See y'all,
>
>
> I honestly don't see the problem here. The difference isn't between
> masked_values/missing_values, it is between masked arrays and masked
> views of unmasked arrays. I think the view concept is central to what is
> going on. It may not be what folks are used to, but it strikes me as a
> clarifying advance rather than a mixed up confusion. Admittedly, it
> depends on the numpy centric ability to have views, but views are a
> wonderful thing.

So (a) how do you propose that reductions behave, and (b) what semantics
do you propose for the []= operator?

Answering those would clarify why you don't see a problem.
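
For reference, here is a rough sketch of how questions (a) and (b) play
out with what NumPy already ships. It is only an analogy, not the API
proposed above: np.nan stands in for NA, numpy.ma stands in for a
masked=True array, and np.nansum stands in for skipna=True. None of the
names np.NA, np.MASKED, skipna, propmsk or .visible are used, since they
exist only in the draft.

```python
import numpy as np

# (a) Reductions. np.nan is a stand-in for the proposed NA.
na_like = np.array([1.0, 2.0, np.nan, 7.0])
propagating_sum = np.sum(na_like)   # nan: the "missing" value propagates
skipping_sum = np.nansum(na_like)   # 10.0: roughly what skipna=True would do

# numpy.ma is a stand-in for the proposed masked=True arrays.
masked_like = np.ma.array([1.0, 2.0, 3.0, 7.0],
                          mask=[False, False, True, False])
masked_sum = masked_like.sum()      # 10.0: the masked element is skipped

# (b) []= semantics on a masked view of an unmasked array.
base = np.array([1.0, 2.0, 7.0])
view = np.ma.array(base, mask=[False, False, False], copy=False)

view[2] = np.ma.masked   # hides element 2; the value in base is untouched
view[0] = 5.0            # an ordinary assignment writes through to base

view_sum = view.sum()    # 7.0: only the visible elements 5.0 and 2.0
```

In this analogy, assigning the masked constant only touches the mask,
while assigning a number changes the shared data. Whether the proposed
API should behave the same way is exactly what (b) is asking.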

Dag Sverre