[Numpy-discussion] alterNEP - was: missing data discussion round 2

Thu Jun 30 12:30:06 EDT 2011

Hi,

On Thu, Jun 30, 2011 at 5:03 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>
> On Jun 30, 2011, at 5:38 PM, Matthew Brett wrote:
>
>> Hi,
>>
>> On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>>>
>>> On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote:
>>>> ################################################
>>>> An alternative-NEP on masking and missing values
>>>> ################################################
>>>
>>> I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly unambiguous that way.
>>>
>>>
>>>> **************
>>>> Initialization
>>>> **************
>>>>
>>>> First, missing values can be set and be displayed as ``np.NA, NA``::
>>>>
>>>> >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>>>>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>>>>
>>>> As the initialization is not ambiguous, this can be written without the NA
>>>> dtype::
>>>>
>>>> >>> np.array([1.0, 2.0, np.NA, 7.0])
>>>>    array([1., 2., NA, 7.], dtype='NA[<f8]')
>>>>
>>>> Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>>>>
>>>> >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
>>>>    array([1., 2., MASKED, 7.], masked=True)
>>>>
>>>> As the initialization is not ambiguous, this can be written without
>>>> ``masked=True``::
>>>>
>>>> >>> np.array([1.0, 2.0, np.MASKED, 7.0])
>>>>    array([1., 2., MASKED, 7.], masked=True)
>>>
>>> I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here.
>>
>> If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then
>> obviously I mean it should be masked, so the 'masked=True' here is
>> completely redundant, yes, I agree.  And in fact:
>>
>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=False)
>>
>> should raise an error.  On the other hand, if I make a normal array:
>>
>> arr = np.array([1.0, 2.0, 7.0])
>>
>> and then do this:
>>
>> arr.visible[2] = False
>>
>> then either I should raise an error (it's not a masked array), or,
>> more magically, construct a mask on the fly.   This somewhat breaks
>> expectations though, because you might just have made a largish mask
>> array without having any clue that that had happened.
>
> Well, I'd expect an error to be raised when assigning a NA if the initial array is not NA friendly. The 'magical' creation of a mask would be nice, but is probably too magic and best left alone.

I agree :)
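
For comparison, the existing numpy.ma module already takes the 'magic'
route: masking a single element of an array that has no mask allocates
a full boolean mask behind the scenes.  A minimal sketch of that
behaviour, using numpy.ma only as a stand-in (``arr.visible`` and
``masked=True`` are proposals from this thread and do not exist in
NumPy):

```python
import numpy.ma as ma

# numpy.ma stands in for the proposed masked arrays; this is an
# illustration of the "mask created on the fly" concern, not the
# alterNEP API.
arr = ma.array([1.0, 2.0, 7.0])

# Freshly created: no mask has been allocated yet.
had_mask_before = ma.getmask(arr) is not ma.nomask

# Masking one element allocates a boolean mask for the whole array.
arr[2] = ma.masked
mask_after = ma.getmask(arr)

print(had_mask_before)
print(mask_after.shape, mask_after.tolist())
```

The allocation happens without any explicit request, which is the
surprise described above for large arrays.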

>>>
>>>>
>>>> Direct assignment in the masked case is magic and confusing, and so happens only
>>>> via the mask::
>>>>
>>>> >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>>>> >>> masked_arr[2] = np.NA
>>>>    TypeError('dtype does not support NA')
>>>> >>> masked_arr[2] = np.MASKED
>>>>    TypeError('float() argument must be a string or a number')
>>>> >>> masked_arr.visible[2] = False
>>>> >>> masked_arr
>>>>    array([1., 2., MASKED], masked=True)
>>>
>>> What about the reverse case? When you assign a regular value to a np.NA/np.IGNORE item?
>>
>> Well, for the np.NA case, this is straightforward:
>>
>> na_arr[2] = 3
>>
>> It's just assignment. For ``masked_array[2] = 3`` - I don't know, I
>> guess whatever we are used to.  What do you think?
>
> Ahah, that depends.
> With a = np.array([1., np.NA, 3.]), then a[1]=2. should raise an error, as Mark suggests: you can't "unmask" a missing value, you need to create a view of the initial array then "unmask". It's the equivalent of a hard mask.

In this alterNEP, the NAs and the masked values are completely
different.  So, if you do this:

a = np.array([1., np.NA, 3.])

then you've unambiguously asked for an array that can handle floats
and NAs, and that will be the NA[<f8] dtype by default.  You didn't
ask for a masked array, you asked for an array that can carry NAs.
You can't unmask an NA, because an NA isn't a masked value, it's an
NA.  So, if you do:

a[1] = 2

you just mean 'change the NA in position [1] to the value 2'.   Simple as that.
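
As a rough illustration of 'NA is just a value the dtype can hold', here
is a sketch that uses a float NaN as a stand-in.  This is an assumption
for the sake of a runnable example: ``np.NA`` and the ``NA[<f8]`` dtype
are proposals and are not available in NumPy, and NaN is not the same
thing as the proposed NA.

```python
import numpy as np

# NaN is only a stand-in for the proposed np.NA (assumption).
a = np.array([1.0, np.nan, 3.0])

# No mask is involved, so there is nothing to "unmask":
# plain assignment simply overwrites the special value.
a[1] = 2

print(a)
print(np.isnan(a).any())
```

The array keeps its dtype throughout; only the stored element changes.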

> With a = np.array([1., np.IGNORE, 3.]), then a[1]=2. should give np.array([1.,2.,3.]) and a.mask=[False,False,False]. That's a soft mask.

Sounds reasonable to me...
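
The soft/hard distinction can be tried today with numpy.ma, which is the
closest existing analogue.  A minimal sketch, with two caveats:
``np.IGNORE`` is only a proposed name, and a numpy.ma hard mask silently
ignores the assignment rather than raising the error suggested above for
the NA case.

```python
import numpy.ma as ma

# Soft mask (the numpy.ma default): assigning a value unmasks the element.
soft = ma.array([1.0, 99.0, 3.0], mask=[False, True, False])
soft[1] = 2.0
soft_mask = ma.getmaskarray(soft).tolist()
soft_values = soft.filled(-1.0).tolist()

# Hard mask: the assignment is silently dropped and the element stays masked.
hard = ma.array([1.0, 99.0, 3.0], mask=[False, True, False], hard_mask=True)
hard[1] = 2.0
hard_mask = ma.getmaskarray(hard).tolist()
hard_values = hard.filled(-1.0).tolist()

print(soft_mask, soft_values)
print(hard_mask, hard_values)
```

Here ``filled(-1.0)`` replaces any element that is still masked with
-1.0, which makes the difference between the two cases visible.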

Cheers,

Matthew