[Numpy-discussion] missing data discussion round 2

Wed Jun 29 15:35:35 EDT 2011

Oops,

On Wed, Jun 29, 2011 at 8:32 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi,
>
> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> On Wed, Jun 29, 2011 at 8:20 AM, Lluís <xscript at gmx.net> wrote:
>>>
>>> Matthew Brett writes:
>>>
>>> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
>>> >> the idea that the entry is still there, but we're just ignoring it.  Of
>>> >> course, that goes against common convention, but it might be easier to
>>> >> explain.
>>>
>>> > I think Nathaniel's point is that np.IGNORE is a different idea than
>>> > np.NA, and that is why joining the implementations can lead to
>>> > conceptual confusion.
>>>
>>> This is how I see it:
>>>
>>> >>> a = np.array([0, 1, 2], dtype=int)
>>> >>> a[0] = np.NA
>>> ValueError
>>> >>> e = np.array([np.NA, 1, 2], dtype=int)
>>> ValueError
>>> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>>> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
>>> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>>> >>> b[1] = np.NA
>>> >>> np.sum(b)
>>> np.NA
>>> >>> np.sum(b, skipna=True)
>>> 2
>>> >>> b.mask
>>> None
>>> >>> m[1] = np.NA
>>> >>> np.sum(m)
>>> 2
>>> >>> np.sum(m, skipna=True)
>>> 2
>>> >>> m.mask
>>> [False, False, True]
>>> >>> bm[1] = np.NA
>>> >>> np.sum(bm)
>>> 2
>>> >>> np.sum(bm, skipna=True)
>>> 2
>>> >>> bm.mask
>>> [False, False, True]
>>>
>>> So:
>>>
>>> * Mask takes precedence over bit pattern on element assignment. There's
>>>  still the question of how to assign a bit pattern NA when the mask is
>>>  active.
>>>
>>> * When using mask, elements are automagically skipped.
>>>
>>> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>>>
>>> * When using bit pattern + mask, it might make sense to have the initial
>>>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>>>  False, True]" and "np.sum(bm) == np.NA")
>>
>> There seems to be a general idea that masks and NA bit patterns imply
>> particular differing semantics, something which I think is simply false.
>
> Well - first - it's helpful surely to separate the concepts and the
> implementation.
>
> Concepts / use patterns (as delineated by Nathaniel):
> A) missing values == 'np.NA' in my emails.  Can we call that CMV
> (concept missing values)?
> B) masks == np.IGNORE in my emails.  CMSK (concept masks)?
>
> Implementations
> 1) bit-pattern == na-dtype - how about we call that IBP
> (implementation bit pattern)?
> 2) array.mask.  IM (implementation mask)?
>
> Nathaniel implied that:
>
> CMV implies:  sum([np.NA, 1]) == np.NA
> CMSK implies: sum([np.NA, 1]) == 1
>
> and indeed, that's how R and masked arrays respectively behave.  So I
> think it's reasonable to say that at least R thought that the bit pattern
> implied the first and Pierre and others thought the mask meant the
> second.
>
> The NEP as it stands thinks of CMV and CMSK as being different views
> of the same thing.  Please correct me if I'm wrong.
>
>> Both NaN and Inf are implemented in hardware with the same idea as the NA
>> bit pattern, but they do not follow NA missing value semantics.
>
> Right - and that doesn't affect the argument, because the argument is
> about the concepts and not the implementation.
>
>> As far as I can tell, the only required difference between them is that NA
>> bit patterns must destroy the data. Nothing else.
>
> I think Nathaniel's point was about the expected default behavior in
> the different concepts.
>
>> Everything on top of that
>> is a choice of API and interface mechanisms. I want them to behave exactly
>> the same except for that necessary difference, so that it will be possible
>> to use the *exact same Python code* with either approach.
>
> Right.  And Nathaniel's point is that that desire leads to fusion of
> the two ideas into one when they should be separated.  For example, if
> I understand correctly:
>
>>>> a = np.array([1.0, 2.0, 3, 7.0], masked=True)
>>>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>>>> a[3] = np.NA  # actual real hand-on-heart assignment
>>>> b[3] = np.NA # magic mask setting although it looks the same

I meant:

>>> a = np.array([1.0, 2.0, 3.0, 7.0], masked=True)
>>> b = np.array([1.0, 2.0, 3.0, 7.0], dtype='NA[f8]')
>>> b[3] = np.NA  # actual real hand-on-heart assignment
>>> a[3] = np.NA  # magic mask setting although it looks the same
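
To make the distinction concrete without relying on any API that does not exist yet, here is a toy pure-Python sketch.  Everything in it is hypothetical: the NA sentinel and the BitPatternArray and MaskedArray classes are made up for illustration and are not NumPy's implementation or the NEP's proposed interface.  It follows the mask convention used earlier in this thread (True means the element is visible) and the default sum behaviors Nathaniel attributed to each concept.

```python
NA = object()  # hypothetical sentinel standing in for the proposed np.NA


class BitPatternArray:
    """Toy IBP model: NA lives in the data itself, so assigning it destroys the value."""

    def __init__(self, values):
        self.data = list(values)

    def __setitem__(self, index, value):
        self.data[index] = value  # overwriting with NA loses the old value

    def sum(self, skipna=False):
        if skipna:
            return sum(v for v in self.data if v is not NA)
        if any(v is NA for v in self.data):
            return NA  # CMV default: NA propagates
        return sum(self.data)


class MaskedArray:
    """Toy IM model: a separate mask hides elements and the data stays intact."""

    def __init__(self, values):
        self.data = list(values)
        self.mask = [True] * len(self.data)  # True = visible, as in the thread

    def __setitem__(self, index, value):
        if value is NA:
            self.mask[index] = False  # only the mask changes
        else:
            self.data[index] = value
            self.mask[index] = True

    def sum(self):
        # CMSK default: hidden elements are skipped
        return sum(v for v, visible in zip(self.data, self.mask) if visible)


a = MaskedArray([1.0, 2.0, 3.0, 7.0])
b = BitPatternArray([1.0, 2.0, 3.0, 7.0])
b[3] = NA  # the 7.0 is gone for good
a[3] = NA  # the 7.0 is hidden but still stored

assert b.sum() is NA
assert b.sum(skipna=True) == 6.0
assert a.sum() == 6.0

a.mask[3] = True  # unmasking recovers the original value
assert a.sum() == 13.0
```

The two assignments look identical at the call site, but only the masked one can be undone, which is the conceptual difference I was trying to point at.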

Sorry,

Matthew