[Numpy-discussion] missing data discussion round 2

Thu Jun 30 11:06:24 EDT 2011

On Wed, Jun 29, 2011 at 2:32 PM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Wed, Jun 29, 2011 at 8:20 AM, Lluís <xscript at gmx.net> wrote:
> >>
> >> Matthew Brett writes:
> >>
> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> >> >> the idea that the entry is still there, but we're just ignoring it.
> >> >> Of course, that goes against common convention, but it might be easier
> >> >> to explain.
> >>
> >> > I think Nathaniel's point is that np.IGNORE is a different idea than
> >> > np.NA, and that is why joining the implementations can lead to
> >> > conceptual confusion.
> >>
> >> This is how I see it:
> >>
> >> >>> a = np.array([0, 1, 2], dtype=int)
> >> >>> a[0] = np.NA
> >> ValueError
> >> >>> e = np.array([np.NA, 1, 2], dtype=int)
> >> ValueError
> >> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
> >> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
> >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
> >> >>> b[1] = np.NA
> >> >>> np.sum(b)
> >> np.NA
> >> >>> np.sum(b, skipna=True)
> >> 2
> >> >>> b.mask
> >> None
> >> >>> m[1] = np.NA
> >> >>> np.sum(m)
> >> 2
> >> >>> np.sum(m, skipna=True)
> >> 2
> >> >>> m.mask
> >> [False, False, True]
> >> >>> bm[1] = np.NA
> >> >>> np.sum(bm)
> >> 2
> >> >>> np.sum(bm, skipna=True)
> >> 2
> >> >>> bm.mask
> >> [False, False, True]
> >>
> >> So:
> >>
> >> * Mask takes precedence over bit pattern on element assignment. There's
> >>  still the question of how to assign a bit pattern NA when the mask is
> >>  active.
> >>
> >> * When using mask, elements are automagically skipped.
> >>
> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
> >>
> >> * When using bit pattern + mask, it might make sense to have the initial
> >>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
> >>  False, True]" and "np.sum(bm) == np.NA")
> >
> > There seems to be a general idea that masks and NA bit patterns imply
> > particular differing semantics, something which I think is simply false.
>
> Well - first - it's helpful surely to separate the concepts and the
> implementation.
>
> Concepts / use patterns (as delineated by Nathaniel):
> A) missing values == 'np.NA' in my emails.  Can we call that CMV
> (concept missing values)?
> B) masks == np.IGNORE in my emails. CMSK (concept masks)?
>

This is a different conceptual model than I'm proposing in the NEP. This is
also exactly what I was trying to clarify in the first email in this thread
under the headings "Missing Data Abstraction" and "Implementation
Techniques". Masks are *just* an implementation technique. They imply
nothing more, except through previously established conventions such as in
various bitmasks, image masks, numpy.ma and others.

masks != np.IGNORE
bit patterns != np.NA

Masks vs bit patterns and R's default NA vs na.rm NA semantics are
completely independent, except where design choices are made that they
should be related. I think they should be unrelated: masks and bit patterns
are two approaches to solving the same problem.

>
> Implementations
> 1) bit-pattern == na-dtype - how about we call that IBP
> (implementation bit pattern)?
> 2) array.mask.  IM (implementation mask)?
>
> Nathaniel implied that:
>
> CMV implies: sum([np.NA, 1]) == np.NA
> CMSK implies sum([np.NA, 1]) == 1
>
> and indeed, that's how R and masked arrays respectively behave.

R and numpy.ma.  If we're trying to be clear about our concepts and
implementations, numpy.ma is just one possible implementation of masked
arrays.
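
For reference, a small sketch of how numpy.ma itself behaves today, using
only the existing numpy.ma API. The values are made up for illustration:
reductions skip masked entries, while the underlying data stays in memory.

```python
import numpy as np

# First entry is masked; its stored value (10.0) is hidden, not erased.
x = np.ma.array([10.0, 1.0], mask=[True, False])

total = x.sum()           # reduction over the unmasked entries only
hidden_value = x.data[0]  # the raw data is still accessible
```

This is the default of one particular masked-array implementation, not
something the mask storage itself forces.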

> So I
> think it's reasonable to say that at least R thought that the bitmask
> implied the first and Pierre and others thought the mask meant the
> second.
>

R's model is based on years of experience and a model of what missing values
imply; the bitmask implies nothing about the behavior of NA.

>
> The NEP as it stands thinks of CMV and CMSK as being different views
> of the same thing.  Please correct me if I'm wrong.
>
> > Both NaN and Inf are implemented in hardware with the same idea as the NA
> > bit pattern, but they do not follow NA missing value semantics.
>
> Right - and that doesn't affect the argument, because the argument is
> about the concepts and not the implementation.
>

You just said R thought bitmasks implied something, and you're saying masked
arrays imply something. If the argument is just about the missing value
concepts, neither of these should be in the present discussion.
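
The quoted remark about NaN and Inf can be checked directly in NumPy. A
minimal sketch: NaN propagates through arithmetic much as an NA would, but
comparisons against it return a definite False rather than a "missing"
result (in R, by contrast, NA == NA evaluates to NA), and Inf behaves as an
ordinary extended value.

```python
import numpy as np

nan_sum = np.sum(np.array([np.nan, 1.0]))  # NaN propagates through the sum
nan_equal = (np.nan == np.nan)             # a definite False, not "missing"
nan_less = (np.nan < 1.0)                  # likewise a definite False
inf_sum = np.inf + 1.0                     # still Inf
inf_greater = (np.inf > 1.0)               # ordinary ordered comparison
```

So a special bit pattern in the storage does not by itself dictate missing
value semantics.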

>
> > As far as I can tell, the only required difference between them is that NA
> > bit patterns must destroy the data. Nothing else.
>
> I think Nathaniel's point was about the expected default behavior in
> the different concepts.
>
> > Everything on top of that
> > is a choice of API and interface mechanisms. I want them to behave exactly
> > the same except for that necessary difference, so that it will be possible
> > to use the *exact same Python code* with either approach.
>
> Right.  And Nathaniel's point is that that desire leads to fusion of
> the two ideas into one when they should be separated.  For example, if
> I understand correctly:
>
> >>> a = np.array([1.0, 2.0, 3, 7.0], masked=True)
> >>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
> >>> a[3] = np.NA  # magic mask setting although it looks the same
> >>> b[3] = np.NA  # actual real hand-on-heart assignment
>

Why is one "magic" and the other "real"? All of this is already sitting on
100 layers of abstraction above electrons and atoms. If we're talking about
"real," maybe we should be programming in machine code or using breadboards
with individual transistors.

>
> > Say you're using NA dtypes, and suddenly you think, "what if I temporarily
> > treated these as NA too". Now you have to copy your whole array to avoid
> > destroying your data! The NA bit pattern didn't save you memory here... Say
> > you're using masks, and it turns out you didn't actually need masking
> > semantics. If they're different, you now have to do lots of code changes to
> > switch to NA dtypes!
>
> I personally have not run across that case.  I'd imagine that, if you
> knew you wanted to do something so explicitly masking-like, you'd
> start with the masking interface.
>

People's use cases change over time, and sometimes one person's code is
useful for others. I'd prefer to let people share.
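
The "temporarily treat these as NA" scenario from the quoted paragraph can
be sketched with tools that exist today, assuming numpy.ma for the mask and
NaN as a stand-in for an NA bit pattern (np.NA itself is not available in
released NumPy). The sample values are made up for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 7.0])

# Mask storage: hide an element for a while; its value stays in memory.
m = np.ma.array(data.copy())
m[2] = np.ma.masked        # hide index 2
sum_hidden = m.sum()       # sum over the visible entries
m.mask = np.ma.nomask      # expose everything again
sum_restored = m.sum()     # the original value is back

# Bit-pattern storage: marking an entry overwrites it, so keeping the
# original value requires working on a copy.
scratch = data.copy()
scratch[2] = np.nan
sum_scratch = np.nansum(scratch)
```

With the mask, nothing is lost when the entry is hidden and exposed again.
With the bit pattern, the copy is what preserves the original 3.0.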

> Clearly there are some overlaps between what masked arrays are trying
> to achieve and what R's NA mechanisms are trying to achieve.  Are they
> really similar enough that they should function using the same API?
>

Yes.

> And if so, won't that be confusing?

No, I don't believe so, any more than NAs in R, NaNs, or Infs are already
confusing.

-Mark

> I think that's the question
> that's being asked.
>
> See you,
>
> Matthew