[Numpy-discussion] missing data discussion round 2

Mark Wiebe mwwiebe at gmail.com
Wed Jun 29 13:34:48 EDT 2011


On Wed, Jun 29, 2011 at 8:45 AM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> ...
> >> > (You might think, what difference does it make if you *can* unmask an
> >> > item? Us missing data folks could just ignore this feature. But:
> >> > whatever we end up implementing is something that I will have to
> >> > explain over and over to different people, most of them not
> >> > particularly sophisticated programmers. And there's just no sensible
> >> > way to explain this idea that if you store some particular value, then
> >> > it replaces the old value, but if you store NA, then the old value is
> >> > still there.
> >>
> >> Ouch - yes.  No question, that is difficult to explain.   Well, I
> >> think the explanation might go like this:
> >>
> >> "Ah, yes, well, that's because in fact numpy records missing values by
> >> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
> >> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >>
> >> Is that fair?
> >
> > My favorite way of explaining it would be to have a grid of numbers
> > written on paper, then have several cardboards with holes poked in them
> > in different configurations. Placing these cardboard masks in front of
> > the grid would show different sets of non-missing data, without
> > affecting the values stored on the paper behind them.
>
> Right - but here of course you are trying to explain the mask, and
> this is Nathaniel's point, that in order to explain NAs, you have to
> explain masks, and so, even at a basic level, the fusion of the two
> ideas is obvious, and already confusing.  I mean this:
>
> a[3] = np.NA
>
> "Oh, so you just set the a[3] value to have some missing value code?"
>

I would answer "Yes, that's basically true." The abstraction works that way,
and there's no reason to confuse people with those implementation details
right off the bat. When you introduce a new user to floating point numbers,
it would seem odd to first point out that addition isn't associative. That
kind of detail is important when you're learning more about the system and
digging deeper.

I think it was in a Knuth book that I read the idea that the best teaching
is a series of lies that successively correct the previous lies.
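
To make this concrete, here is a toy sketch of the "first lie": a plain
values array with a parallel boolean array standing in for whatever mask
representation we actually end up using (just an illustration, not the
proposed implementation):

    import numpy as np

    # Toy model only: a values array plus a parallel boolean "valid" array,
    # standing in for whatever mask representation is actually used.
    values = np.array([1.0, 2.0, 3.0, 4.0])
    valid = np.ones(values.shape, dtype=bool)

    # The abstraction: after "a[3] = np.NA" the element reads back as missing...
    valid[3] = False
    print(values[valid])   # [1. 2. 3.]

    # ...while the old value is still physically there behind the mask.
    print(values[3])       # 4.0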


> "Ah - no - in fact what I did was set a associated mask in position
> a[3] so that you can't any longer see the previous value of a[3]"
>
> "Huh.  You mean I have a mask for every single value in order to be
> able to blank out a[3]?  It looks like an assignment.  I mean, it
> looks just like a[3] = 4.  But I guess it isn't?"
>
> "Er..."
>
> I think Nathaniel's point is a very good one - these are separate
> ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> draw them together in the mind of the user.


R jointly implements them with the na.rm=TRUE parameter, and that's our model
system for missing data.
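
For reference, the behavior we're emulating looks roughly like this in
current numpy, using NaN purely as a stand-in for NA (an approximation for
illustration, not the proposed API):

    import numpy as np

    # Rough analogue of R's mean(x) vs mean(x, na.rm=TRUE).
    x = np.array([1.0, 2.0, np.nan, 4.0])
    print(x.mean())                # nan, like mean(x) in R
    print(x[~np.isnan(x)].mean())  # 2.333..., like mean(x, na.rm=TRUE)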


> Apart from anything
> else, the user has to know that, if they want a single NA value in an
> array, they have to add a mask of array.shape extra bytes.  They have
> to know then, that NA is implemented by masking, and then the 'NA for
> free by adding masking' idea breaks down and starts to feel like a
> kludge.
>
> The counter argument is of course that, in time, the implementation of
> NA with masking will seem as obvious and intuitive as, say,
> broadcasting, and that we are just reacting from lack of experience
> with the new API.
>

It will work literally the same as the implementation with NA dtypes, except
for the masking semantics, which require the extra step of taking views.
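
As a rough analogy with something that exists today (numpy.ma is the
current masked-array module, not the proposed core support, but the "hide
without destroying" semantics are similar):

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False] * 4)
    a[3] = np.ma.masked    # hide the element

    print(a)        # [1.0 2.0 3.0 --]
    print(a.data)   # [1. 2. 3. 4.], the value is hidden, not destroyed

Under the proposal, you would take a view of the array beforehand if you
wanted to keep a way to get at the unmasked value afterwards.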


>
> Of course, that does happen, but here, unless I am mistaken, the
> primary drive to fuse NA and masking is ease of
> implementation.


That's not the case, and I've tried to give a slightly better justification
for this in my reply to Lluis' email.


> That doesn't necessarily mean that they don't go
> together - if something is easy to implement, sometimes it means it
> will also feel natural in use, but at least we might say that there is
> some risk of the implementation driving the API, and that that can
> lead to problems.
>

In this design process, the implementation concerns and the interface
concerns are affecting each other, but the missing data semantics are
the main driver.

-Mark


>
> See you,
>
> Matthew
>

