[Numpy-discussion] missing data discussion round 2

Wed Jun 29 09:45:13 EDT 2011

Hi,

On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> ...
>> > (You might think, what difference does it make if you *can* unmask an
>> > item? Us missing data folks could just ignore this feature. But:
>> > whatever we end up implementing is something that I will have to
>> > explain over and over to different people, most of them not
>> > particularly sophisticated programmers. And there's just no sensible
>> > way to explain this idea that if you store some particular value, then
>> > it replaces the old value, but if you store NA, then the old value is
>> > still there.
>>
>> Ouch - yes.  No question, that is difficult to explain.   Well, I
>> think the explanation might go like this:
>>
>> "Ah, yes, well, that's because in fact numpy records missing values by
>> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
>> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>>
>> Is that fair?
>
> My favorite way of explaining it would be to have a grid of numbers written
> on paper, then have several cardboards with holes poked in them in different
> configurations. Placing these cardboard masks in front of the grid would
> show different sets of non-missing data, without affecting the values stored
> on the paper behind them.

Right - but here of course you are trying to explain the mask, and
this is Nathaniel's point, that in order to explain NAs, you have to
explain masks, and so, even at a basic level, the fusion of the two
ideas is obvious, and already confusing.  I mean this:

a[3] = np.NA

"Oh, so you just set the a[3] value to have some missing value code?"

"Ah - no - in fact what I did was set an associated mask in position
a[3] so that you can no longer see the previous value of a[3]"

"Huh.  You mean I have a mask for every single value in order to be
able to blank out a[3]?  It looks like an assignment.  I mean, it
looks just like a[3] = 4.  But I guess it isn't?"

"Er..."

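To make the two readings of a[3] = np.NA concrete, here is a rough
sketch in plain NumPy.  np.NA is only a proposal at this point, so NaN
stands in for the missing-value code and a separate boolean array
stands in for the mask; the names `coded`, `values` and `visible` are
made up for illustration and are not part of any NumPy API.

```python
import numpy as np

# Reading 1: NA is a missing-value code stored in the array itself.
# NaN stands in for that code here (an assumption for illustration).
coded = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
coded[3] = np.nan            # overwrites the stored value; 4.0 is gone

# Reading 2: NA is a mask entry kept next to the data.
# `values` and `visible` are hypothetical names, not a NumPy API.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
visible = np.ones(values.shape, dtype=bool)   # True means "can be seen"
visible[3] = False           # hides position 3; values[3] is untouched

hidden_value = values[3]     # the old value is still behind the mask
visible[3] = True            # unmasking brings the old value back
print(np.isnan(coded[3]), hidden_value, values[visible])
```

Both spellings look like an assignment, but only the first one
actually replaces what was stored.
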
I think Nathaniel's point is a very good one - these are separate
ideas, np.NA and np.IGNORE, and a joint implementation is bound to
draw them together in the mind of the user.  Apart from anything
else, the user has to know that, if they want a single NA value in an
array, they have to add a mask the size of array.shape, in bytes.
They then have to know that NA is implemented by masking, and at that
point the 'NA for free by adding masking' idea breaks down and starts
to feel like a kludge.
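
As a back-of-the-envelope check on that cost, assuming the mask is a
boolean array with one byte per element (the `mask` array below is a
stand-in I made up, not the proposed implementation):

```python
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float64)

# Hypothetical mask: one bool per element of `a`, True means visible.
mask = np.ones(a.shape, dtype=bool)
mask[3, 3] = False           # a single hidden value in the whole array

# The mask costs one byte per element, however few values are hidden.
print("data bytes:", a.nbytes)
print("mask bytes:", mask.nbytes)
```

So one NA in a float64 array adds a mask an eighth the size of the
data, whether one value or every value is missing.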

The counter-argument is of course that, in time, the implementation of
NA with masking will seem as obvious and intuitive as, say,
broadcasting, and that we are just reacting from lack of experience
with the new API.

Of course, that does happen, but here, unless I am mistaken, the
primary drive to fuse NA and masking is ease of implementation.  That
doesn't necessarily mean that they don't go together - if something
is easy to implement, sometimes that means it will also feel natural
in use.  But at least we might say that there is some risk of the
implementation driving the API, and that this can lead to problems.

See you,

Matthew