[Numpy-discussion] missing data discussion round 2

Pierre GM pgmdevlist at gmail.com
Wed Jun 29 13:05:22 EDT 2011


Matthew, Dag, +1.
On Jun 29, 2011 4:35 PM, "Dag Sverre Seljebotn" <d.s.seljebotn at astro.uio.no>
wrote:
> On 06/29/2011 03:45 PM, Matthew Brett wrote:
>> Hi,
>>
>> On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe<mwwiebe at gmail.com> wrote:
>>> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett<matthew.brett at gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith<njs at pobox.com> wrote:
>>>> ...
>>>>> (You might think, what difference does it make if you *can* unmask an
>>>>> item? Us missing data folks could just ignore this feature. But:
>>>>> whatever we end up implementing is something that I will have to
>>>>> explain over and over to different people, most of them not
>>>>> particularly sophisticated programmers. And there's just no sensible
>>>>> way to explain this idea that if you store some particular value, then
>>>>> it replaces the old value, but if you store NA, then the old value is
>>>>> still there.
>>>>
>>>> Ouch - yes. No question, that is difficult to explain. Well, I
>>>> think the explanation might go like this:
>>>>
>>>> "Ah, yes, well, that's because in fact numpy records missing values by
>>>> using a 'mask'. So when you say `a[3] = np.NA`, what you mean is,
>>>> `a._mask = np.ones(a.shape, dtype=bool); a._mask[3] = False`"
>>>>
>>>> Is that fair?
>>>
>>> My favorite way of explaining it would be to have a grid of numbers written
>>> on paper, then have several cardboards with holes poked in them in different
>>> configurations. Placing these cardboard masks in front of the grid would
>>> show different sets of non-missing data, without affecting the values stored
>>> on the paper behind them.
>>
>> Right - but here of course you are trying to explain the mask, and
>> this is Nathaniel's point, that in order to explain NAs, you have to
>> explain masks, and so, even at a basic level, the fusion of the two
>> ideas is obvious, and already confusing. I mean this:
>>
>> a[3] = np.NA
>>
>> "Oh, so you just set the a[3] value to have some missing value code?"
>>
>> "Ah - no - in fact what I did was set an associated mask in position
>> a[3] so that you can no longer see the previous value of a[3]"
>>
>> "Huh. You mean I have a mask for every single value in order to be
>> able to blank out a[3]? It looks like an assignment. I mean, it
>> looks just like a[3] = 4. But I guess it isn't?"
>>
>> "Er..."
>>
>> I think Nathaniel's point is a very good one - these are separate
>> ideas, np.NA and np.IGNORE, and a joint implementation is bound to
>> draw them together in the mind of the user. Apart from anything
>> else, the user has to know that, if they want a single NA value in an
>> array, they have to add a mask of size array.shape, in bytes. They have
>> to know then, that NA is implemented by masking, and then the 'NA for
>> free by adding masking' idea breaks down and starts to feel like a
>> kludge.
>>
>> The counter argument is of course that, in time, the implementation of
>> NA with masking will seem as obvious and intuitive, as, say,
>> broadcasting, and that we are just reacting from lack of experience
>> with the new API.
>
> However, no matter how used we get to this, people coming from almost
> any other tool (in particular R) will keep thinking it is
> counter-intuitive. Why set up a major semantic incompatibility that
> people then have to overcome in order to start using NumPy?
>
> I really don't see what's wrong with some more explicit API like
> a.mask[3] = True. "Explicit is better than implicit".
>
> Dag Sverre
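
A minimal sketch of the distinction argued over above, using the existing
numpy.ma module as a stand-in, since np.NA and np.IGNORE are only proposals
in this thread. It assumes nothing about the eventual API; the variable names
are made up for illustration. Note that numpy.ma uses True for "hidden",
the opposite of the convention in Matthew's `_mask` example.

```python
import numpy as np

# numpy.ma is used as a stand-in for the NA-by-mask design under discussion.
data = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
a = np.ma.array(data, copy=True)

# This looks like an ordinary assignment such as a[3] = 4, but it only sets
# a mask entry (True means hidden in numpy.ma); the stored value survives.
a[3] = np.ma.masked
hidden_value = a.data[3]

# An ordinary assignment really does replace the stored value.
a[1] = 4.0

# Masking one element allocated a full boolean mask, one byte per element.
mask_bytes = a.mask.nbytes

# The explicit spelling Dag suggests: manipulate the mask directly.
b = np.ma.array(data, mask=np.zeros(data.shape, dtype=bool))
b.mask[3] = True
masked_while_hidden = b[3] is np.ma.masked
b.mask[3] = False  # lifting the mask reveals the original value again
```

With the explicit form there is no assignment-like syntax to explain away:
the data and the mask are visibly two separate arrays.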