[Numpy-discussion] An NA compromise idea -- many-NA

Fri Jul 1 16:49:39 EDT 2011

On Fri, Jul 1, 2011 at 2:42 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:

> On Fri, Jul 1, 2011 at 3:36 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>> On Fri, Jul 1, 2011 at 2:33 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>
>>> On Fri, Jul 1, 2011 at 3:29 PM, Charles R Harris <
>>> charlesr.harris at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Jul 1, 2011 at 2:26 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>>>
>>>>> On Fri, Jul 1, 2011 at 3:20 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>>>>
>>>>>> On Fri, Jul 1, 2011 at 3:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Jul 1, 2011 at 3:46 PM, Dag Sverre Seljebotn
>>>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>> > I propose a simple idea *for the long term* for generalizing Mark's
>>>>>>> > proposal, that I hope may perhaps put some people behind Mark's
>>>>>>> > concrete proposal in the short term.
>>>>>>> >
>>>>>>> > A key feature missing in Mark's proposal is the ability to
>>>>>>> > distinguish between different reasons for NA-ness; IGNORE vs. NA.
>>>>>>> > However, one could conceive wanting to track a whole host of reasons:
>>>>>>> >
>>>>>>> > homework_grades = np.asarray([2, 3, 1, EATEN_BY_DOG, 5, SICK, 2, TOO_LAZY])
>>>>>>> >
>>>>>>> > Wouldn't it be a shame to put a lot of work into NA, but then have
>>>>>>> > users still keep a separate "shadow-array" for stuff like this?
>>>>>>> >
>>>>>>> > a) In this case the generality of Mark's proposal seems justified
>>>>>>> > and less confusing to teach newcomers (?)
>>>>>>> >
>>>>>>> > b) Since Mark's proposal seems to generalize well to many NAs
>>>>>>> > (there are 8 bits in the mask, and millions of available NaNs in
>>>>>>> > floating point), if people agreed to this one could leave it for
>>>>>>> > later and just go on with the proposed idea.
>>>>>>> >
>>>>>>>
>>>>>>> I have not been following the discussion in much detail, so forgive
>>>>>>> me if this has come up. But I think this approach is also similar to
>>>>>>> the thinking behind missing values in SAS and "extended" missing
>>>>>>> values in Stata. They are missing but preserve an order. This way you
>>>>>>> can pull out values that are missing because they were eaten by a dog
>>>>>>> and see if these missing ones are systematically different from the
>>>>>>> ones that are missing because the student was too lazy. A use case
>>>>>>> that comes to mind: seeing whether the various ways of attrition in
>>>>>>> surveys or experiments vary in a non-random way.
>>>>>>>
>>>>>>>
>>>>>>> http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000989180.htm
>>>>>>> http://www.stata.com/help.cgi?missing
>>>>>>
>>>>>>
>>>>>> That's interesting, and I see that they use a numerical ordering for
>>>>>> the different NA values. I think if instead of using the AND operator to
>>>>>> combine masks, we use MINIMUM, this behavior would happen naturally with
>>>>>> almost no additional work. Then, in addition to np.NA and np.NA(dtype), it
>>>>>> could allow np.NA(dtype, ID) to assign an ID between 1 and 255, where 1 is
>>>>>> the default.
>>>>>>
>>>>>
>>>>> Sorry, my brain is a bit addled by all these comments. This idea would
>>>>> also require flipping the mask so that 0 is unmasked and 1 to 255 is
>>>>> masked, as Christopher pointed out in a different thread.
>>>>>
>>>>
>>>> Or you could subtract instead of add and use maximum instead of minimum.
>>>> I thought those details would be hidden.
>>>>
>>>
>>> Definitely, but the most natural distinction thinking numerically is
>>> between zero and non-zero, and there's only one zero, so giving it the
>>> 'unmasked' value is natural for this way of extending it. If you follow
>>> Joe's idea where you're basically introducing it as an image alpha mask, you
>>> would have 0 be fully masked, 128 be 50% masked, and 255 be fully unmasked.
>>>
>>>
>> I'm not complaining ;) I thought these ideas were out there from the
>> beginning, but maybe that was just me...
>>
>
> You're right, but it feels like it's been 10 years in internet time by now.
> :)
>
> The design has evolved a lot from all the feedback too, so revisiting some
> of these things that initially may not have felt like they fit doesn't
> hurt. I'm not so keen on rereading 250+ email messages though...
>
>
I wouldn't worry about it too much. You chose masks as one of the
fundamental options because of their generality, and this is one of the
consequences of that generality. I was also thinking about this in terms of
Pierre's soft/hard mask distinction; I don't know about the shared mask
thing.

Several questions that have also been floating about in my mind are these:
Can you mask an array with NA values? Can you mask a masked array with a
view?
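
For concreteness, the many-NA mask discussed above might be sketched roughly
as follows. This is only an illustration under stated assumptions, not the
proposed NumPy API: it assumes the flipped convention (0 is unmasked, 1 to
255 are NA IDs), uses a plain uint8 "reason" array next to the data, and
combines masks with MAXIMUM so that the larger NA ID wins. The reason-code
names, the second mask and the combine_masks helper are all hypothetical.

```python
import numpy as np

# Hypothetical NA IDs; 0 means "unmasked" (assumed convention).
VALID, EATEN_BY_DOG, SICK, TOO_LAZY = 0, 1, 2, 3

# Data array plus a uint8 array that records why each entry is NA.
grades = np.array([2, 3, 1, 0, 5, 0, 2, 0])
reason = np.array(
    [VALID, VALID, VALID, EATEN_BY_DOG, VALID, SICK, VALID, TOO_LAZY],
    dtype=np.uint8,
)

# A second, made-up mask, e.g. from another source of missingness.
other = np.array([VALID, VALID, SICK, VALID, VALID, VALID, VALID, SICK],
                 dtype=np.uint8)


def combine_masks(mask_a, mask_b):
    """Combine two NA-ID masks; an entry is NA if either input is NA.

    With 0 as "unmasked", the elementwise maximum keeps the larger NA ID.
    """
    return np.maximum(mask_a, mask_b)


combined = combine_masks(reason, other)

# Reductions only see entries whose combined ID is VALID.
valid_mean = grades[combined == VALID].mean()

# The same data expressed with the existing numpy.ma machinery,
# which only knows masked vs. unmasked and drops the reason codes.
masked = np.ma.array(grades, mask=(combined != VALID))

# Pull out one specific kind of NA, as in the SAS/Stata use case.
sick_positions = np.flatnonzero(combined == SICK)
```

Under the opposite convention (nonzero is unmasked) the same combination
would use MINIMUM, which is the variant first suggested above.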

Chuck