[Numpy-discussion] An NA compromise idea -- many-NA

Mark Wiebe mwwiebe at gmail.com
Fri Jul 1 16:33:34 EDT 2011


On Fri, Jul 1, 2011 at 3:29 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:

>
>
> On Fri, Jul 1, 2011 at 2:26 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>
>> On Fri, Jul 1, 2011 at 3:20 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>
>>>> On Fri, Jul 1, 2011 at 3:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>
>>>> On Fri, Jul 1, 2011 at 3:46 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>> > I propose a simple idea *for the long term* for generalizing Mark's
>>>> > proposal, one that I hope may put some people behind Mark's concrete
>>>> > proposal in the short term.
>>>> >
>>>> > The key feature missing in Mark's proposal is the ability to
>>>> > distinguish between different reasons for NA-ness: IGNORE vs. NA.
>>>> > However, one could conceive of wanting to track a whole host of
>>>> > reasons:
>>>> >
>>>> > homework_grades = np.asarray([2, 3, 1, EATEN_BY_DOG, 5, SICK, 2,
>>>> >                               TOO_LAZY])
>>>> >
>>>> > Wouldn't it be a shame to put a lot of work into NA, but then have
>>>> > users still keep a separate "shadow array" for stuff like this?
>>>> >
>>>> > a) In this case the generality of Mark's proposal seems justified and
>>>> > less confusing to teach newcomers (?)
>>>> >
>>>> > b) Since Mark's proposal seems to generalize well to many NAs (there
>>>> > are 8 bits in the mask, and millions of available NaNs in floating
>>>> > point), if people agreed to this, one could leave it for later and
>>>> > just go on with the proposed idea.
>>>> >
>>>>
>>>> I have not been following the discussion in much detail, so forgive me
>>>> if this has come up, but I think this approach is also similar to the
>>>> thinking behind missing values in SAS and "extended" missing values in
>>>> Stata. They are missing but preserve an order. This way you can pull
>>>> out the values that are missing because they were eaten by a dog and
>>>> see whether these missing ones are systematically different from the
>>>> ones that are missing because they're too lazy. A use case that pops
>>>> to mind: seeing whether the various forms of attrition in surveys or
>>>> experiments vary in a non-random way.
>>>>
>>>>
>>>> http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000989180.htm
>>>> http://www.stata.com/help.cgi?missing
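
(As an aside, to make the reason-tracking use case above concrete: here is a
minimal sketch of how it can be emulated today with a plain "shadow" reason
array. The reason codes are invented for the example, and none of this is
the proposed NumPy API.)

    import numpy as np

    # Illustrative reason codes; 0 means "not missing".
    NOT_MISSING, EATEN_BY_DOG, SICK, TOO_LAZY = 0, 1, 2, 3

    homework_grades = np.array([2., 3., 1., np.nan, 5., np.nan, 2., np.nan])
    reason = np.array([0, 0, 0, EATEN_BY_DOG, 0, SICK, 0, TOO_LAZY],
                      dtype=np.uint8)

    # Pull out which observations are missing for one particular reason,
    # e.g. to check whether attrition is systematic rather than random.
    eaten_idx = np.flatnonzero(reason == EATEN_BY_DOG)
    present = homework_grades[reason == NOT_MISSING]
    mean_present_grade = present.mean()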
>>>
>>>
>>> That's interesting, and I see that they use a numerical ordering for the
>>> different NA values. I think that if, instead of using the AND operator to
>>> combine masks, we used MINIMUM, this behavior would happen naturally with
>>> almost no additional work. Then, in addition to np.NA and np.NA(dtype), it
>>> could allow np.NA(dtype, ID) to assign an ID between 1 and 255, where 1 is
>>> the default.
>>>
>>
>> Sorry, my brain is a bit addled by all these comments. This idea would
>> also require flipping the mask so that 0 is unmasked and 1 to 255 is
>> masked, as Christopher pointed out in a different thread.
>>
>
> Or you could subtract instead of add and use maximum instead of minimum. I
> thought those details would be hidden.
>

Definitely, but the most natural distinction, thinking numerically, is
between zero and non-zero, and there's only one zero, so giving zero the
'unmasked' value is the natural choice for this way of extending it. If you
follow Joe's idea, where you're basically introducing it as an image alpha
mask, you would instead have 0 be fully masked, 128 be 50% masked, and 255
be fully unmasked.
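
To make the two conventions concrete, here is a small sketch using plain
uint8 arrays and only operations that exist today (np.NA(dtype, ID) above is
just a suggested spelling, so the IDs here are ordinary integers):

    import numpy as np

    # ID convention: 0 = unmasked, 1..255 = NA carrying an ID (1 = default NA).
    a_na = np.array([0, 0, 3, 0, 1], dtype=np.uint8)
    b_na = np.array([0, 2, 0, 0, 0], dtype=np.uint8)

    # Combine with an elementwise extremum instead of AND: any nonzero byte
    # makes the result NA, and the surviving ID is well defined.  With 0 as
    # "unmasked" the combiner is maximum; the flipped orientation would use
    # minimum instead.
    out_na = np.maximum(a_na, b_na)          # -> [0, 2, 3, 0, 1]

    # Alpha-mask reading: 0 is fully masked, 128 roughly 50% masked,
    # 255 fully unmasked, so the byte acts as a per-element weight.
    alpha = np.array([255, 128, 0, 255, 64], dtype=np.uint8)
    weights = alpha / 255.0
    data = np.array([2., 4., 6., 8., 10.])
    weighted_mean = (weights * data).sum() / weights.sum()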

-Mark


>
> Chuck
>
>