[Numpy-discussion] An NA compromise idea -- many-NA

Fri Jul 1 16:26:30 EDT 2011

On Fri, Jul 1, 2011 at 3:20 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:

> On Fri, Jul 1, 2011 at 3:01 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>
>> On Fri, Jul 1, 2011 at 3:46 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>> > I propose a simple idea *for the long term* for generalizing Mark's
>> > proposal, that I hope may perhaps put some people behind Mark's concrete
>> > proposal in the short term.
>> >
>> > A key feature missing in Mark's proposal is the ability to distinguish
>> > between different reasons for NA-ness: IGNORE vs. NA. However, one could
>> > conceive of wanting to track a whole host of reasons:
>> >
>> > homework_grades = np.asarray([2, 3, 1, EATEN_BY_DOG, 5, SICK, 2, TOO_LAZY])
>> >
>> > Wouldn't it be a shame to put a lot of work into NA, but then have users
>> > still keep a separate "shadow-array" for stuff like this?
>> >
>> > a) In this case the generality of Mark's proposal seems justified and
>> > less confusing to teach newcomers (?)
>> >
>> > b) Since Mark's proposal seems to generalize well to many NAs (there's 8
>> > bits in the mask, and millions of available NaN-s in floating point), if
>> > people agreed to this one could leave it for later and just go on with
>> > the proposed idea.
>> >
>>
>> I have not been following the discussion in much detail, so forgive me
>> if this has come up. But I think this approach is also similar to the
>> thinking behind missing values in SAS and "extended" missing values in
>> Stata. They are missing but preserve an order. This way you can pull
>> out values that are missing because they were eaten by a dog and see
>> if these missing ones are systematically different than the ones that
>> are missing because they're too lazy. One use case that comes to mind:
>> checking whether the various kinds of attrition in surveys or
>> experiments vary in a non-random way.
>>
>>
>> http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000989180.htm
>> http://www.stata.com/help.cgi?missing
>
>
> That's interesting, and I see that they use a numerical ordering for the
> different NA values. I think if instead of using the AND operator to combine
> masks, we use MINIMUM, this behavior would happen naturally with almost no
> additional work. Then, in addition to np.NA and np.NA(dtype), it could allow
> np.NA(dtype, ID) to assign an ID between 1 and 255, where 1 is the default.
>

Sorry, my brain is a bit addled by all these comments. This idea would also
require flipping the mask so that 0 is unmasked and 1 to 255 are masked, as
Christopher pointed out in a different thread.
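
A minimal sketch of how this could look with plain arrays, assuming the
flipped convention (0 is unmasked, 1 to 255 are masked). The reason codes
and the separate uint8 "mask" arrays are hypothetical stand-ins; np.NA and
np.NA(dtype, ID) are only proposed here, so nothing below uses them. With 0
meaning unmasked, the elementwise maximum is what propagates NA when two
masks are combined; as I read it, that is the counterpart of MINIMUM under
the opposite mask polarity.

```python
import numpy as np

# Hypothetical reason codes; 0 means "not missing" under the flipped convention.
VALID, EATEN_BY_DOG, SICK, TOO_LAZY = 0, 1, 2, 3

# Two homework assignments, each with a data array and a uint8 reason mask.
grades_a = np.array([2, 3, 1, 0, 5, 0, 2, 0])
mask_a = np.array([0, 0, 0, EATEN_BY_DOG, 0, SICK, 0, TOO_LAZY], dtype=np.uint8)

grades_b = np.array([1, 0, 4, 2, 3, 0, 1, 5])
mask_b = np.array([0, SICK, 0, 0, 0, TOO_LAZY, 0, 0], dtype=np.uint8)

# Combine the data, then combine the masks. An element is missing in the
# result if it is missing in either input; the larger reason code wins.
total = grades_a + grades_b
mask_total = np.maximum(mask_a, mask_b)

# Reductions only look at the unmasked elements.
valid = mask_total == VALID
mean_total = total[valid].mean()

# Pull out the positions that are missing for one particular reason.
too_lazy_idx = np.flatnonzero(mask_total == TOO_LAZY)
```

Which reason survives when both inputs are missing is purely a matter of how
the codes are ordered, much like the ordered missing values in SAS and Stata.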

-Mark

>
> -Mark
>
>
>>
>>
>> Maybe this is neither here nor there; I just don't want to end up with
>> "the R way is the only way." That's why I prefer Python :)
>>
>> Skipper
>>
>
>