[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 17:56:14 EDT 2011

On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root <ben.root at ou.edu> wrote:
>> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> I guess that is a difference, but I'm trying to get at something more
>>> fundamental -- not just what operations are allowed, but what
>>> operations people *expect* to be allowed.
>>
>> That is a much trickier problem.
>
> It can be. I think of it as the difference between design and coding.
> They overlap less than one might expect...
>
>>> Here's another possible difference -- in (1), intuitively, missingness
>>> is a property of the data, so the logical place to put information
>>> about whether you can expect missing values is in the dtype, and to
>>> enable missing values you need to make a new array with a new dtype.
>>> (If we use a mask-based implementation, then
>>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
>>> to skip making a copy of the data -- I'm talking ONLY about the
>>> interface here, not whether missing data has a different storage
>>> format from non-missing data.)
>>>
>>> In (2), the whole point is to use different masks with the same data,
>>> so I'd argue masking should be a property of the array object rather
>>> than the dtype, and the interface should logically allow masks to be
>>> created, modified, and destroyed in place.
>>>
>>
>> I can agree with this distinction.  However, if "missingness" is an
>> intrinsic property of the data, then shouldn't users be implementing their
>> own dtype tailored to the data they are using?  In other words, how far does
>> the core of NumPy need to go to address this issue?  And how far would be
>> "too much"?
>
> Yes, that's exactly my question: whether our goal is to implement
> missingness in numpy or not!
>
>>>
>>> They're both internally consistent, but I think we might have to make
>>> a decision and stick to it.
>>>
>>
>> Of course.  I think that Mark has had a very inspired idea of giving the R
>> audience what they want (np.NA), while simultaneously making the use of
>> masked arrays even easier (which I can certainly appreciate).
>
> I don't know. I think we could build a really top-notch implementation
> of missingness. I also think we could build a really top-notch
> implementation of masking. But my suggestions for how to improve the
> current design are totally different depending on which of those is
> the goal, and neither the R audience (like me) nor the masked array
> audience (like you) seems really happy with the current design. And I
> don't know what the goal is -- maybe it's something else and the
> current design hits it perfectly? Maybe we want a top-notch
> implementation of *both* missingness and masking, and those should be
> two different things that can be combined, so that some of the
> unmasked values inside a masked array can be NA? I don't know.
>
>> I will put out a little disclaimer.  I once had to use S+ for a class.  To
>> be honest, it was the worst programming experience in my life.  This
>> experience may be coloring my perception of R's approach to handling missing
>> data.
>
> There are a lot of things that R does wrong (not their fault; language
> design is an extremely difficult and specialized skill that
> statisticians are not exactly trained in), but it did make a few
> excellent choices at the beginning. One was to steal the execution
> model from Scheme, which, uh, isn't really relevant here. The other
> was to steal the basic data types and standard library that the Bell
> Labs statisticians had pounded into shape over many years. I use
> Python now because using R for everything would drive me crazy, but
> despite its many flaws, it still does some things so well that it's
> become *the* language used for basically all statistical research. I'm
> only talking about stealing those things :-).
>
> -- Nathaniel

+1. Everyone knows R ain't perfect. I think it's an atrociously bad
programming language but it can be unbelievably good at statistics, as
evidenced by its success. That brings to mind a post on Andy Gelman's
blog last fall:

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department I've frequently been
disheartened when I see how easy many statistical things are in R and
how much more difficult they are in Python. This is partly the
result of poor interfaces for statistical modeling, partly due to
data structures (e.g. how thoroughly data.frame is integrated
throughout R), and partly due to things like the handling of missing
data, for which there is currently no equivalent in Python.
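
To make the two concepts from the quoted discussion concrete, here is
a rough sketch using only tools NumPy already has. It is not the
proposed np.NA design: NaN stands in for a "missing" value purely as
an assumption for illustration, and numpy.ma stands in for masking.
The variable names are made up for the example.

```python
import numpy as np

# Concept (1), missingness: the value itself is unknown.
# NaN is used here only as a stand-in for a true NA value.
obs = np.array([1.0, np.nan, 3.0, 4.0])
total = obs.sum()            # NaN: the unknown value propagates
mean_obs = np.nanmean(obs)   # skips the NaN explicitly

# Concept (2), masking: the data are all known, and different masks
# hide different elements of the same underlying array.
# np.ma.array does not copy `data` by default.
data = np.array([1.0, 2.0, 3.0, 4.0])
view_a = np.ma.array(data, mask=[False, True, False, False])
view_b = np.ma.array(data, mask=[False, False, False, True])
mean_a = view_a.mean()       # computed without data[1]
mean_b = view_b.mean()       # computed without data[3]

# Masks can be changed and removed in place; the data are untouched.
view_a[2] = np.ma.masked     # hide one more element
hidden = view_a.count()      # number of elements still visible
view_a.mask = np.ma.nomask   # drop the mask entirely
```

In the first case the missing value is gone for good; in the second,
unmasking brings the original numbers back, which is the distinction
Nathaniel draws between (1) and (2).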

- Wes