[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 15:51:52 EDT 2011

On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root <ben.root at ou.edu> wrote:
> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> I guess that is a difference, but I'm trying to get at something more
>> fundamental -- not just what operations are allowed, but what
>> operations people *expect* to be allowed.
>
> That is a much trickier problem.

It can be. I think of it as the difference between design and coding.
They overlap less than one might expect...

>> Here's another possible difference -- in (1), intuitively, missingness
>> is a property of the data, so the logical place to put information
>> about whether you can expect missing values is in the dtype, and to
>> enable missing values you need to make a new array with a new dtype.
>> (If we use a mask-based implementation, then
>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
>> to skip making a copy of the data -- I'm talking ONLY about the
>> interface here, not whether missing data has a different storage
>> format from non-missing data.)
>>
>> In (2), the whole point is to use different masks with the same data,
>> so I'd argue masking should be a property of the array object rather
>> than the dtype, and the interface should logically allow masks to be
>> created, modified, and destroyed in place.
>>
>
> I can agree with this distinction.  However, if "missingness" is an
> intrinsic property of the data, then shouldn't users be implementing their
> own dtype tailored to the data they are using?  In other words, how far does
> the core of NumPy need to go to address this issue?  And how far would be
> "too much"?

Yes, that's exactly my question: whether our goal is to implement
missingness in numpy or not!
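
To make interface (2) concrete, here is a minimal sketch using the
existing numpy.ma module. It only illustrates "different masks on the
same data, changed in place"; it is not a proposal for how a new
design should look, and the variable names are made up for the
example. Interface (1) has no equivalent in NumPy today, so it is not
shown.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked arrays over the same underlying buffer, each with its own mask.
# np.ma.array does not copy an existing ndarray by default.
a = np.ma.array(data, mask=[False, True, False, False])
b = np.ma.array(data, mask=[False, False, False, True])
shares_buffer = np.shares_memory(a.data, data) and np.shares_memory(b.data, data)

sum_a = a.sum()  # ignores element 1
sum_b = b.sum()  # ignores element 3

# The mask is a property of the array object: it can be replaced in place,
# and the hidden value reappears because the data was never touched.
a.mask = [False, False, False, False]
sum_a_unmasked = a.sum()
```

The point of the sketch is that unmasking recovers the original value,
which is exactly what one would not expect if the value were missing
in the sense of (1).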

>>
>> They're both internally consistent, but I think we might have to make
>> a decision and stick to it.
>>
>
> Of course.  I think that Mark is having a very inspired idea of giving the R
> audience what they want (np.NA), while simultaneously making the use of
> masked arrays even easier (which I can certainly appreciate).

I don't know. I think we could build a really top-notch implementation
of missingness. I also think we could build a really top-notch
implementation of masking. But my suggestions for how to improve the
current design are totally different depending on which of those is
the goal, and neither the R audience (like me) nor the masked array
audience (like you) seems really happy with the current design. And I
don't know what the goal is -- maybe it's something else and the
current design hits it perfectly? Maybe we want a top-notch
implementation of *both* missingness and masking, and those should be
two different things that can be combined, so that some of the
unmasked values inside a masked array can be NA? I don't know.
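
As a rough illustration of that last possibility, here is a sketch
that combines the two ideas with tools that exist today. It assumes
NaN as a stand-in for NA (which only works for floats and is not the
np.NA under discussion) and uses numpy.ma for the mask.

```python
import numpy as np

# Element 1 is "missing" (NaN standing in for NA, an assumption of this
# sketch); element 3 has a real value that is only hidden by the mask.
values = np.array([1.0, np.nan, 3.0, 4.0])
m = np.ma.array(values, mask=[False, False, False, True])

# The masked element is ignored, but the unmasked missing value propagates.
total = m.sum()

# Explicitly skipping missing values among the unmasked elements.
total_skipna = np.nansum(m.compressed())

# Removing the mask brings 4.0 back; the missing value stays missing.
m.mask = [False, False, False, False]
total_skipna_unmasked = np.nansum(m.compressed())
```

Here the mask and the missing value behave differently under the same
reduction, which is the kind of distinction a combined design would
have to specify.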

> I will put out a little disclaimer.  I once had to use S+ for a class.  To
> be honest, it was the worst programming experience in my life.  This
> experience may be coloring my perception of R's approach to handling missing
> data.

There are a lot of things that R does wrong (not its designers' fault;
language design is an extremely difficult and specialized skill that
statisticians are not exactly trained in), but it did make a few
excellent choices at the beginning. One was to steal the execution
model from Scheme, which, uh, isn't really relevant here. The other
was to steal the basic data types and standard library that the Bell
Labs statisticians had pounded into shape over many years. I use
Python now because using R for everything would drive me crazy, but
despite its many flaws, it still does some things so well that it's
become *the* language used for basically all statistical research. I'm
only talking about stealing those things :-).

-- Nathaniel