[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 13:05:38 EDT 2011

On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more
fundamental -- not just what operations are allowed, but what
operations people *expect* to be allowed. It seems like some of us
have been talking past each other a lot, where someone says "but
changing masks is the single most important feature!" and then someone
else says "what are you talking about? That doesn't even make sense".

> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you
actually do get 0. I don't think this is a coincidence; I think it's
because they're designed as coherent systems that are trying to solve
different problems. (Well, numpy.ma's "hardmask" idea seems inspired
by the missing-data concept rather than the temporary-mask concept,
but aside from that it seems pretty consistent in implementing option
2.)
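
To make the two behaviours concrete, here is a rough sketch using only
tools that exist in NumPy today. It assumes NaN as a stand-in for the
proposed np.NA (which is only a proposal in this thread), and it uses
numpy.ma and np.nansum to illustrate the "skip it" semantics of option
2. The variable names are just for illustration.

```python
import numpy as np

# NaN stands in for the proposed np.NA (an assumption for this sketch).
values = np.array([1.0, np.nan, 3.0])

# Option 1 flavour: the missing value propagates through the reduction.
propagated = np.sum(values)

# Option 2 flavour: the masked element is skipped by the reduction.
masked = np.ma.masked_invalid(values)
skipped = masked.sum()

# Skipping everything leaves an empty sum, which is 0.0.
empty_skip = np.nansum(np.array([np.nan, np.nan]))

print(propagated, skipped, empty_skip)
```

The point is only that "propagate" and "skip" give different answers
for the same input, so the choice has to be made at the interface
level.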

Here's another possible difference -- in (1), intuitively, missingness
is a property of the data, so the logical place to put information
about whether you can expect missing values is in the dtype, and to
enable missing values you need to make a new array with a new dtype.
(If we use a mask-based implementation, then
np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
to skip making a copy of the data -- I'm talking ONLY about the
interface here, not whether missing data has a different storage
format from non-missing data.)

In (2), the whole point is to use different masks with the same data,
so I'd argue masking should be a property of the array object rather
than the dtype, and the interface should logically allow masks to be
created, modified, and destroyed in place.
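
As a rough sketch of what (2) looks like in practice, here is the
existing numpy.ma interface putting two different masks over one
buffer and then changing one of them in place. The array names are
made up for the example, and this is only one way such an interface
could look.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked views over the same underlying buffer (copy=False).
view_a = np.ma.array(data, mask=[False, True, False, False], copy=False)
view_b = np.ma.array(data, mask=[False, False, False, True], copy=False)

sum_a = view_a.sum()  # skips data[1]
sum_b = view_b.sum()  # skips data[3]

# Change view_a's mask in place: nothing is hidden any more.
view_a.mask = [False, False, False, False]
sum_a_unmasked = view_a.sum()

# The underlying data was never touched by any of this.
shares = np.shares_memory(view_a.data, data)
print(sum_a, sum_b, sum_a_unmasked, shares)
```

Here the mask lives on the array object, not in the dtype, which is
why unmasking is a cheap in-place operation.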

They're both internally consistent, but I think we might have to make
a decision and stick to it.

> I agree it's good to separate the API from the implementation.   I
> think the implementation is also important because I care about memory
> and possibly speed.  But, that is a separate problem from the API...

Yes, absolutely memory and speed are important. But a really fast
solution to the wrong problem isn't so useful either :-).

-- Nathaniel