[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Thu Jun 23 20:00:38 EDT 2011

On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Thu, Jun 23, 2011 at 15:53, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> Enthought has asked me to look into the "missing data" problem and how NumPy
>> could treat it better. I've considered the different ideas of adding dtype
>> variants with a special signal value and masked arrays, and concluded that
>> adding masks to the core ndarray is the best way to deal with the
>> problem in general.
>> I've written a NEP that proposes a particular design, viewable here:
>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>> There are some questions at the bottom of the NEP which definitely need
>> discussion to find the best design choices. Please read, and let me know of
>> all the errors and gaps you find in the document.
>
> One thing that could use more explanation is how your proposal
> improves on the status quo, i.e. numpy.ma. As far as I can see, you
> are mostly just shuffling around the functionality that already
> exists. There has been a continual desire for something like R's NA
> values by people who are very familiar with both R and numpy's masked
> arrays. Both have their uses, and as Nathaniel points out, R's
> approach seems to be very well-liked by a lot of users. In essence,
> *that's* the "missing data problem" that you were charged with: making
> happy the users who are currently dissatisfied with masked arrays. It
> doesn't seem to me that moving the functionality from numpy.ma to
> numpy.ndarray resolves any of their issues.

Speaking as a user who's avoided numpy.ma, it wasn't actually because
of the behavior I pointed out (I never got far enough to notice it),
but because I got the distinct impression that it was a "second-class
citizen" in numpy-land. I don't know if that's true. But I wasn't sure
how solidly things like interactions between numpy and masked arrays
worked, and it seemed like it had more niche uses. So it just
seemed like more hassle than it was worth for my purposes. Moving it
into the core and making it really solid *would* address these
issues...

It does have to be solid, though. It occurs to me on further thought
that one major advantage of having first-class "NA" values is that it
preserves the standard looping idioms:

for i in range(len(x)):
    x[i] = np.log(x[i])

According to the current proposal, this will blow up, but np.log(x)
will work. That seems suboptimal to me.
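
To make the point about looping idioms concrete, here is a minimal sketch in which NaN stands in for a first-class "NA" value. That substitution is an assumption made purely for illustration: NaN is not the NA design under discussion, and the sketch says nothing about how the proposed masked ndarray would behave. It only shows that when missingness lives in the element itself, the element-wise loop and the vectorized call agree.

```python
import numpy as np

# NaN plays the role of a first-class "NA" here (an illustrative assumption).
x = np.array([1.0, np.nan, np.e])

# Vectorized form.
vectorized = np.log(x)

# Element-wise form: the missing marker travels with each scalar,
# so the ordinary loop needs no special handling.
looped = x.copy()
for i in range(len(looped)):
    looped[i] = np.log(looped[i])

# Both forms leave the missing slot missing and agree elsewhere.
print(vectorized)
print(looped)
```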

I do find the argument that we want a general solution compelling. I
suppose we could have a magic "NA" value in Python-land which
magically triggers fiddling with the mask when assigned to numpy
arrays.
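
For comparison, the existing numpy.ma module already has a sentinel that behaves roughly this way: assigning np.ma.masked to an element flips that element's mask instead of storing a value. The sketch below shows only that existing numpy.ma behaviour; whether a core-ndarray "NA" would work the same way is not something the proposal as quoted here settles.

```python
import numpy as np

x = np.ma.array([1.0, 2.0, 3.0])

# Assigning the sentinel touches the mask, not the stored data.
x[1] = np.ma.masked

mask_after = np.ma.getmaskarray(x).tolist()

# Reductions skip the masked element.
total = x.sum()

print(mask_after)
print(total)
```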

It should also be possible to accomplish a general solution at the
dtype level. We could have a 'dtype factory' used like:
  np.zeros(10, dtype=np.maybe(float))
where np.maybe(x) returns a new dtype whose storage size is x.itemsize
+ 1, where the extra byte is used to store missingness information.
(There might be some annoying alignment issues to deal with.) Then for
each ufunc we define a handler for the maybe dtype (or add a
special case to the ufunc dispatch machinery) that checks the
missingness value and then dispatches to the ordinary ufunc handler
for the wrapped dtype.
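
The idea above can be approximated today with a structured dtype, which gives a rough feel for the storage layout and the dispatch step. This is only a sketch under stated assumptions: np.maybe does not exist, so maybe() and apply_unary() below are hypothetical helpers, and the dispatch happens in Python rather than inside the ufunc machinery.

```python
import numpy as np

def maybe(base):
    """Hypothetical stand-in for np.maybe: the wrapped value plus one flag byte."""
    base = np.dtype(base)
    # Packed (unaligned) layout, so itemsize is base.itemsize + 1.
    return np.dtype([("value", base), ("missing", np.bool_)])

def apply_unary(ufunc, arr):
    """Hypothetical handler: check missingness, then call the ordinary ufunc."""
    out = np.zeros(arr.shape, dtype=arr.dtype)
    out["missing"] = arr["missing"]      # missingness propagates unchanged
    present = ~arr["missing"]
    # Only the non-missing values reach the wrapped dtype's ufunc loop.
    out["value"][present] = ufunc(arr["value"][present])
    return out

x = np.zeros(3, dtype=maybe(np.float64))
x["value"] = [1.0, -1.0, np.e]
x["missing"][1] = True                   # mark the middle element as missing

y = apply_unary(np.log, x)
print(maybe(np.float64).itemsize)
print(y)
```

The packed layout is where the alignment concern shows up: the float field of every element after the first sits at an unaligned offset. Passing align=True to np.dtype pads each element instead, at the cost of a larger itemsize.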

This would require fixing the issue where ufunc inner loops can't
actually access the dtype object, but we should fix that anyway :-).

-- Nathaniel