[Numpy-discussion] Removing masked arrays for 1.7? (Was 1.7 blockers)

Nathaniel Smith njs at pobox.com
Tue Apr 17 08:52:50 EDT 2012


On Tue, Apr 17, 2012 at 6:44 AM, Travis Oliphant <travis at continuum.io> wrote:
> Basically, there are two sets of changes as far as I understand right now:
>
>        1) ufunc infrastructure understands masked arrays
>        2) ndarray grew attributes to represent masked arrays
>
> I am proposing that we keep 1) but change 2) so that only certain kinds of NumPy arrays actually have the extra function pointers (effectively a sub-type).   In essence, what I'm proposing is that the NumPy 1.6 PyArrayObject become a base-object, but the other members of the C-structure are not even present unless the Masked flag is set.   Such changes would not require ripping code out --- just altering the presentation a bit.   Yet, they could have large long-term implications, that we should explore before they get fixed.
>
> Whether masked arrays should be a formal sub-class is actually an unrelated question and I generally lean in the direction of not encouraging sub-classes of the ndarray.   The big questions are: does this object work in the calculation infrastructure?   Can I add an array to a masked array?   Does it have a sum method?   I think it could be argued that a masked array does have an "is a" relationship with an array.   It can also be argued that it is better to have a "has a" relationship with an array and be its own object.   Either way, this object could still have its first part be binary compatible with a NumPy array, and that is what I'm really suggesting.

It sounds like the main implementation issue here is that this masked
array class needs some way to coordinate with the ufunc infrastructure
to efficiently and reliably handle the mask in calculations. The core
ufunc code now knows how to handle masks, and this functionality is
needed for where= and NA-dtypes, so obviously it's staying,
independent of what we decide to do with masked arrays. So the
question is just: how do we get the masked array and the ufuncs
talking to each other so they can do the right thing? Perhaps we
should focus, then, on how to create a better hooking mechanism for
ufuncs? Something along these lines?
  http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056945.html
If done in a solid enough way, this would also solve other problems,
e.g. we could make ufuncs work reliably on sparse matrices, which
seems to trip people up on scipy-user every month or two. Of course,
it's very tricky to get right :-(
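
To make the idea of a ufunc hook concrete, here is a rough sketch using
the __array_ufunc__ protocol, which NumPy only gained later (in 1.13),
so it is not what was on the table in this thread. The MaskedPair class,
its attribute names, and the True-means-masked convention are all
hypothetical, chosen just for illustration. It handles only plain
elementwise calls and computes on masked slots too, which a real
implementation would probably want to avoid.

```python
import functools

import numpy as np


class MaskedPair:
    """Hypothetical container: data plus a boolean mask (True = masked)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # This sketch only handles single-output elementwise calls
        # without extra keyword arguments (no out=, where=, reductions).
        if method != "__call__" or kwargs or ufunc.nout != 1:
            return NotImplemented
        # Unwrap our own objects, pass everything else through unchanged.
        datas = [x.data if isinstance(x, MaskedPair) else x for x in inputs]
        masks = [x.mask for x in inputs if isinstance(x, MaskedPair)]
        # A result element is masked if any contributing input was masked.
        mask = functools.reduce(np.logical_or, masks)
        out = ufunc(*datas)
        return MaskedPair(out, np.broadcast_to(mask, np.shape(out)).copy())


a = MaskedPair([1.0, 2.0, 3.0], [False, True, False])
r = np.add(a, np.array([10.0, 20.0, 30.0]))  # dispatches to the hook above
s = np.multiply(a, 2.0)
```

The point is only that the container, not ndarray itself, decides how
the mask combines; the same kind of hook is what would let a sparse
matrix type intercept ufunc calls.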

As for the masked array API: I'm still not convinced we know how we
want these things to behave. The masked arrays in master currently
implement MISSING semantics, but AFAICT everyone who wants MISSING
semantics prefers NA-dtypes or even plain old NaNs over a masked
implementation. And some of the current implementation's biggest
backers, like Chuck, have argued that they should switch to
skipNA=True, which is more of an IGNORED-style semantic. OTOH, there's
still disagreement over how IGNORED-style semantics should even work
(I'm thinking of that discussion about commutativity). The best existing
model is numpy.ma -- but the numpy.ma API is quite different from the
NEP, in more ways than just the default setting for skipNA. numpy.ma
uses the opposite convention for mask values, it has additional
concepts like the fill value, hardmask versus softmask, and then
there's the whole way the NEP uses views to manage the mask. And I
don't know which of these numpy.ma features are useful, which are
extraneous, and which are currently useful but will become extraneous
once the users who really wanted something more like NA-dtypes switch
to those.
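
For readers who haven't used numpy.ma, the following sketch shows the
features mentioned above as they exist in numpy.ma today: the mask
convention (True means masked), the fill value, and hard versus soft
masks. The NaN lines at the end are only a stand-in to contrast
propagating (MISSING-like) with skipping (IGNORED-like) behaviour; they
are not the NEP's NA implementation, and the variable names are
arbitrary.

```python
import numpy as np

# numpy.ma convention: mask=True marks an element as masked (invalid).
x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Reductions skip masked elements, i.e. IGNORED-style behaviour.
total_ma = x.sum()

# The fill value is substituted for masked elements on request.
filled = x.filled(-1.0)

# Soft mask (the default): assigning to a masked element unmasks it.
soft = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
soft[1] = 20.0

# Hard mask: assignment cannot unmask an element.
hard = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False],
                   hard_mask=True)
hard[1] = 20.0

# NaN as a rough stand-in for MISSING semantics: the unknown propagates
# through a reduction unless the caller explicitly asks to skip it.
y = np.array([1.0, np.nan, 3.0])
total_propagate = np.sum(y)
total_skip = np.nansum(y)
```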

So we all agree that masked arrays can be useful, and that numpy.ma
has problems. But straightforwardly porting numpy.ma to C doesn't seem
like it would help much, and neither does simply declaring that
numpy.ma has been deprecated in favour of a new NEP-like API.

So, I dunno. It seems like it might make the most sense to:
1) take the mask fields out of the core ndarray (while leaving the
rest of Mark's infrastructure, as per above)
2) make sure we have the hooks needed so that numpy.ma, and NEP-like
APIs, and whatever other experiments people want to try, can all
integrate well with ufuncs, and make any other extensions that are
generally useful and required so that they can work well
3) once we've experimented, move the winner into the core. Or whatever
else makes sense to do once we understand what we're trying to
accomplish.

-- Nathaniel


