[Numpy-discussion] Missing data wrap-up and request for comments
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Wed May 9 15:44:24 EDT 2012
On 05/09/2012 06:46 PM, Travis Oliphant wrote:
> Hey all,
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate. I think they've
> done an amazing job at providing some context, articulating their views
> and suggesting ways forward in a mutually respectful manner. This is an
> exemplary collaboration and is at the core of why open source is valuable.
> The document is available here:
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward. I'm
> also reading the document incorporating my understanding of the history,
> of NumPy as well as all of the users I've met and interacted with which
> means I have my own perspective that is not necessarily incorporated
> into that document but informs my recommendations. I'm not sure we can
> reach full consensus on this. We are also well past time for moving
> forward with a resolution on this (perhaps we can all agree on that).
> I would like one more discussion thread where the technical discussion
> can take place. I will make a plea that we keep this discussion as free
> from logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as
> we can. I can't guarantee that I personally will succeed at that, but I
> can tell you that I will try. That's all I'm asking of anyone else. I
> recognize that there are a lot of other issues at play here besides
> *just* the technical questions, but we are not going to resolve every
> community issue in this technical thread.
> We need concrete proposals and so I will start with three. Please feel
> free to comment on these proposals or add your own during the
> discussion. I will stop paying attention to this thread next Wednesday
> (May 16th) (or earlier if the thread dies) and hope that by that time we
> can agree on a way forward. If we don't have agreement, then I will move
> forward with what I think is the right approach. I will either write the
> code myself or convince someone else to write it.
> In all cases, we have agreement that bit-pattern dtypes should be added
> to NumPy. We should work on these (int32, float64, complex64, str, bool)
> to start. So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
> My three proposals:
> * do nothing and leave things as is
> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work)
> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged. The
> array_interface keeps the masked array notions and the ufuncs keep the
> ability to handle arrays like ndmasked. Ideally, numpy.ma
> <http://numpy.ma> would be changed to use ndmasked objects as their core.
> For the record, I'm currently in favor of the third proposal. Feel free
> to comment on these proposals (or provide your own).
Bravo! NA-overview.rst was an excellent read. Thanks, Nathaniel and Mark!
The third proposal is certainly the best one from Cython's perspective;
and I imagine for those writing C extensions against the C API too.
Having PyType_Check fail for ndmasked is a very good way of making code
fail when it was not written to take masks into account.
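A rough sketch of that failure mode from the Python side (the ndmasked type itself is hypothetical; the point is simply that an exact type check rejects anything that is not a plain ndarray, instead of silently dropping the mask):

```python
import numpy as np

def column_means(arr):
    """Helper written without any notion of masks."""
    # An exact type check (the Python-level analogue of PyType_Check in C)
    # rejects subclasses and any separate masked type outright, so
    # mask-unaware code fails loudly rather than computing a wrong answer.
    if type(arr) is not np.ndarray:
        raise TypeError("plain ndarray required; masked arrays not supported")
    return arr.mean(axis=0)
```

For example, passing an `np.ma.MaskedArray` (an ndarray subclass) raises TypeError here, whereas an `isinstance` check would have let it through and quietly averaged over masked values.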
If it is in ndarray, we would also face some pressure to add support in
Cython; with ndmasked we avoid that too. The likely outcome is that we
won't ever support it either way, but then we'd need some big warning in
the docs, and it's better to avoid that. (I guess I'd be +0 on Mark
Florisson implementing it if it ends up in the core ndarray; I'd almost
certainly not do it myself.)
That covers Cython. My view as a NumPy user follows.
I'm a heavy user of masks, which are used to make data NA in the
statistical sense. The setting is that we have to mask out the radiation
coming from the Milky Way in full-sky images of the Cosmic Microwave
Background. There's data, but we know we can't trust it, so we make it
NA. But we also play around with different masks.
Today we keep the mask in a separate array, and to zero-mask we do
masked_data = data * mask
or
masked_data = data.copy()
masked_data[mask == 0] = np.nan  # soon np.NA
depending on the circumstances.
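Concretely, a minimal runnable version of the two idioms (note that the np.NA bit pattern proposed in the NEP never shipped, so NaN stands in for it here):

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])
mask = np.array([1, 1, 0, 1])  # 1 = trusted sample, 0 = masked out

# Idiom 1: zero-masking by multiplication.
zero_masked = data * mask

# Idiom 2: copy, then flag masked entries with NaN
# (standing in for the proposed np.NA).
nan_masked = data.copy()
nan_masked[mask == 0] = np.nan
```

Both keep the mask as an ordinary array with ordinary semantics, which is exactly the transparency the paragraph above is praising.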
Honestly, API-wise, this is as good as it gets for us. Nice and
transparent, no new semantics to learn in the special case of masks.
Now, this has performance issues: Lots of memory use, extra transfers
over the memory bus.
BUT, NumPy has that problem all over the place, even for "x + y + z"!
Solving it in the special case of masks, by making a new API, seems a
bit myopic to me.
IMO, that's much better solved at the fundamental level. As an example:
masked_data1 = data * mask1
masked_data2 = data * (mask1 | mask2)
masked_data3 = (x + y + z) * (mask1 & mask3)
This would create three "generator arrays" that would zero-mask the
arrays (and perform the three-term addition...) upon request. You could
slice the generator arrays as you wish, and by that slice the data and
the mask in one operation. Obviously this could handle NA-masking too.
You can probably do this today with Theano and numexpr, and I think
Travis mentioned that "generator arrays" are on his radar for core NumPy.
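A toy pure-Python sketch of the "generator array" idea (the class and its API are invented here purely for illustration): the expression is stored rather than evaluated, and slicing the deferred object slices data and mask in one operation, with memory only allocated on request.

```python
import numpy as np

class Deferred:
    """Toy deferred expression: holds data and mask, evaluates on demand.

    Illustration only; real systems like numexpr or Theano do this with
    compiled kernels and without intermediate temporaries.
    """
    def __init__(self, data, mask):
        self.data = data
        self.mask = mask

    def __getitem__(self, idx):
        # Slicing slices the data and the mask together, still deferred.
        return Deferred(self.data[idx], self.mask[idx])

    def evaluate(self):
        # Only here is memory for the result actually allocated.
        return self.data * self.mask

data = np.arange(8.0)
mask1 = data < 6
mask2 = data % 2 == 0

# Lazy analogue of masked_data2 = data * (mask1 | mask2):
masked2 = Deferred(data, mask1 | mask2)
window = masked2[2:5]        # slices data and mask in one go
result = window.evaluate()   # zero-masked values materialize only now
```

The NA-masking variant would differ only in what `evaluate` writes for masked elements; the deferral and one-step slicing are the same.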
Point is, as a user, I'm with Travis in having mask support go hide in
ndmasked; masks solve too much of a special case in a way that is too
intrusive.