[Numpy-discussion] Missing data wrap-up and request for comments

Nathaniel Smith njs at pobox.com
Wed May 9 18:55:44 EDT 2012


On Wed, May 9, 2012 at 5:46 PM, Travis Oliphant <travis at continuum.io> wrote:
> Hey all,
>
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate.   I think they've
> done an amazing job at providing some context, articulating their views and
> suggesting ways forward in a mutually respectful manner.   This is an
> exemplary collaboration and is at the core of why open source is valuable.
>
> The document is available here:
>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward.   I'm also
> reading the document incorporating my understanding of the history of NumPy
> as well as all of the users I've met and interacted with which means I have
> my own perspective that is not necessarily incorporated into that document
> but informs my recommendations.    I'm not sure we can reach full consensus
> on this.     We are also well past time for moving forward with a resolution
> on this (perhaps we can all agree on that).

If we're talking about deciding what to do for the 1.7 release branch,
then I agree. Otherwise, I definitely don't. We really just don't
*know* what our users need with regard to mask-based storage versions
of missing data, so committing to something within a short time period
will just guarantee that we have to redo it all later.

[Edit: I see that you've clarified this in a follow-up email -- great!]

> We need concrete proposals and so I will start with three.   Please feel
> free to comment on these proposals or add your own during the discussion.
>  I will stop paying attention to this thread next Wednesday (May 16th) (or
> earlier if the thread dies) and hope that by that time we can agree on a way
> forward.  If we don't have agreement, then I will move forward with what I
> think is the right approach.   I will either write the code myself or
> convince someone else to write it.

Again, I'm assuming that what you mean here is that we can't and
shouldn't delay 1.7 indefinitely for this discussion to play out, so
you're proposing that we give ourselves a deadline of 1 week to decide
how to at least get the release unblocked. Let me know if I'm
misreading, though...

> In all cases, we have agreement that bit-pattern dtypes should be added to
> NumPy.      We should work on these (int32, float64, complex64, str, bool)
> to start.    So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
>
> My three proposals:
>
> * do nothing and leave things as is

In the context of 1.7, this seems like a non-starter at this point, at
least if we're going to move in the direction of making decisions by
consensus. It might well be that we'll decide that the current
NEP-like API is what we want (or that some compatible superset is).
But (as described in more detail in the NA-overview document), I think
there are still serious questions to work out about how and whether a
masked-storage/NA-semantics API is something we want as part of the
ndarray object at all. And Ralf with his release-manager hat says that
he doesn't want to release the current API unless we can guarantee
that some version of it will continue to be supported. To me that
suggests that this is off the table for 1.7.

> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work
> exactly)

I've been assuming something like a global variable, and some guards
added to all the top-level functions that take "maskna=" arguments, so
that it's impossible to construct an ndarray that has its "maskna"
flag set to True unless the flag has been toggled.
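
To make that concrete, here is a rough sketch of the kind of guard I
mean. Every name in it (_MASKNA_ENABLED, set_maskna_enabled,
guard_maskna, make_array) is made up for illustration and is not part
of NumPy; make_array is a stand-in for a top-level constructor and
returns a plain dict rather than an ndarray. It only shows the shape
of the idea, not how the real C-level check would look.

```python
import functools

# Hypothetical module-level switch; not an actual NumPy name.
_MASKNA_ENABLED = False


def set_maskna_enabled(enabled):
    """Toggle the (hypothetical) global switch and return its old value."""
    global _MASKNA_ENABLED
    previous = _MASKNA_ENABLED
    _MASKNA_ENABLED = bool(enabled)
    return previous


def guard_maskna(func):
    """Reject maskna=True unless the global switch has been turned on."""
    @functools.wraps(func)
    def wrapper(*args, maskna=False, **kwargs):
        if maskna and not _MASKNA_ENABLED:
            raise RuntimeError(
                "masked-NA support is disabled; "
                "call set_maskna_enabled(True) first"
            )
        return func(*args, maskna=maskna, **kwargs)
    return wrapper


@guard_maskna
def make_array(values, maskna=False):
    # Stand-in for a top-level constructor that accepts maskna=.
    return {"values": list(values), "maskna": bool(maskna)}
```

With the switch left at its default, make_array([1, 2]) works as usual
and make_array([1, 2], maskna=True) raises; after
set_maskna_enabled(True) the second call succeeds.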

As I said in NA-overview, I'd be fine with this in principle, but only
if we're certain we're okay with the ABI consequences. And we should
be clear on the goal -- if we just want to let people play with the
API, then there are other options, such as my little experiment:
  https://github.com/njsmith/numpyNEP
(This is certainly less robust, but it works, and is probably a much
easier base for modifications to test alternative APIs.) If the goal
is just to keep the code in master, then that's fine too, though it
has both costs and benefits. (An example of a cost is that its
presence may complicate adding bitpattern NA support.)

> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
> keeps the masked array notions and the ufuncs keep the ability to handle
> arrays like ndmasked.    Ideally, numpy.ma would be changed to use ndmasked
> objects as their core.

If we're talking about 1.7, then what kind of status do you propose
these new objects would have in 1.7? Regular feature, totally
experimental, something else?

My only objection to this proposal is that committing to this approach
seems premature. The existing masked array objects act quite
differently from numpy.ma, so why do you believe that they're a good
foundation for numpy.ma, and why will users want to switch to their
semantics over numpy.ma's semantics? These aren't rhetorical
questions; it seems like they must have concrete answers, but I don't
know what they are.

Cheers,
- Nathaniel


