[Numpy-discussion] Missing data again

Ralf Gommers ralf.gommers at googlemail.com
Tue Mar 6 16:14:00 EST 2012


On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant <travis at continuum.io>
> wrote:
> > Hi all,
>
> Hi Travis,
>
> Thanks for bringing this back up.
>
> Have you looked at the summary from the last thread?
>  https://github.com/njsmith/numpy/wiki/NA-discussion-status
>

Re-reading that summary and the main documents and threads linked from it,
I couldn't find either examples of statistical software that explicitly
treats missing and ignored data separately, or links to relevant literature.
Those would probably help the discussion a lot.

> The goal was to try and at least work out what points we all *could*
> agree on, to have some common footing for further discussion. I won't
> copy the whole thing here, but I'd summarize the state as:
>  -- It's pretty clear that there are two fairly different conceptual
> models/use cases in play here. For one of them (R-style "missing data"
> cases) it's pretty clear what the desired semantics would be. For the
> other (temporary "ignored values") there's still substantive
> disagreement.
>  -- We *haven't* yet established what we want numpy to actually support.
>
> IMHO the critical next step is this latter one -- maybe we want to
> fully support both use cases. Maybe it's really only one of them
> that's worth trying to support in the numpy core right now. Maybe it's
> just one of them, but it's worth doing so thoroughly that it should
> have multiple implementations. Or whatever.
>
> I fear that if we don't talk about these big picture questions and
> just wade directly back into round-and-round arguments about API
> details then we'll never get anywhere.
>
> [...]
> > Because it is slated to go into release 1.7, we need to re-visit the
> > masked array discussion again. The NEP process is the appropriate one
> > and I'm glad we are taking that route for these discussions. My goal is
> > to get consensus in order for code to get into NumPy (regardless of who
> > writes the code). It may be that we don't come to a consensus
> > (reasonable and intelligent people can disagree on things --- look at the
> > coming election...). We can represent different parts of what is
> > fortunately a very large user-base of NumPy users.
> >
> > First of all, I want to be clear that I think there is much great work
> > that has been done in the current missing data code. There are some nice
> > features in the where clause of the ufunc and the machinery for the
> > iterator that allows re-using ufunc loops that are not re-written to check
> > for missing data. I'm sure there are other things as well that I'm not
> > quite aware of yet. However, I don't think the API presented to the
> > numpy user presently is the correct one for NumPy 1.X.
> >
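
As a rough illustration of the ufunc "where" clause mentioned above, here is a minimal sketch. The array names and values are made up for the example, and the `where=` argument on reductions assumes a reasonably recent NumPy; it is not meant to show how the missing data code uses the machinery internally.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Boolean selector: True means "apply the ufunc loop to this element".
valid = np.array([True, False, True, True])

# Elements where `valid` is False keep whatever `out` already holds,
# so `out` is initialised explicitly.
out = np.zeros_like(a)
np.add(a, b, out=out, where=valid)

# The same selector applied to a reduction (needs a newer NumPy).
total = np.add.reduce(a, where=valid)
```

The unmodified inner loop only ever sees the selected elements, which is the property that lets existing loops be reused without adding missing-data checks to them.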
> > A few particulars:
> >
> >        * the reduction operations need to default to "skipna" --- this
> > is the most common use case which has been re-inforced again to me today by
> > a new user to Python who is using masked arrays presently
>
> This is one of the points where the two conceptual models disagree
> (see also Skipper's point down-thread). If you have "missing data",
> then propagation has to be the default -- the sum of 1, 2, and
> I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
> but you've asked numpy to temporarily ignore it, then, well, duh, of
> course it should ignore it.
>
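
To make the two defaults concrete, here is a minimal sketch using only features of released NumPy. NaN stands in for a "missing" value and numpy.ma for "ignored" values; both are assumptions for illustration and neither is the NA API discussed in the NEP.

```python
import numpy as np

# "Missing data" model: NaN stands in for an unknown value.
missing = np.array([1.0, 2.0, np.nan])
propagated = np.sum(missing)   # the unknown value propagates into the result
skipped = np.nansum(missing)   # skipping has to be requested explicitly

# "Ignored values" model: 7.0 is still stored, but masked out for now.
ignored = np.ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
masked_total = ignored.sum()   # reductions skip masked entries by default
```

With these inputs the plain sum is NaN, while both the explicit skip and the masked sum give 3.0. The underlying value 7.0 remains available in `ignored.data`.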
> >        * the mask needs to be visible to the user if they use that
> > approach to missing data (people should be able to get a hold of the mask
> > and work with it in Python)
>
> This is also a point where the two conceptual models disagree.
>
> Actually this is one of the original arguments we made against the NEP
> design -- that if you want missing data, then having a mask at all is
> counterproductive, and if you are ignoring data, then of course it
> should be easy to manipulate the ignore mask. The rationale for the
> current design is to compromise between these two approaches -- there
> is a mask, but it's hidden behind a curtain. Mostly. (This may be a
> compromise in the Solomonic sense.)
>
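
For comparison, this is roughly what a fully visible mask looks like with the existing numpy.ma module. It is a sketch of the "ignored values" style of use, with made-up values, and not of the NEP's hidden mask.

```python
import numpy as np

values = np.ma.masked_array([1.0, 2.0, 3.0, 10.0],
                            mask=[False, True, False, False])

# The mask can be read as an ordinary boolean array (copied here so that
# later changes to `values` do not alter the snapshot).
initial_mask = np.ma.getmaskarray(values).copy()

# It can also be changed: ignore one more element...
values[2] = np.ma.masked
mean_two_ignored = values.mean()    # mean of 1.0 and 10.0

# ...and then un-ignore everything; the stored data was never lost.
values.mask = np.ma.nomask
mean_none_ignored = values.mean()   # mean of all four stored values
```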
> >        * bit-pattern approaches to missing data (at least for float64
> > and int32) need to be implemented.
> >
> >        * there should be some way when using "masks" (even if it's
> > hidden from most users) for missing data to separate the low-level ufunc
> > operation from the operation on the masks...
>
> I don't understand what this means.
>
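
For readers unfamiliar with the bit-pattern approach mentioned in the particulars above, here is a minimal sketch for int32. The sentinel choice mirrors R, which reserves the minimum integer for its integer NA; the names `INT32_NA` and `int32_sum` are hypothetical and exist only for this example.

```python
import numpy as np

# Hypothetical sentinel: one bit pattern of the dtype is reserved to mean
# "missing", so no separate mask array is stored.
INT32_NA = np.iinfo(np.int32).min

def int32_sum(values, skipna=False):
    """Sum an int32 array in which INT32_NA marks a missing element."""
    values = np.asarray(values, dtype=np.int32)
    is_na = values == INT32_NA
    if is_na.any() and not skipna:
        return INT32_NA                     # propagate: the result is unknown
    # Accumulate in int64 so the sum of valid elements cannot wrap around.
    return int(values[~is_na].sum(dtype=np.int64))
```

The cost of this design is that the reserved value can no longer be used as ordinary data, and every operation has to check for it. For float64 the usual choice is a NaN with a particular payload.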
> > I have heard from several users that they will *not use the missing
> > data* in NumPy as currently implemented, and I can now see why. For
> > better or for worse, my approach to software is generally very user-driven
> > and very pragmatic. On the other hand, I'm also a mathematician and
> > appreciate the cognitive compression that can come out of well-formed
> > structure. None-the-less, I'm an *applied* mathematician and am
> > ultimately motivated by applications.
> >
> > I will get a hold of the NEP and spend some time with it to discuss some
> > of this in that document. This will take several weeks (as PyCon is next
> > week and I have a tutorial I'm giving there). For now, I do not think
> > 1.7 can be released unless the masked array is labeled *experimental*.
>
> In project management terms, I see three options:
> 1) Put a big warning label on the functionality and leave it for now
> ("If this option is given, np.asarray returns a masked array. NOTE: IN
> THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
> WEASELS. NO GUARANTEES.")
>

I've opened http://projects.scipy.org/numpy/ticket/2072 for that. Assuming
we stick with this option, I'd appreciate it if you could check in the
first beta that comes out whether or not the warnings are obvious enough
and in all the right places. There probably won't be weasels though:)


> 2) Move the code back out of mainline and into a branch until
> there's consensus.
> 3) Hold up the release until this is all sorted.
>
> I come from the project-management school that says you should always
> have a releasable mainline, keep unready code in branches, and never
> hold up the release for features, so (2) seems obvious to me.


While it may sound obvious, I hope you've understood why in practice it's
not at all obvious and why you got such strong reactions to your proposal
of taking out all that code. If not, just look at what happened with the
numpy-refactor work.

> But I seem to be very much in the minority on that[1], so oh well :-). I
> don't have any objection to (1), personally. (3) seems like a bad
> idea. Just my 2 pence.
>

Agreed that (3) is a bad idea. +1 for (1).

Ralf