[Numpy-discussion] in the NA discussion, what can we agree on?

Thu Nov 3 02:27:20 EDT 2011

I also mentioned this at the bottom of a reply to Benjamin, but to
make sure people joining the thread see it: I went ahead and put this
up on a github wiki page that everyone should be able to edit
  https://github.com/njsmith/numpy/wiki/NA-discussion-status

We could move it to the numpy wiki or whatever if people prefer; this
just seemed like the easiest way to get something up there that
everyone would have write access to.

-- Nathaniel

On Wed, Nov 2, 2011 at 4:37 PM, Nathaniel Smith <njs at pobox.com> wrote:
> Hi again,
>
> Okay, here's my attempt at an *uncontroversial* email!
>
> Specifically, I think it'll be easier to talk about this NA stuff if
> we can establish some common ground, and easier for people to follow
> if the basic points of agreement are laid out in one place. So I'm
> going to try and summarize just the things that we can agree about.
>
> Note that right now I'm *only* talking about what kind of tools we
> want to give the user -- i.e., what kind of problems we are trying to
> solve. AFAICT we don't have as much consensus on implementation
> matters, and anyway it's hard to make implementation decisions without
> knowing what we're trying to accomplish.
>
> 1) I think we have consensus that there are (at least) two different
> possible ways of thinking about this problem, with somewhat different
> constituencies. Let's call these two concepts "MISSING data" and
> "IGNORED data".
>
> 2) I also think we have at least a rough consensus on what these
> concepts mean, and what their supporters want from them:
>
> MISSING data:
> - Conceptually, MISSINGness acts like a property of a datum --
> assigning MISSING to a location is like assigning any other value to
> that location
> - Ufuncs and other operations must propagate these values by default,
> and there must be an option to cause them to be ignored
> - Must be competitive with NaNs in terms of speed and memory usage (or
> else people will just use NaNs)
> - Compatibility with R is valuable
> - To avoid user confusion, ideally it should *not* be possible to
> 'unmask' a missing value, since this is inconsistent with the "missing
> value" metaphor (e.g., see Wes's comment about "leaky abstractions")
> - Possible useful extension: having different classes of missing
> values (similar to Stata)
> - Target audience: data analysis with missing data, neuroimaging,
> econometrics, former R users, ...
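
[A rough illustration of the MISSING semantics listed above, using NaN as a stand-in. This is only an analogy and not a proposed NA implementation: NaN works only for floating-point data, and `np.nansum` is just one existing way to opt in to skipping.]

```python
import numpy as np

# NaN stands in for MISSING here (an assumption made for illustration only).
data = np.array([1.0, 2.0, 3.0])

# Assigning MISSING to a location is like assigning any other value:
# the previous value (2.0) is overwritten and cannot be recovered.
data[1] = np.nan

# Default behaviour: the missing value propagates through the reduction.
propagated = np.sum(data)   # nan

# Opt-in behaviour: skip the missing value explicitly.
skipped = np.nansum(data)   # 4.0
```

Note that nothing here can be "unmasked": once the NaN is stored, the original 2.0 is gone, which matches the "missing value" metaphor.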
>
> IGNORED data:
> - Conceptually, IGNOREDness acts like a property of the array --
> toggling a location to be IGNORED is kind of vaguely similar to
> changing an array's shape
> - Ufuncs and other operations must ignore these values by default, and
> there doesn't really need to be a way to propagate them, even as an
> option (though it probably wouldn't hurt either)
> - Some memory overhead is inevitable and acceptable
> - Compatibility with R neither possible nor valuable
> - Ability to toggle the IGNORED state of a location is critical, and
> should be as convenient as possible
> - Possible useful extension: having not just different types of
> ignored values, but richer ways to combine them -- e.g., the example
> of combining astronomical images with some kind of associated
> per-pixel quality scores, where one might want the 'mask' to be not
> just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
> multi-byte integer) or even a float, and to allow these 'masks' to be
> combined in some more complex way than just logical_and.
> - Target audience: anyone who's already doing this kind of thing by
> hand using a second mask array + boolean indexing, former numpy.ma
> users, matplotlib, ...
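
[A rough illustration of the IGNORED semantics listed above, using the existing numpy.ma module. It is a sketch of the concept only, not a proposal for the new interface; the `quality` array and the 0.5 threshold are made up for the example.]

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# IGNOREDness lives next to the data as a separate mask (extra memory).
arr = np.ma.masked_array(values, mask=[False, True, False, False])

# Default behaviour: ignored locations are skipped by reductions.
default_sum = arr.sum()          # 1.0 + 3.0 + 4.0 = 8.0

# Toggling is cheap: ignore one more location...
arr[2] = np.ma.masked
after_ignore = arr.sum()         # 1.0 + 4.0 = 5.0

# ...and un-ignore an earlier one. The underlying value was never lost.
arr.mask[1] = False
after_unignore = arr.sum()       # 1.0 + 2.0 + 4.0 = 7.0

# A richer "mask": hypothetical per-pixel quality scores reduced to a
# boolean mask by thresholding (the threshold is an arbitrary assumption).
quality = np.array([0.9, 0.2, 0.7, 0.95])
by_quality = np.ma.masked_where(quality < 0.5, values)
quality_sum = by_quality.sum()   # 1.0 + 3.0 + 4.0 = 8.0
```

The key contrast with the MISSING case is that the data under an ignored location stays intact and can be brought back by flipping the mask.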
>
> 3) And perhaps we can all agree that the biggest *un*resolved question
> is whether we want to:
> - emphasize the similarities between these two use cases and build a
> single interface that can handle both concepts, with some compromises
> - or, treat these as two mostly-separate features that can each become
> exactly what the respective constituency wants without compromise --
> but with some potential redundancy and extra code.
> Each approach has advantages and disadvantages.
>
> Does that seem like a fair summary? Anything more we can add? Most
> importantly, anything here that you disagree with? Did I summarize
> your needs well? Do you have a use case that you feel doesn't fit
> naturally into either category?
>
> [Also, I thought this might make the start of a good wiki page for
> people to reference during these discussions, but I don't seem to have
> edit rights. If other people agree, maybe someone could put it up, or
> give me access? My trac id is njs at pobox.com.]
>
> Thanks,
> -- Nathaniel
>