[Numpy-discussion] alterNEP - was: missing data discussion round 2

Nathaniel Smith njs at pobox.com
Fri Jul 1 23:47:01 EDT 2011


On Fri, Jul 1, 2011 at 9:18 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> I am sorry, but that is NOT true - DON'T just lump everyone into this
> when they have clearly stated the opposite! Missing values are nothing
> special to me, just reality. There are many statistical applications
> where masking is extremely common, like outlier detection and flagging
> unusual observations (missing values are also a form of masking). It's
> just that you as a user have to do that yourself by creating and
> maintaining working variables.
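
(For concreteness, I read "creating and maintaining working
variables" as something like the following NumPy sketch -- the cutoff
and names here are purely illustrative:)

    import numpy as np

    x = np.array([1.2, 1.5, 9.9, 1.3, 1.4])
    # hand-maintained working variable: True = keep, False = flagged
    ok = np.abs(x - np.median(x)) < 3.0
    print(x[ok].mean())  # 1.35 -- statistics on unflagged values only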

Thanks for speaking up -- we all definitely want something that will
work as well as possible for everyone! I'm a little confused about
what you're saying, though -- I assume you mean that you're happy
with the NEP proposal for handling NA values[1], and that I
misrepresented you when I said that everyone doing statistics with
missing values had concerns about the NEP? If so, then my apologies.

[1] https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e41aa/doc/neps/c-masked-array.rst

> I really find that you are 'splitting hairs' in your arguments, as it
> really has to be up to the application how missing values and NaN are
> handled. I see no difference between a missing value and a NaN because
> in virtually all statistical applications, both of these are dropped.
> This is what SAS typically does, although certain procedures like FREQ
> allow you to treat missing values as 'valid'. R has slightly more
> flexibility since it differentiates missing values and NaN. R lets you
> decide how missing values are handled using arguments like na.rm or
> the na.fail, na.omit, na.exclude, and na.pass functions. But I think
> in the majority of cases (I'm not an R guru), R acts the same way: by
> default (which is how most people use R), it excludes missing values
> and NaNs.
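
(For NumPy readers, R's default-versus-na.rm distinction maps roughly
onto the following -- a sketch using NaN as the stand-in, since plain
ndarrays have no NA:)

    import numpy as np

    x = np.array([1.0, np.nan, 3.0])
    print(x.mean())                # nan -- like R's default na.rm=FALSE
    print(x[~np.isnan(x)].mean())  # 2.0 -- like na.rm=TRUE: drop, then reduce
    print(np.nansum(x))            # 4.0 -- a NaN-skipping reduction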

Is your point here that NA and NaN are pretty similar, so it's
splitting hairs to differentiate them? They are pretty similar, but
this is the justification I wrote for having both in the alterNEP
(https://gist.github.com/1056379):

"For floating point computations, NAs and NaNs have (almost?)
identical behavior. But they represent different things -- NaN an
invalid computation like 0/0, NA a value that is not available -- and
distinguishing between these things is useful because in some
situations they should be treated differently. (For example, an
imputation procedure should replace NAs with imputed values, but
probably should leave NaNs alone.) And anyway, we can't use NaNs for
integers, or strings, or booleans, so we need NA anyway, and once we
have NA support for all these types, we might as well support it for
floating point too for consistency."

Does that seem reasonable?
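
To make the integer point concrete, here's what happens today if you
try to use NaN as a missing-value marker outside of floats (the
values are just illustrative):

    import numpy as np

    a = np.array([1.0, np.nan, 3.0])  # floats: NaN can stand in for NA
    print(a.sum())                    # nan -- NaN propagates

    b = np.array([1, np.nan, 3])      # but mix NaN into integers...
    print(b.dtype)                    # float64 -- silently upcast; there
                                      # is no NaN for int/bool/string dtypes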

In any case, my arguments haven't really been about NA versus NaN --
everyone seems to agree that we want something like NA. In the NEP
proposal, there are two different versions of NA: one implemented
using special values (e.g., a special NaN bitpattern that means NA),
and one implemented using a secondary mask array. My argument has
been that for people who just want NAs, the mask version is redundant
and confusing; but it doesn't really help the people who want "masked
arrays" either, because it works so hard at staying compatible with
NAs that it can't give them the behavior they actually want
(unmasking, automatic skipping of NAs, etc.). So it doesn't really
work well for anybody.
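
For contrast, here's the kind of behavior the masked-array camp
wants, shown with today's numpy.ma (a sketch of the semantics, not
the NEP's proposed API):

    import numpy as np

    m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
    print(m.sum())     # 4.0 -- reductions skip masked entries automatically

    m.mask[1] = False  # unmasking: the original value was never destroyed
    print(m.sum())     # 6.0

    # An NA stored as a special bitpattern has no hidden value to recover,
    # which is why the two semantics pull in different directions.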

-- Nathaniel


