[Numpy-discussion] missing data discussion round 2

Mon Jun 27 15:59:34 EDT 2011

On Mon, Jun 27, 2011 at 2:24 PM, eat <e.antero.tammi at gmail.com> wrote:
>
>
> On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>
>> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.tammi at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>>>
>>>> First I'd like to thank everyone for all the feedback you're providing.
>>>> Clearly this is an important topic to many people, and the discussion has
>>>> helped clarify the ideas for me. I've renamed and updated the NEP, then
>>>> placed it into the master NumPy repository so it has a more permanent home
>>>> here:
>>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>>>> In the NEP, I've tried to address everything that was raised in the
>>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
>>>> the issue of whether a mask is True or False for a missing value, I've
>>>> removed the 'mask' attribute entirely, except for ufunc-like functions
>>>> np.ismissing and np.isavail which return the two styles of masks. Here's a
>>>> high level summary of how I'm thinking of the topic, and what I will
>>>> implement:
>>>> Missing Data Abstraction
>>>> There appear to be two useful ways to think about missing data that are
>>>> worth supporting.
>>>> 1) Unknown yet existing data
>>>> 2) Data that doesn't exist
>>>> In 1), an NA value causes outputs to become NA except in a small number
>>>> of exceptions such as boolean logic, and in 2), operations treat the data as
>>>> if there were a smaller array without the NA values.
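A small illustration of the two abstractions may help here. This is only a sketch with today's NumPy: NaN stands in for NA (np.NA from the NEP is assumed not to be available), and numpy.ma plays the role of the mask.

```python
import numpy as np

# NaN stands in for NA; this is an assumption for illustration only.
x = np.array([1.0, np.nan, 3.0])

# 1) "Unknown yet existing data": the missing value propagates to the result.
propagated = np.sum(x)

# 2) "Data that doesn't exist": behave as if the array were the smaller [1.0, 3.0].
skipped_sum = np.nansum(x)
skipped_mean = np.nanmean(x)

# The same "smaller array" behaviour expressed with a mask instead of a bit pattern.
m = np.ma.masked_invalid(x)
masked_mean = m.mean()
```

With 1) the sum is nan, while with 2) the sum and mean are computed from the two remaining values only.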
>>>> Temporarily Ignoring Data
>>>> In some cases, it is useful to flag data as NA temporarily, possibly in
>>>> several different ways, for particular calculations or testing out different
>>>> ways of throwing away outliers. This is independent of the missing data
>>>> abstraction, still requiring a choice of 1) or 2) above.
>>>> Implementation Techniques
>>>> There are two mechanisms generally used to implement missing data
>>>> abstractions,
>>>> 1) An NA bit pattern
>>>> 2) A mask
>>>> I've described a design in the NEP which can include both techniques
>>>> using the same interface. The mask approach is strictly more general than
>>>> the NA bit pattern approach, except for a few things like the idea of
>>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
>>>> My intention is to implement the mask-based design, and possibly also
>>>> implement the NA bit pattern design, but if anything gets cut it will be the
>>>> NA bit patterns.
>>>> Thanks again for all your input so far, and thanks in advance for your
>>>> suggestions for improving this new revision of the NEP.
>>>
>>> A very impressive NEP indeed.
>
> Hi,
>>>
>>> However, how would corner cases, like
>>>
>>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>>> >>> np.mean(a, skipna=True)
>>
>> This should be equivalent to removing all the NA values, then calling
>> mean, like this:
>> >>> b = np.array([], dtype='f8')
>> >>> np.mean(b)
>>
>> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
>> RuntimeWarning: invalid value encountered in double_scalars
>>   return mean(axis, dtype, out)
>> nan
>>>
>>> >>> np.mean(a)
>>
>> This would return NA, since NA values are sitting in positions that would
>> affect the output result.
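For comparison, the same corner case can be tried with the existing numpy.ma module. This is just a sketch of current masked-array behaviour, not of the proposed np.NA design; the all-masked array is assumed to be a fair analogue of an all-NA array.

```python
import warnings
import numpy as np

# Every element is masked: the numpy.ma analogue of an all-NA array.
a = np.ma.masked_all(2, dtype='f8')

# numpy.ma skips masked values, so the mean of "nothing" comes back masked.
all_masked_mean = a.mean()
result_is_masked = np.ma.is_masked(all_masked_mean)

# Removing the missing values first leaves an empty array, whose mean is nan.
# NumPy emits a RuntimeWarning for this, which is silenced here.
b = a.compressed()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    empty_mean = np.mean(b)
```

So the "remove the NA values, then call mean" route ends in nan, while the masked array reports the result itself as missing.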
>
> OK.
>>
>>
>>>
>>> be handled?
>>> My concern here is that there always seems to be such corner cases which
>>> can only be handled with specific context knowledge. Thus producing 100%
>>> generic code to handle 'missing data' is not doable.
>>
>> Working out the corner cases for the functions that are already in numpy
>> seems tractable to me. How to support missing data, or whether to support
>> it at all, is something the author of each new function will have to
>> consider once missing data support is in NumPy, but I don't think we can
>> do more than provide the mechanisms for people to use.
>
> Sure. I'll go along with this and wait until I have something tangible
> that outperforms the 'traditional' NaN handling.
> - eat

Just a question about how things would work with the new model:
how can you implement the "use" keyword from R's cov (or cor) with
minimal data copying?

I think the basic masked array version would (or does) just assign 0
to the missing values, calculate the covariance or correlation, and then
correct the result using the actual count of non-missing values.
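
To make that arithmetic concrete, here is a minimal sketch of the zero-fill idea for the pairwise case, with NaN standing in for NA. The name pairwise_cov is made up for illustration, and this is not necessarily what numpy.ma.cov does internally.

```python
import numpy as np

def pairwise_cov(x):
    """Covariance between columns of x over pairwise-complete rows.

    NaN marks a missing value. Assumes every pair of columns shares at
    least two complete rows, so that n - 1 is never zero.
    """
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)              # True where a value is observed
    v = valid.astype(float)
    x0 = np.where(valid, x, 0.0)      # assign 0 to the missing values

    n = v.T @ v      # n[i, j]: rows where columns i and j are both observed
    sxy = x0.T @ x0  # sum of x_i * x_j over those rows (zeros drop the rest)
    sx = x0.T @ v    # sx[i, j]: sum of column i over rows where column j is observed
    # Correct with the pairwise counts rather than the total number of rows.
    return (sxy - sx * sx.T / n) / (n - 1)

x = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 5.0],
              [4.0, 9.0]])
c = pairwise_cov(x)
```

The zero-filled array and the float copy of the mask are the only extra full-size arrays; everything else is a small columns-by-columns matrix.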

------------
cov(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

cor(x, y = NULL, use = "everything",
     method = c("pearson", "kendall", "spearman"))

cov2cor(V)

Arguments
  x      a numeric vector, matrix or data frame.
  y      NULL (default) or a vector, matrix or data frame with compatible
         dimensions to x. The default is equivalent to y = x (but more
         efficient).
  na.rm  logical. Should missing values be removed?
  use    an optional character string giving a method for computing
         covariances in the presence of missing values. This must be (an
         abbreviation of) one of the strings "everything", "all.obs",
         "complete.obs", "na.or.complete", or "pairwise.complete.obs".
------------

I'm especially interested in the complete.obs case (drop any row that
contains an NA).
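
For reference, this is roughly how I would write the complete.obs case with plain NaN-based arrays today. It is a sketch only: NaN stands in for NA, the name cov_complete_obs is made up, and it says nothing about how the NEP's masked arrays would handle it.

```python
import numpy as np

def cov_complete_obs(x):
    """Covariance between columns of x using only rows without any NaN."""
    x = np.asarray(x, dtype=float)
    complete = ~np.isnan(x).any(axis=1)   # True for rows with no missing value
    # Boolean indexing makes one copy of the surviving rows.
    return np.cov(x[complete], rowvar=False)

x = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 5.0],
              [4.0, 9.0]])
c = cov_complete_obs(x)
```

The copy made by x[complete] is exactly the data copying I would like to avoid or at least minimize.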

Josef

>>
>> -Mark
>>
>>>
>>> Thanks,
>>> - eat
>>>>
>>>> -Mark
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>