[Numpy-discussion] missing data discussion round 2

eat e.antero.tammi at gmail.com
Mon Jun 27 14:24:31 EDT 2011


On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:

> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.tammi at gmail.com> wrote:
>
>> Hi,
>>
>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>
>>> First I'd like to thank everyone for all the feedback you're providing.
>>> Clearly this is an important topic to many people, and the discussion has
>>> helped clarify the ideas for me. I've renamed and updated the NEP, then
>>> placed it into the master NumPy repository so it has a more permanent home
>>> here:
>>>
>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>>>
>>> In the NEP, I've tried to address everything that was raised in the
>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
>>> the issue of whether a mask is True or False for a missing value, I've
>>> removed the 'mask' attribute entirely, except for ufunc-like functions
>>> np.ismissing and np.isavail which return the two styles of masks. Here's a
>>> high level summary of how I'm thinking of the topic, and what I will
>>> implement:
>>>
>>> *Missing Data Abstraction*
>>>
>>> There appear to be two useful ways to think about missing data that are
>>> worth supporting.
>>>
>>> 1) Unknown yet existing data
>>> 2) Data that doesn't exist
>>>
>>> In 1), an NA value causes outputs to become NA except in a small number
>>> of exceptions such as boolean logic, and in 2), operations treat the data as
>>> if there were a smaller array without the NA values.
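The two abstractions can be roughly approximated with what NumPy ships today. This is only an analogy, not the NEP's design: np.NA and skipna are proposals, so NaN stands in for "unknown yet existing" data (it propagates) and numpy.ma stands in for "data that doesn't exist" (masked entries are skipped).

```python
import numpy as np

# Assumption: NaN plays the role of NA, since np.NA is only a proposal.
values = np.array([1.0, np.nan, 3.0])

# 1) Unknown yet existing data: the missing entry makes the result missing.
propagated = values.sum()

# 2) Data that doesn't exist: behave as if the array were just [1.0, 3.0].
skipped = np.ma.masked_invalid(values).sum()

print(propagated, skipped)
```

The same input gives two different answers, which is why a choice between 1) and 2) has to be made per calculation.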
>>>
>>> *Temporarily Ignoring Data*
>>>
>>> In some cases, it is useful to flag data as NA temporarily, possibly in
>>> several different ways, for particular calculations or testing out different
>>> ways of throwing away outliers. This is independent of the missing data
>>> abstraction, still requiring a choice of 1) or 2) above.
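As a sketch of temporarily ignoring data, the existing numpy.ma module can already apply several alternative masks to one untouched array. The outlier rules and thresholds below are made up for illustration, and numpy.ma is a stand-in rather than the interface the NEP proposes.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])

# Two hypothetical outlier rules, each expressed as a mask (True = ignore).
too_far_from_median = np.abs(data - np.median(data)) > 10.0
above_cutoff = data > 2.75

# Each masked array wraps the same values; `data` itself is never modified.
mean_rule_a = np.ma.array(data, mask=too_far_from_median).mean()
mean_rule_b = np.ma.array(data, mask=above_cutoff).mean()

print(mean_rule_a, mean_rule_b)
```

Both results use the skipping behaviour of abstraction 2), because that is the only one numpy.ma reductions offer.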
>>>
>>> *Implementation Techniques*
>>>
>>> There are two mechanisms generally used to implement missing data
>>> abstractions,
>>>
>>> 1) An NA bit pattern
>>> 2) A mask
>>>
>>> I've described a design in the NEP which can include both techniques
>>> using the same interface. The mask approach is strictly more general than
>>> the NA bit pattern approach, except for a few things like the idea of
>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
>>>
>>> My intention is to implement the mask-based design, and possibly also
>>> implement the NA bit pattern design, but if anything gets cut it will be the
>>> NA bit patterns.
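One practical difference between the two mechanisms can be seen with current NumPy, again using NaN as an assumed stand-in for an NA bit pattern and numpy.ma as a stand-in for the proposed mask. A bit pattern overwrites the stored value, while a mask leaves it in place.

```python
import numpy as np

original = np.array([10.0, 20.0, 30.0])

# Bit-pattern style: a reserved value (NaN here) overwrites the payload.
bitpattern = original.copy()
bitpattern[1] = np.nan  # the 20.0 can no longer be recovered from this array

# Mask style: a separate boolean array flags the element; the payload stays.
masked = np.ma.array(original.copy(), mask=[False, True, False])
payload = masked.data[1]  # still 20.0 underneath the mask

print(bitpattern, masked, payload)
```

Integer dtypes have no NaN, so a bit-pattern design has to reserve one of the ordinary values there, whereas a mask works the same way for every dtype.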
>>>
>>> Thanks again for all your input so far, and thanks in advance for your
>>> suggestions for improving this new revision of the NEP.
>>>
>> A very impressive NEP indeed.
>>
> Hi,

>
>> However, how would corner cases, like
>>
>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>> >>> np.mean(a, skipna=True)
>>
> This should be equivalent to removing all the NA values, then calling
> mean, like this:
>
> >>> b = np.array([], dtype='f8')
> >>> np.mean(b)
> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
> RuntimeWarning: invalid value encountered in double_scalars
>   return mean(axis, dtype, out)
> nan
>
>> >>> np.mean(a)
>
> This would return NA, since NA values are sitting in positions that would
> affect the output result.
>
OK.
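For reference, the all-missing corner case can be tried with released NumPy. This is a sketch under assumptions: np.NA and the masked= and skipna= keywords are proposals only, so an all-NaN array and numpy.ma stand in for them here.

```python
import warnings

import numpy as np

# Assumption: an all-NaN array stands in for the all-NA array `a`.
a = np.array([np.nan, np.nan], dtype="f8")

# skipna emulated by hand: drop the missing entries, then take the mean.
kept = a[~np.isnan(a)]
with warnings.catch_warnings():
    # The mean of an empty array emits RuntimeWarnings; silence them here.
    warnings.simplefilter("ignore", RuntimeWarning)
    skipped_mean = np.mean(kept)

# Without skipping, the missing values propagate into the result.
propagated_mean = np.mean(a)

# numpy.ma reports an all-masked mean as the `masked` constant instead.
masked_mean = np.ma.masked_all(2, dtype="f8").mean()

print(skipped_mean, propagated_mean, masked_mean)
```

With NaN as the stand-in, the skipped and propagated results are both nan and cannot be told apart, while numpy.ma answers with its own masked constant.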

>
>
>> be handled?
>>
>> My concern here is that there always seems to be such corner cases which
>> can only be handled with specific context knowledge. Thus producing 100%
>> generic code to handle 'missing data' is not doable.
>>
>
> Working out the corner cases for the functions that are already in NumPy
> seems tractable to me. How, or whether, to support missing data is
> something the author of each new function will have to consider once missing
> data support is in NumPy, but I don't think we can do more than provide the
> mechanisms for people to use.
>
Sure. I'll go along with this and wait until I have something tangible that
outperforms the 'traditional' NaN handling.

- eat

>
> -Mark
>
>
>> Thanks,
>> - eat
>>
>>>
>>> -Mark
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>