[Numpy-discussion] missing data discussion round 2

Mark Wiebe mwwiebe at gmail.com
Mon Jun 27 22:04:30 EDT 2011


On Mon, Jun 27, 2011 at 7:07 PM, Keith Goodman <kwgoodman at gmail.com> wrote:

> On Mon, Jun 27, 2011 at 8:55 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > First I'd like to thank everyone for all the feedback you're providing,
> > clearly this is an important topic to many people, and the discussion has
> > helped clarify the ideas for me. I've renamed and updated the NEP, then
> > placed it into the master NumPy repository so it has a more permanent
> > home here:
> > https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
> > In the NEP, I've tried to address everything that was raised in the
> > original thread and in Nathaniel's followup 'Concepts' thread. To deal
> > with the issue of whether a mask is True or False for a missing value,
> > I've removed the 'mask' attribute entirely, except for ufunc-like
> > functions np.ismissing and np.isavail which return the two styles of
> > masks. Here's a high level summary of how I'm thinking of the topic, and
> > what I will implement:
> > Missing Data Abstraction
> > There appear to be two useful ways to think about missing data that are
> > worth supporting.
> > 1) Unknown yet existing data
> > 2) Data that doesn't exist
> > In 1), an NA value causes outputs to become NA except in a small number
> > of exceptions such as boolean logic, and in 2), operations treat the data
> > as if there were a smaller array without the NA values.
> > Temporarily Ignoring Data
> > In some cases, it is useful to flag data as NA temporarily, possibly in
> > several different ways, for particular calculations or testing out
> > different ways of throwing away outliers. This is independent of the
> > missing data abstraction, still requiring a choice of 1) or 2) above.
> > Implementation Techniques
> > There are two mechanisms generally used to implement missing data
> > abstractions,
> > 1) An NA bit pattern
> > 2) A mask
> > I've described a design in the NEP which can include both techniques
> > using the same interface. The mask approach is strictly more general than
> > the NA bit pattern approach, except for a few things like the idea of
> > supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
> > My intention is to implement the mask-based design, and possibly also
> > implement the NA bit pattern design, but if anything gets cut it will be
> > the NA bit patterns.
> > Thanks again for all your input so far, and thanks in advance for your
> > suggestions for improving this new revision of the NEP.
>
> I'm trying to understand this part of the missing data NEP:
>
> "While numpy.NA works to mask values, it does not itself have a dtype.
> This means that returning the numpy.NA singleton from an operation
> like 'arr[0]' would be throwing away the dtype, which is still
> valuable to retain, so 'arr[0]' will return a zero-dimensional array
> either with its value masked, or containing the NA bit pattern for the
> array's dtype."
>
> If I do something like this in Cython:
>
>    cdef np.float64_t ai
>    for i in range(n):
>        ai = a[i]
>        ...
>
> Then I need to specify the type of ai, say float64 as above.
>
> What happens when a[i] is np.NA? Is ai still a float64? If NA is a bit
> pattern taken from float64 then a[i] could be float64, but if it is a
> 0d array then it would not be float64 and I assume I would run into
> problems or have to cast.
>
> So what does all this mean for iterating over each element of an array
> in Cython or C? Would I need to check the mask of element i first and
> only assign to ai if the mask is True (meaning not missing)?
>

I'll have to add a mention of Cython to the NEP. What should happen in Cython
is the same thing that happens in Python: the abstractions described in the
NEP are followed precisely. Until the ability to work with missing values is
added to Cython, the above will not be possible. The type np.float64_t isn't
correct here; Cython will need to add its own types, such as np.nafloat64_t,
which it translates to and from by calling the appropriate NumPy APIs.

-Mark
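
For illustration only, here is a minimal sketch in plain Python (not Cython, and
not the NEP's API) of the element-wise loop Keith asks about, where the mask is
checked before each element is read. It also shows the two missing-data
abstractions from the summary above. The names NA, sum_propagate, sum_skip and
avail are hypothetical and exist only in this sketch; the mask convention
(True means available) follows Keith's question.

```python
# Hypothetical stand-in for a missing-value marker; this is NOT numpy.NA.
NA = object()


def sum_propagate(values, avail):
    """Abstraction 1: unknown yet existing data. Any NA makes the result NA."""
    total = 0.0
    for i in range(len(values)):
        if not avail[i]:
            # An unknown input means the sum itself is unknown.
            return NA
        ai = values[i]  # only read once the mask says the element is available
        total += ai
    return total


def sum_skip(values, avail):
    """Abstraction 2: data that doesn't exist. Act as if the array were smaller."""
    total = 0.0
    for i in range(len(values)):
        if avail[i]:
            total += values[i]
    return total


values = [1.0, 2.0, 4.0]
avail = [True, False, True]  # True means the element is available

propagated = sum_propagate(values, avail)  # the NA marker
skipped = sum_skip(values, avail)          # 1.0 + 4.0
```

In both loops the plain float variable only ever receives an available element,
which is why the typed assignment in the Cython snippet would need either a
mask check like this or an NA-aware type.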

