[Numpy-discussion] missing data discussion round 2

Mon Jun 27 13:32:52 EDT 2011

On Mon, Jun 27, 2011 at 8:18 PM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Mon, Jun 27, 2011 at 5:53 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Mon, Jun 27, 2011 at 9:55 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> >>
> >> First I'd like to thank everyone for all the feedback you're providing.
> >> Clearly this is an important topic to many people, and the discussion
> >> has helped clarify the ideas for me. I've renamed and updated the NEP,
> >> then placed it into the master NumPy repository so it has a more
> >> permanent home here:
> >> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
> >> In the NEP, I've tried to address everything that was raised in the
> >> original thread and in Nathaniel's followup 'Concepts' thread. To deal
> >> with the issue of whether a mask is True or False for a missing value,
> >> I've removed the 'mask' attribute entirely, except for ufunc-like
> >> functions np.ismissing and np.isavail which return the two styles of
> >> masks. Here's a high level summary of how I'm thinking of the topic,
> >> and what I will implement:
> >> Missing Data Abstraction
> >> There appear to be two useful ways to think about missing data that are
> >> worth supporting.
> >> 1) Unknown yet existing data
> >> 2) Data that doesn't exist
> >> In 1), an NA value causes outputs to become NA, with a small number of
> >> exceptions such as boolean logic, and in 2), operations treat the data
> >> as if there were a smaller array without the NA values.
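
The two abstractions quoted above can be roughly illustrated with functions
NumPy already has. This is only a sketch: it assumes NaN as a stand-in for
NA and is not the interface proposed in the NEP.

```python
import numpy as np

# Assumption: NaN plays the role of NA for this illustration.
a = np.array([1.0, np.nan, 3.0])

# 1) "Unknown yet existing data": the missing value propagates to the result.
propagated = a.sum()

# 2) "Data that doesn't exist": reduce as if the array were just [1.0, 3.0].
skipped_sum = np.nansum(a)
skipped_mean = np.nanmean(a)

print(propagated, skipped_sum, skipped_mean)
```

Under 1) the sum is itself unknown, while under 2) the sum and mean are those
of the two values that are present.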
> >> Temporarily Ignoring Data
> >> In some cases, it is useful to flag data as NA temporarily, possibly in
> >> several different ways, for particular calculations or testing out
> >> different ways of throwing away outliers. This is independent of the
> >> missing data abstraction, still requiring a choice of 1) or 2) above.
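
As a rough picture of temporarily ignoring data, the existing numpy.ma
module can try out different outlier cutoffs without touching the underlying
values. The data and cutoffs below are made up for illustration, and this
uses today's masked arrays rather than the design in the NEP.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 100.0])

# Two trial ways of throwing away outliers; `data` itself is not modified.
trial_a = np.ma.masked_greater(data, 50.0)   # hides only 100.0
trial_b = np.ma.masked_greater(data, 2.5)    # hides 3.0 and 100.0

mean_a = trial_a.mean()   # mean over [1.0, 2.0, 3.0]
mean_b = trial_b.mean()   # mean over [1.0, 2.0]

print(mean_a, mean_b, data)
```

Note that numpy.ma reductions skip masked elements, which corresponds to
choice 2) above.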
> >> Implementation Techniques
> >> There are two mechanisms generally used to implement missing data
> >> abstractions:
> >> 1) An NA bit pattern
> >> 2) A mask
> >> I've described a design in the NEP which can include both techniques
> >> using the same interface. The mask approach is strictly more general
> >> than the NA bit pattern approach, except for a few things like the idea
> >> of supporting the dtype 'NA[f8,InfNan]' which you can read about in the
> >> NEP.
> >> My intention is to implement the mask-based design, and possibly also
> >> implement the NA bit pattern design, but if anything gets cut it will
> >> be the NA bit patterns.
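
The difference between the two mechanisms can be sketched with plain arrays.
The sentinel INT_NA below is a hypothetical choice made for this example,
not something the NEP specifies.

```python
import numpy as np

# 1) Bit-pattern style: reserve one value of the dtype to mean "missing".
#    Floats can use NaN; integers have to give up a value from their range.
INT_NA = np.iinfo(np.int32).min          # hypothetical sentinel
ints = np.array([10, INT_NA, 30], dtype=np.int32)
missing_from_pattern = ints == INT_NA

# 2) Mask style: a parallel boolean array marks the missing elements,
#    so every int32 value remains usable as data.
values = np.array([10, 0, 30], dtype=np.int32)
missing_from_mask = np.array([False, True, False])

total_pattern = ints[~missing_from_pattern].sum()
total_mask = values[~missing_from_mask].sum()

print(total_pattern, total_mask)
```

The bit pattern needs no extra storage but costs one representable value per
dtype; the mask works for any dtype at the price of a second array.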
> >
> > I have the impression that the mask-based design would be easier. Perhaps
> > you could do that one first and folks could try out the API and see how
> > they like it and discover whether the memory overhead is a problem in
> > practice.
>
> That seems like a risky strategy to me, as the most likely outcome is
> that people worried about memory will avoid masked arrays because they
> know they use more memory.  The memory usage is predictable and we
> won't learn any more about it from use.  Most of us already know
> whether we're having to optimize code for memory.
>
> You won't get complaints, you'll just lose a group of users, who will,
> I suspect, stick to NaNs, unsatisfactory as they are.
>
+1

- eat
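
To put a number on the predictable memory cost Matthew mentions, here is a
back-of-the-envelope sketch. It assumes a mask stored as one byte per
element, as numpy.ma does today; the NEP's final layout could differ.

```python
import numpy as np

n = 1_000_000
data = np.zeros(n, dtype=np.float64)
mask = np.zeros(n, dtype=np.bool_)   # assumption: one byte per element

# Extra memory for the mask, as a fraction of the data itself.
overhead = mask.nbytes / data.nbytes

print(data.nbytes, mask.nbytes, overhead)
```

With the same one-byte assumption the relative cost grows as the element
size shrinks: 25% for float32 and 100% for int8.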

>
> See you,
>
> Matthew