[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 14:06:14 EDT 2011

On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett <matthew.brett at gmail.com>wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith <njs at pobox.com> wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that've being raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> >
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> >
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And It's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
>
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
>
>
Yes, bingo, you hit it right on the nose.  Essentially, 1) could be
considered the "hard mask", while 2) would be the "soft mask".  Everything
else is implementation details.

> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
>
> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA])
>
> 1) -> np.NA
> 2) -> 0.0
>
> ?
>

Actually, I have always considered this to be a bug.  Note that "np.sum([])"
also returns 0.0.  I think the reason why it has been returning zero instead
of NaN was because there wasn't a NaN-equivalent for integers.  This is
where I think a np.NA could best serve NumPy by providing a dtype-agnostic
way to represent missing or invalid data.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110625/1616f4ba/attachment.html>