[Numpy-discussion] NA/Missing Data Conference Call Summary

Wed Jul 6 13:54:15 EDT 2011

On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> Just for reference, I am using this as the latest version of the NEP -
> I hope it's current:
>
>
> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>
> I'm mostly relaying stuff I said, although generally (please do
> correct me if I am wrong) I am just re-expressing points that
> Nathaniel has already made in the alterNEP text and the emails.
>
> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
> ...
> > Since Mark is only around Austin until early August, there's
> > also broad agreement that we need to get something done quickly.
>
> I think I might have missed that part of the discussion :)
>
>
I think that might have been mentioned by Travis right before he had to
leave for another meeting, which might have been after you'd disconnected.
Travis' concern as a member of the numpy community is to get something
that is broadly applicable and adopted. But as Mark's employer, his concern
is to get more complete and coherent missing data functionality
implemented in numpy while Mark is still at Enthought, for use in the
problems Enthought and statisticians commonly encounter, if nothing else.

> I feel the need to emphasize the centrality of the assertion by
> Nathaniel, and agreement by (at least) me, that the NA case (there
> really is no data) and the IGNORE case (there is data but I'm
> concealing it from you) are conceptually different, and come from
> different use-cases.
>
> The underlying disagreement returned many times to this fundamental
> difference between the NEP and alterNEP:
>
> In the NEP - by design - it is impossible to distinguish between na.NA
> and na.IGNORE
> The alterNEP insists you should be able to distinguish.
>
> Mark says something like "it's all missing data, there's no reason you
> should want to distinguish".  Nathaniel and I were saying "the two
> types of missing do have different use-cases, and it should be
> possible to distinguish.  You might want to choose to treat them the
> same, but you should be able to see what they are."
>
> I returned several times to this (original point by Nathaniel):
>
> a[3] = np.NA
>
> (What does this mean?  Am I altering the underlying array, or a mask?
>  How would I explain this to someone?)
>
> We confirmed that, in order to make it difficult to know what your NA
> is (masked or bit-pattern), Mark has to a) hinder access to the data
> below the mask and b) prevent direct API access to the masking array.
> I described this as 'hobbling the API' and Mark thought of it as
> 'generic programming' (missing is always missing).
>
> I asserted that explaining NA to people would be easier if ``a[3] =
> np.NA`` was direct assignment and altered the array.
>
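A minimal sketch of the question behind ``a[3] = np.NA``, using tools that exist today rather than the proposed API. It assumes numpy.ma stands in for the mask implementation and NaN stands in for a bit-pattern NA; neither is what the NEP would actually ship.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Mask-style missing value: only the mask changes, the payload survives.
masked = np.ma.array(values)
masked[3] = np.ma.masked
mask_flags = np.ma.getmaskarray(masked)  # True only at index 3
hidden_value = masked.data[3]            # the 4.0 is still underneath

# Bit-pattern-style missing value: the sentinel overwrites the payload.
# Assumption: NaN plays the role of the NA bit pattern here.
sentinel = np.array(values)
sentinel[3] = np.nan
payload_lost = bool(np.isnan(sentinel[3]))  # the 4.0 is gone

print(mask_flags, hidden_value, payload_lost)
```

Whether the user should be able to see which of these two things happened is the point the NEP and alterNEP disagree on.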
> > BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
> > -----------------------------------------
> > The current NEP proposes both mask and bit pattern implementations for
> > missing data. I use the terms bit pattern and parameterized dtype
> > interchangeably, since the parameterized dtype will use a bit pattern for
> > its implementation. The two implementations will support the same
> > functionality with respect to NA, and the implementation details will be
> > largely invisible to the user. Their differences are in the 'extra'
> > features each supports.
> >
> > Two common questions were:
> > 1. Why make two implementations of missing data: one with masks and the
> > other with parameterized dtypes?
> > 2. Why does the implementation using masks have higher priority?
> > The answers are:
> > 1.  The mask implementation is more general and easier to implement and
> > maintain.  The bit pattern implementation saves memory, makes
> > interoperability easier, and makes ABI (Application Binary Interface)
> > compatibility easier. Since each has different strengths, the argument is
> > that both should be implemented.
> > 2. The implementation for the parameterized dtypes will rely on the
> > implementation using a mask.
> >
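To put a rough number on the memory argument, here is a small sketch. It assumes a one-byte boolean mask per element (as numpy.ma uses) and NaN as a stand-in bit pattern; the NEP's real storage layout may differ.

```python
import numpy as np

n = 1_000_000
data = np.zeros(n, dtype=np.float64)

# Mask approach: the data plus one extra boolean (1 byte) per element.
masked = np.ma.array(data, mask=np.zeros(n, dtype=bool))
mask_bytes = masked.data.nbytes + np.ma.getmaskarray(masked).nbytes

# Bit-pattern approach: NA lives inside the value itself,
# so nothing is stored beyond the data buffer.
pattern_bytes = data.nbytes

print(mask_bytes, pattern_bytes)
```

For float64 that is 9 bytes per element against 8; the relative overhead grows for smaller dtypes.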
> > NA VS. IGNORE
> > ---------------------------------
> > A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in
> > the aNEP sense and NA in the NEP sense. With NA, there is a clear notion
> > of how NA propagates through all basic numpy operations (e.g., 3+NA=NA
> > and log(NA) = NA, while NA | True = True). IGNORE is separate from NA,
> > with different interpretations depending on the use case.
> > IGNORE could mean:
> > 1. Data that is being temporarily ignored, e.g., a possible outlier that
> > is temporarily being removed from consideration.
> > 2. Data that cannot exist, e.g., a matrix representing a grid of water
> > depths for a lake. Since the lake isn't square, some entries will
> > represent land, and so depth will be a meaningless concept for those
> > entries.
> > 3. Using IGNORE to signal a jagged array, e.g., [ [1, 2, IGNORE],
> > [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ],
> > though this leaves open how [1, 2, IGNORE] + [3, 4] should behave.
> > Because of these different uses of IGNORE, it doesn't have as clear a
> > theoretical interpretation as NA. (For instance, what is IGNORE+3,
> > IGNORE*3, or IGNORE | True?)
>
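The NA propagation rules quoted above can be sketched in plain Python. NAType, NA and na_log are hypothetical names made up for this illustration; they are not part of numpy or of either proposal.

```python
import math


class NAType:
    """Hypothetical NA sentinel: 'a value exists but is unknown'."""

    def __repr__(self):
        return "NA"

    def _propagate(self, other):
        # Arithmetic with an unknown value gives an unknown value.
        return self

    __add__ = __radd__ = __mul__ = __rmul__ = _propagate

    def __or__(self, other):
        # True | x is True whatever x is, so the unknown does not matter.
        return True if other is True else self

    __ror__ = __or__


NA = NAType()


def na_log(x):
    """log that propagates NA (hypothetical helper)."""
    return NA if x is NA else math.log(x)


results = (3 + NA, na_log(NA), NA | True, NA | False)
print(results)
```

The last entry shows the other half of the logic rule: NA | False stays NA, because the answer depends on the unknown value.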
> I don't remember this bit of the discussion, but I see from current
> masked arrays that IGNORE is treated as the identity, so:
>
> IGNORE + 3 = 3
> IGNORE * 3 = 3
>
>
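For reference, a short check of what numpy.ma does today. As far as I can tell, "treated as the identity" describes reductions (a masked element contributes 0 to a sum and 1 to a product), while element-wise operations leave the masked slot masked.

```python
import numpy as np

x = np.ma.array([3, 5], mask=[False, True])  # second element is masked

total = x.sum()     # masked element is skipped, as if it were 0
product = x.prod()  # masked element is skipped, as if it were 1
shifted = x + 3     # element-wise: the masked slot stays masked

print(total, product, shifted)
```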
I'd mentioned at the top of my summary that some of the concrete examples
weren't talked about, even though the ideas were. And the fact that IGNORE
doesn't have a computational model behind it was mentioned briefly, though
it wasn't expanded on.

If we follow those rules for IGNORE in all computations, we sometimes get
some weird output. For example:
[ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Here * is matrix multiply,
not * with broadcasting.) Or should that sort of operation throw an
error?
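The arithmetic behind that example, as a rough sketch. IGNORE, combine, matvec_identity and matvec_skip are hypothetical names for illustration only. The first function applies the identity rule to every multiply and add; the second drops any term that involves IGNORE, which is another plausible reading and gives a different answer.

```python
import operator

IGNORE = object()  # hypothetical sentinel, not a numpy object


def combine(a, b, op):
    """Apply op, treating IGNORE as the identity element."""
    if a is IGNORE:
        return b
    if b is IGNORE:
        return a
    return op(a, b)


def matvec_identity(matrix, vector):
    """Matrix-vector product where IGNORE * x = x and IGNORE + x = x."""
    rows = []
    for row in matrix:
        total = IGNORE
        for a, b in zip(row, vector):
            total = combine(total, combine(a, b, operator.mul), operator.add)
        rows.append(total)
    return rows


def matvec_skip(matrix, vector):
    """Matrix-vector product that drops every term containing IGNORE."""
    return [
        sum(a * b for a, b in zip(row, vector)
            if a is not IGNORE and b is not IGNORE)
        for row in matrix
    ]


identity_result = matvec_identity([[1, 2], [3, 4]], [IGNORE, 7])
skip_result = matvec_skip([[1, 2], [3, 4]], [IGNORE, 7])
print(identity_result, skip_result)
```

The identity rule gives [15, 31] and the skip rule gives [14, 28], which is one way of seeing that IGNORE has no single agreed computational model.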

> > But several of the discussants thought the use cases for IGNORE were very
> > compelling. Specifically, they wanted to be able to use IGNOREs and NAs
> > simultaneously while still being able to differentiate between them. So,
> > for example, being able to designate some data as IGNORE while still
> > being able to determine which data was NA but not IGNORE. The current
> > NEP does not allow for this directly.
>
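One way to picture using both at once, sketched with existing tools. The assumption here is that NaN marks NA ("no data") and a numpy.ma mask marks IGNORE ("data present but hidden"); this is an illustration of the use case, not the API of either proposal.

```python
import numpy as np

# Assumption: NaN marks NA, the mask marks IGNORE.
data = np.array([1.0, np.nan, 3.0, 4.0])
arr = np.ma.array(data, mask=[False, False, True, False])

is_ignore = np.ma.getmaskarray(arr)          # hidden on purpose
is_na = np.isnan(arr.data) & ~is_ignore      # genuinely absent
treat_alike = is_na | is_ignore              # caller opts in to merging them

print(is_na, is_ignore, treat_alike)
```

The two kinds stay separately visible, and treating them the same is a choice the caller makes explicitly.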
> I think we discovered that the current NEP is designed to prevent us
> distinguishing between these cases.
>
> I was asking what it was about the implementation (as opposed to the
> API) that influenced the decision to make masked and bit-pattern
> missing data appear to be identical.  I left the conversation before
> the end, but up until that point, had failed to understand.
>
> See you,
>
> Matthew