[Numpy-discussion] NA/Missing Data Conference Call Summary

Wed Jul 6 08:05:03 EDT 2011

Hi,

Just for reference, I am using this as the latest version of the NEP -
I hope it's current:

https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst

I'm mostly relaying stuff I said, although generally (please do
correct me if I am wrong) I am just re-expressing points that
Nathaniel has already made in the alterNEP text and the emails.

On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
<cjordan1 at uw.edu> wrote:
...
> Since Mark is only around Austin until early August, there's
> also broad agreement that we need to get something done quickly.

I think I might have missed that part of the discussion :)

I feel the need to emphasize the centrality of the assertion by
Nathaniel, and agreement by (at least) me, that the NA case (there
really is no data) and the IGNORE case (there is data but I'm
concealing it from you) are conceptually different, and come from
different use-cases.

The underlying disagreement returned many times to this fundamental
difference between the NEP and alterNEP:

In the NEP - by design - it is impossible to distinguish between na.NA
and na.IGNORE.
The alterNEP insists you should be able to distinguish.

Mark says something like "it's all missing data, there's no reason you
should want to distinguish".  Nathaniel and I were saying "the two
types of missing do have different use-cases, and it should be
possible to distinguish.  You might want to choose to treat them the
same, but you should be able to see what they are."

I returned several times to this (original point by Nathaniel):

a[3] = np.NA

(What does this mean?  Am I altering the underlying array, or a mask?
  How would I explain this to someone?)

We confirmed that, in order to make it difficult to know what your NA
is (masked or bit-pattern), Mark has to a) hinder access to the data
below the mask and b) prevent direct API access to the masking array.
I described this as 'hobbling the API' and Mark thought of it as
'generic programming' (missing is always missing).

I asserted that explaining NA to people would be easier if ``a[3] =
np.NA`` was direct assignment and altered the array.
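
To make the two readings of that assignment concrete, here is a minimal
sketch using what exists in NumPy today. ``np.NA`` is only a proposal, so
``np.ma.masked`` stands in for the mask reading and ``np.nan`` stands in
for the bit-pattern reading; both stand-ins are assumptions made for
illustration, not the NEP's API.

```python
import numpy as np

# Mask reading: the assignment flips a flag; the stored value survives.
masked = np.ma.array([1.0, 2.0, 3.0, 4.0])
masked[3] = np.ma.masked
mask_after = masked.mask.tolist()   # flag set at index 3 only
value_under_mask = masked.data[3]   # the original 4.0 is still stored

# Bit-pattern reading: the assignment overwrites the stored value.
direct = np.array([1.0, 2.0, 3.0, 4.0])
direct[3] = np.nan                  # the original 4.0 is gone
```

In the first case the array is untouched and only the mask changes; in the
second the array itself is altered, which is the "direct assignment"
reading.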

> BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
> ------------------------------------------------------------------------------------------
> The current NEP proposes both mask and bit pattern implementations for
> missing data. I use the terms bit pattern and parameterized dtype
> interchangeably, since the parameterized dtype will use a bit pattern for
> its implementation. The two implementations will support the same
> functionality with respect to NA, and the implementation details will be
> largely invisible to the user. Their differences are in the 'extra' features
> each supports.
>
> Two common questions were:
> 1. Why make two implementations of missing data: one with masks and the
> other with parameterized dtypes?
> 2. Why does the implementation using masks have higher priority?
> The answers are:
> 1.  The mask implementation is more general and easier to implement and
> maintain.  The bit pattern implementation saves memory, makes
> interoperability easier, and makes ABI (Application Binary Interface)
> compatibility easier. Since each has different strengths, the argument is
> both should be implemented.
> 2. The implementation for the parameterized dtypes will rely on the
> implementation using a mask.
>
> NA VS. IGNORE
> ---------------------------------
> A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the
> aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA
> propagates through all basic numpy operations.  (e.g., 3+NA=NA and log(NA) =
> NA, while NA | True = True.) IGNORE is separate from NA, with different
> interpretations depending on the use case.
> IGNORE could mean:
> 1. Data that is being temporarily ignored. e.g., a possible outlier that is
> temporarily being removed from consideration.
> 2. Data that cannot exist. e.g., a matrix representing a grid of water
> depths for a lake. Since the lake isn't square, some entries will represent
> land, and so depth will be a meaningless concept for those entries.
> 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
> 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
> leaves open how [1, 2, IGNORE] + [3 , 4] should behave.
> Because of these different uses of IGNORE, it doesn't have as clear a
> theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
> or IGNORE | True?)
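
The NA propagation rules quoted above (3+NA=NA, log(NA)=NA, NA | True =
True) can be sketched in plain Python. The ``_NAType`` class, the ``NA``
name and the ``log`` wrapper are hypothetical, written only to illustrate
the rules; they are not part of NumPy or of either NEP.

```python
import math


class _NAType:
    """Hypothetical NA sentinel, for illustration only."""

    def __repr__(self):
        return "NA"

    def __add__(self, other):
        # An unknown value plus anything is still unknown.
        return self

    __radd__ = __add__

    def __or__(self, other):
        # True | x is True whatever x is, so the unknown cannot matter.
        return True if other is True else self

    __ror__ = __or__


NA = _NAType()


def log(x):
    # NA passes straight through a unary function.
    return NA if x is NA else math.log(x)
```

With this sketch ``3 + NA`` and ``log(NA)`` both give ``NA``, ``NA | True``
gives ``True``, and ``NA | False`` stays ``NA`` because the answer depends
on the unknown value.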

I don't remember this bit of the discussion, but I see from current
masked arrays that IGNORE is treated as the identity, so:

IGNORE + 3 = 3
IGNORE * 3 = 3
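
For reference, a small sketch of what ``numpy.ma`` does today, which is
the closest existing thing to IGNORE. As far as I can tell the
identity-like behaviour appears in reductions, where masked entries are
skipped; in element-wise arithmetic the entry stays masked. This shows
``numpy.ma`` only and is not a statement about what either NEP specifies.

```python
import numpy as np

values = np.ma.array([2.0, 5.0, 3.0], mask=[False, True, False])

# Reductions skip the masked entry, as if it were the identity element.
total = values.sum()      # 2.0 + 3.0
product = values.prod()   # 2.0 * 3.0

# Element-wise arithmetic keeps the entry masked.
shifted = values + 3
shifted_mask = shifted.mask.tolist()
```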

> But several of the discussants thought the use cases for IGNORE were very
> compelling. Specifically, they wanted to be able to use IGNORE's and NA's
> simultaneously while still being able to differentiate between them. So, for
> example, being able to designate some data as IGNORE while still able to
> determine which data was NA but not IGNORE. The current NEP does not allow
> for this directly.

I think we discovered that the current NEP is designed to prevent us
distinguishing between these cases.
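
To illustrate what being able to distinguish would look like, here is a
sketch with current tools, under the assumption that NaN stands in for NA
(there really is no data) and a ``numpy.ma`` mask stands in for IGNORE
(there is data but it is concealed). Neither stand-in is the NEP or
alterNEP API.

```python
import numpy as np

# Index 1 has no data at all (NA stand-in); index 2 has data but is hidden.
data = np.array([1.0, np.nan, 3.0, 4.0])
arr = np.ma.array(data, mask=[False, False, True, False])

is_ignored = np.ma.getmaskarray(arr)          # concealed entries
is_na = np.isnan(arr.data) & ~is_ignored      # genuinely missing entries
hidden_value = arr.data[2]                    # still recoverable: 3.0
```

Here the two kinds of missing can be treated the same when convenient
(``is_na | is_ignored``), but they can also be told apart.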

I was asking what it was about the implementation (as opposed to the
API) that influenced the decision to make masked and bit-pattern
missing data appear to be identical.  I left the conversation before
the end, but up until that point, had failed to understand.

See you,

Matthew