[Numpy-discussion] NA/Missing Data Conference Call Summary

Wed Jul 6 08:27:53 EDT 2011

On 07/06/2011 02:05 PM, Matthew Brett wrote:
> Hi,
>
> Just for reference, I am using this as the latest version of the NEP -
> I hope it's current:
>
> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>
> I'm mostly relaying stuff I said, although generally (please do
> correct me if I am wrong) I am just re-expressing points that
> Nathaniel has already made in the alterNEP text and the emails.
>
> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
> <cjordan1 at uw.edu>  wrote:
> ...
>> Since Mark is only around Austin until early August, there's
>> also broad agreement that we need to get something done quickly.
>
> I think I might have missed that part of the discussion :)
>
> I feel the need to emphasize the centrality of the assertion by
> Nathaniel, and agreement by (at least) me, that the NA case (there
> really is no data) and the IGNORE case (there is data but I'm
> concealing it from you) are conceptually different, and come from
> different use-cases.
>
> The underlying disagreement returned many times to this fundamental
> difference between the NEP and alterNEP:
>
> In the NEP - by design - it is impossible to distinguish between na.NA
> and na.IGNORE
> The alterNEP insists you should be able to distinguish.
>
> Mark says something like "it's all missing data, there's no reason you
> should want to distinguish".  Nathaniel and I were saying "the two
> types of missing do have different use-cases, and it should be
> possible to distinguish.  You might want to choose to treat them the
> same, but you should be able to see what they are.".
>
> I returned several times to this (original point by Nathaniel):
>
> a[3] = np.NA
>
> (what does this mean?   Am I altering the underlying array, or a mask?
>    How would I explain this to someone?)
>
> We confirmed that, in order to make it difficult to know what your NA
> is (masked or bit-pattern), Mark has to a) hinder access to the data
> below the mask and b) prevent direct API access to the masking array.
> I described this as 'hobbling the API' and Mark thought of it as
> 'generic programming' (missing is always missing).
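
To make the a[3] = np.NA question concrete, here is a minimal sketch
using what NumPy already ships. It assumes np.ma as a stand-in for the
masked (IGNORE) case and NaN as a stand-in for the bit-pattern (NA)
case; neither is the API proposed in the NEP or the alterNEP.

```python
import numpy as np

# IGNORE-style: a separate mask hides a value that still exists.
masked = np.ma.masked_array([10.0, 11.0, 12.0, 13.0])
masked[3] = np.ma.masked        # only the mask changes
still_there = masked.data[3]    # the value under the mask survives

# NA-style: the marker is written into the array itself.
inband = np.array([10.0, 11.0, 12.0, 13.0])
inband[3] = np.nan              # the stored 13.0 is overwritten
gone = np.isnan(inband[3])
```

The same-looking assignment either conceals data or destroys it, which
is the distinction being argued over.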

Here's an HPC perspective:

If you, say, want to off-load array processing with a mask to some code 
running on a GPU, you really can't have the GPU go through some NumPy 
API. Or if you want to implement a masked array on a cluster with MPI, 
you similarly really, really want raw access.

At least I feel that the transparency of NumPy is a huge part of its 
current success. Many people besides me spend half their time in 
C/Fortran and half their time in Python.

I tend to look at NumPy this way: Assume you have some data in memory 
(possibly loaded by a C or Fortran library). (Almost) no matter how it 
is allocated, ordered, packed, or aligned, there's a way to find strides 
and dtypes to put a nice NumPy wrapper around it and use the memory from 
Python.
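
As a rough illustration of that wrapping, assuming a plain bytearray
as a stand-in for memory owned by a C or Fortran library (the names
raw, flat and fortran_view are made up for this sketch):

```python
import numpy as np

# Stand-in for a buffer owned by a C or Fortran library: 12 doubles.
raw = bytearray(12 * 8)

# Zero-copy 1-D view; the dtype says how each element is packed.
flat = np.frombuffer(raw, dtype=np.float64)
flat[:] = np.arange(12)

# The same memory viewed as a 3x4 column-major (Fortran-ordered) matrix.
# Strides are in bytes: 8 to the next row, 24 to the next column.
fortran_view = np.ndarray(shape=(3, 4), dtype=np.float64,
                          buffer=raw, strides=(8, 24))

# Writing through the wrapper modifies the original buffer.
fortran_view[0, 1] = -1.0
```

No data is copied at any point; only the shape, strides and dtype
describe how the existing bytes should be read.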

So, my view on Mark's NEP was: With a reasonable amount of flexibility 
in how you decide to implement masking for your data, you can create a 
NumPy wrapper that will understand that. Whether your Fortran library 
exposes NAs in its 40GB buffer as bit patterns, or using a separate 
mask, both will work.
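
A small sketch of the "both will work" idea, again using today's np.ma
as a stand-in rather than anything from the NEP. It assumes NaN as the
in-band bit pattern and a boolean array as the separate mask:

```python
import numpy as np

# Case 1: the library marks missing values in-band, here as NaN.
bitpattern_data = np.array([1.0, np.nan, 3.0, np.nan])
from_bitpattern = np.ma.masked_invalid(bitpattern_data)

# Case 2: the library returns the values plus a separate boolean mask.
values = np.array([1.0, 99.0, 3.0, 42.0])
missing = np.array([False, True, False, True])
from_mask = np.ma.masked_array(values, mask=missing)

# Downstream code treats both wrappers the same way:
# the missing entries are skipped in the reduction.
total_a = from_bitpattern.sum()
total_b = from_mask.sum()
```

Code consuming either wrapper does not need to know which
representation the library chose.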

And IMO Mark's NEP comes rather close to this; you just need an 
additional NEP later to give raw access to the implementation details, 
once those are settled :-)

Dag Sverre