[Numpy-discussion] NA/Missing Data Conference Call Summary

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Wed Jul 6 08:31:44 EDT 2011


On 07/06/2011 02:27 PM, Dag Sverre Seljebotn wrote:
> On 07/06/2011 02:05 PM, Matthew Brett wrote:
>> Hi,
>>
>> Just for reference, I am using this as the latest version of the NEP -
>> I hope it's current:
>>
>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>
>> I'm mostly relaying stuff I said, although generally (please do
>> correct me if I am wrong) I am just re-expressing points that
>> Nathaniel has already made in the alterNEP text and the emails.
>>
>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>> <cjordan1 at uw.edu>   wrote:
>> ...
>>> Since Mark is only around Austin until early August, there's
>>> also broad agreement that we need to get something done quickly.
>>
>> I think I might have missed that part of the discussion :)
>>
>> I feel the need to emphasize the centrality of the assertion by
>> Nathaniel, and agreement by (at least) me, that the NA case (there
>> really is no data) and the IGNORE case (there is data but I'm
>> concealing it from you) are conceptually different, and come from
>> different use-cases.
>>
>> The underlying disagreement returned many times to this fundamental
>> difference between the NEP and alterNEP:
>>
>> In the NEP - by design - it is impossible to distinguish between na.NA
>> and na.IGNORE
>> The alterNEP insists you should be able to distinguish.
>>
>> Mark says something like "it's all missing data, there's no reason you
>> should want to distinguish".  Nathaniel and I were saying "the two
>> types of missing do have different use-cases, and it should be
>> possible to distinguish.  You might want to choose to treat them the
>> same, but you should be able to see what they are.".
>>
>> I returned several times to this (original point by Nathaniel):
>>
>> a[3] = np.NA
>>
>> (what does this mean?   I am altering the underlying array, or a mask?
>>     How would I explain this to someone?)
>>
>> We confirmed that, in order to make it difficult to know what your NA
>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>> below the mask and b) prevent direct API access to the masking array.
>> I described this as 'hobbling the API' and Mark thought of it as
>> 'generic programming' (missing is always missing).
>
> Here's an HPC perspective...:
>
> If you, say, want to off-load array processing with a mask to some code
> running on a GPU, you really can't have the GPU go through some NumPy
> API. Or if you want to implement a masked array on a cluster with MPI,
> you similarly really, really want raw access.
>
> At least I feel that the transparency of NumPy is a huge part of its
> current success. Many more than me spend half their time in C/Fortran
> and half their time in Python.
>
> I tend to look at NumPy this way: Assuming you have some data in memory
> (possibly loaded by a C or Fortran library). (Almost) no matter how it
> is allocated, ordered, packed, aligned -- there's a way to find strides
> and dtypes to put a nice NumPy wrapper around it and use the memory from
> Python.
>
> So, my view on Mark's NEP was: With a reasonable amount of flexibility
> in how you decide to implement masking for your data, you can create a
> NumPy wrapper that will understand that. Whether your Fortran library
> exposes NAs in its 40GB buffer as bit patterns, or using a separate
> mask, both will work.
>
> And IMO Mark's NEP comes rather close to this, you just need an
> additional NEP later to give raw access to the implementation details,
> once those are settled :-)
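
(For anyone following along, the two layouts mentioned above look roughly
like this in today's NumPy. This is only a sketch: NaN and np.ma are
stand-ins I picked for illustration, not the API from the NEP or the
alterNEP.)

```python
import numpy as np

# Layout 1: NA stored as a bit pattern inside the data itself.
# NaN is used here as a stand-in for a dedicated NA bit pattern.
bitpattern = np.array([1.0, np.nan, 3.0, 4.0])

# Layout 2: the full data plus a separate boolean mask (True = missing).
values = np.array([1.0, 99.0, 3.0, 4.0])
mask = np.array([False, True, False, False])
masked = np.ma.masked_array(values, mask=mask)

# Both layouts skip the missing element when reducing.
mean_bitpattern = np.nanmean(bitpattern)
mean_masked = masked.mean()
print(mean_bitpattern, mean_masked)

# Only the masked layout still holds a value underneath the missing entry.
hidden_value = masked.data[1]
print(hidden_value)
```

With the mask there is still a value underneath the missing entry; with
the bit pattern there is not, which is the NA versus IGNORE distinction
from earlier in the thread.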

To be concrete, I'm thinking something like a custom extension to PEP 
3118, which could also allow efficient access from Cython without 
hard-coding Cython for NumPy (a GSoC project this summer will continue 
to move us away from the "np.ndarray[int]" syntax to a more generic 
"int[:]" that's less tied to NumPy).
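
A minimal sketch of what PEP 3118 already gives you, assuming plain NumPy
and the built-in memoryview. The bytearray stands in for memory owned by a
C or Fortran library, and pairing the mask with the data is exactly the
part that does not exist yet and that an extension would have to define:

```python
import numpy as np

# Stand-in for memory allocated elsewhere: room for 6 float64 values.
raw = bytearray(6 * 8)

# Wrap the memory without copying; shape and dtype describe the layout.
data = np.ndarray(shape=(2, 3), dtype=np.float64, buffer=raw)
data[...] = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# PEP 3118: a consumer can ask for format, shape and strides
# without calling any NumPy-specific API.
view = memoryview(data)
print(view.format, view.shape, view.strides)

# A separate mask is just a second, unrelated buffer today.
mask = np.array([[False, True, False], [False, False, True]])
mask_view = memoryview(mask)
print(mask_view.format, mask_view.shape, mask_view.strides)
```

Writes through the wrapper land directly in the original memory, so the
same buffers could be handed to C, Fortran or Cython code as they are.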

But first things first!

Dag Sverre


