[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary

Wed Jul 6 10:47:15 EDT 2011

Hi,

On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 07/06/2011 02:46 PM, Matthew Brett wrote:
>> Hi,
>>
>> Sorry, I hope you don't mind, I moved this to its own thread, trying
>> to separate comments on the NA debate from the discussion yesterday.
>
> I'm sorry.
>
>> On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 07/06/2011 02:05 PM, Matthew Brett wrote:
>>>> Hi,
>>>>
>>>> Just for reference, I am using this as the latest version of the NEP -
>>>> I hope it's current:
>>>>
>>>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>>>
>>>> I'm mostly relaying stuff I said, although generally (please do
>>>> correct me if I am wrong) I am just re-expressing points that
>>>> Nathaniel has already made in the alterNEP text and the emails.
>>>>
>>>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>>>> <cjordan1 at uw.edu> wrote:
>>>> ...
>>>>> Since Mark is only around Austin until early August, there's
>>>>> also broad agreement that we need to get something done quickly.
>>>>
>>>> I think I might have missed that part of the discussion :)
>>>>
>>>> I feel the need to emphasize the centrality of the assertion by
>>>> Nathaniel, and agreement by (at least) me, that the NA case (there
>>>> really is no data) and the IGNORE case (there is data but I'm
>>>> concealing it from you) are conceptually different, and come from
>>>> different use-cases.
>>>>
>>>> The underlying disagreement returned many times to this fundamental
>>>> difference between the NEP and alterNEP:
>>>>
>>>> In the NEP - by design - it is impossible to distinguish between na.NA
>>>> and na.IGNORE
>>>> The alterNEP insists you should be able to distinguish.
>>>>
>>>> Mark says something like "it's all missing data, there's no reason you
>>>> should want to distinguish".  Nathaniel and I were saying "the two
>>>> types of missing do have different use-cases, and it should be
>>>> possible to distinguish.  You might want to choose to treat them the
>>>> same, but you should be able to see what they are.".
>>>>
>>>> I returned several times to this (original point by Nathaniel):
>>>>
>>>> a[3] = np.NA
>>>>
>>>> (what does this mean?  Am I altering the underlying array, or a mask?
>>>>     How would I explain this to someone?)
>>>>
>>>> We confirmed that, in order to make it difficult to know what your NA
>>>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>>>> below the mask and b) prevent direct API access to the masking array.
>>>> I described this as 'hobbling the API' and Mark thought of it as
>>>> 'generic programming' (missing is always missing).
>>>
>>> Here's an HPC perspective...:
>>>
>>> If you, say, want to off-load array processing with a mask to some code
>>> running on a GPU, you really can't have the GPU go through some NumPy
>>> API. Or if you want to implement a masked array on a cluster with MPI,
>>> you similarly really, really want raw access.
>>>
>>> At least I feel that the transparency of NumPy is a huge part of its
>>> current success. Many more than me spend half their time in C/Fortran
>>> and half their time in Python.
>>>
>>> I tend to look at NumPy this way: Assuming you have some data in memory
>>> (possibly loaded by a C or Fortran library). (Almost) no matter how it
>>> is allocated, ordered, packed, aligned -- there's a way to find strides
>>> and dtypes to put a nice NumPy wrapper around it and use the memory from
>>> Python.
>>>
>>> So, my view on Mark's NEP was: With a reasonable amount of flexibility
>>> in how you decided to implement masking for your data, you can create a
>>> NumPy wrapper that will understand that. Whether your Fortran library
>>> exposes NAs in its 40GB buffer as bit patterns, or using a separate
>>> mask, both will work.
>>>
>>> And IMO Mark's NEP comes rather close to this, you just need an
>>> additional NEP later to give raw access to the implementation details,
>>> once those are settled :-)
>>
>> I was a little puzzled as to what you were trying to say, but I
>> suspect that's my ignorance about Numpy internals.
>>
>> Superficially, I would have assumed that making masked and
>> bit-pattern NAs behave the same in numpy would take you away from the
>> raw data, in the sense that you not only need the dtype, you also need
>> the mask machinery, in order to know if you have an NA.   Later I
>> realized that you probably weren't saying that.  So, just for my
>> unhappy ignorance - how does the HPC perspective relate to the debate
>> about "can / can't distinguish NA from ignore"?
>
> I just commented on the "prevent direct API access to the masking array"
> part -- I'm hoping direct access by external code to the underlying
> implementation details will be allowed, at some point.
>
> What I'm saying is that Mark's proposal is more flexible. Say for the
> sake of the argument that I have two codes I need to interface with:
>
>  - Library A is written in Fortran and uses a separate (explicit) mask
> array for NA
>
>  - Library B runs on a GPU and uses a bit pattern for NA
>
> Mark's proposal then comes closer to allowing me to wrap both codes
> using NumPy, since it supports both implementation mechanisms. Sure, it
> would need a separate NEP down the road to extend it, but it goes in the
> right direction for this to happen.

I'm sorry - honestly - maybe it's because I've just had lunch, but I
think I am not understanding something.   When you say "Mark's
proposal is more flexible" - more flexible than what?  I think we
agree that:

* NA bitpatterns are good to have
* masks are good to have

and the discussion is about:

* should it be possible to distinguish between bitpatterns (NAs) and
masks (IGNORE).

Are you saying that making it impossible to distinguish, at the
numpy level, is more flexible?
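
To make the question concrete, here is a minimal sketch in plain Python
(standard library only; it is not the API of the NEP or the alterNEP).
All the names in it (bitpattern_data, masked_data, ignore_mask and the
two helper functions) are hypothetical, and NaN merely stands in for an
NA bit pattern.  It shows the two storage mechanisms giving the same
"what is missing?" answer, while differing in whether the original
value survives:

```python
import math
from array import array

# Bit-pattern NA: the missing marker lives in the data buffer itself.
# Assumption for illustration: NaN plays the role of the NA bit pattern.
bitpattern_data = array("d", [1.0, 2.0, 3.0, 4.0])
bitpattern_data[3] = math.nan  # the original 4.0 is overwritten

# Mask-based IGNORE: a separate mask hides a value that is still stored.
masked_data = array("d", [1.0, 2.0, 3.0, 4.0])
ignore_mask = array("b", [0, 0, 0, 0])
ignore_mask[3] = 1  # the 4.0 is still present in masked_data


def is_missing_bitpattern(data):
    """Hypothetical helper: missing means the element holds the NA pattern."""
    return [math.isnan(x) for x in data]


def is_missing_masked(mask):
    """Hypothetical helper: missing means the mask flag is set."""
    return [bool(m) for m in mask]


# Seen only through "is it missing?", the two cases look identical.
same_view = is_missing_bitpattern(bitpattern_data) == is_missing_masked(ignore_mask)

# With access to the mask, the IGNORE case can be undone; the NA case cannot.
ignore_mask[3] = 0
recovered = masked_data[3]
```

If the only interface is the "is it missing?" view, nothing tells the
caller which of the two cases they are in; seeing the difference needs
access to the mask or to the data underneath it.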

Cheers,

Matthew