HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday. On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 07/06/2011 02:05 PM, Matthew Brett wrote:
Hi,
Just for reference, I am using this as the latest version of the NEP - I hope it's current:
https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b...
I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails.
On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote: ...
Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.
I think I might have missed that part of the discussion :)
I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases.
The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP:
In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE. The alterNEP insists you should be able to distinguish.
Mark says something like "it's all missing data, there's no reason you should want to distinguish". Nathaniel and I were saying "the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are."
I returned several times to this (original point by Nathaniel):
a[3] = np.NA
(what does this mean? I am altering the underlying array, or a mask? How would I explain this to someone?)
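For concreteness, the two readings of that assignment can be illustrated with today's tools as stand-ins (a sketch only -- np.NA is the NEP's proposal and does not exist in current NumPy; numpy.ma plays the IGNORE role, NaN the bit-pattern NA role):

```python
import numpy as np

# IGNORE-style missing: a mask conceals the value, but it is still there.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False] * 4)
a[3] = np.ma.masked      # only the mask changes...
print(a.data[3])         # 4.0 -- the underlying value survives
a.mask[3] = False        # ...and unmasking recovers it

# NA-style missing: a bit pattern overwrites the value for good.
b = np.array([1.0, 2.0, 3.0, 4.0])
b[3] = np.nan            # the original 4.0 is gone
```

Under the NEP, both of these would be spelled `a[3] = np.NA`, which is exactly the ambiguity the question above is getting at.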
We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing).
Here's an HPC perspective...:
If you, say, want to off-load array processing with a mask to some code running on a GPU, you really can't have the GPU go through some NumPy API. Or if you want to implement a masked array on a cluster with MPI, you similarly really, really want raw access.
At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.
I tend to look at NumPy this way: assume you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, or aligned, there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python.
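That view can be made concrete with a small sketch (the ctypes buffer here just stands in for memory allocated by a C or Fortran library):

```python
import ctypes
import numpy as np

# Memory allocated outside NumPy, wrapped zero-copy by picking dtype,
# shape and order to match the library's layout.
buf = (ctypes.c_double * 12)(*range(12))

# Say the library filled it as a 3x4 Fortran-ordered (column-major) array:
a = np.frombuffer(buf, dtype=np.float64).reshape((3, 4), order='F')

a[0, 0] = 99.0           # writes go straight to the library's memory
print(buf[0])            # 99.0 -- no copy was made
```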
So, my view on Mark's NEP was: with a reasonable amount of flexibility in how you decide to implement masking for your data, you can create a NumPy wrapper that will understand it. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work.
And IMO Mark's NEP comes rather close to this; you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)
I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals. Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about "can / can't distinguish NA from IGNORE"? Sorry, thanks, Matthew
On 07/06/2011 02:46 PM, Matthew Brett wrote:
Hi,
Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.
I'm sorry.
I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals.
Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about "can / can't distinguish NA from IGNORE"?
I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.

What I'm saying is that Mark's proposal is more flexible. Say, for the sake of argument, that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen.

As for NA vs. IGNORE, I still think 2 types is too little. One should allow for 255 different NA values, each with user-defined behaviour. Again, Mark's proposal makes a good start on that, even if more work would be needed to make it happen.

I.e., in my perfect world I'd do this to wrap library A (Cythonish pseudo-code):

```
def call_lib_A():
    ...
    lib_A_function(arraybuf, maskbuf, ...)
    # behaviour could also be "zero" or "invalid"
    DOG_ATE_IT = np.NA("DOG_ATE_IT", value=42, behaviour="raise")
    missing_value_map = {0xAF: np.NA, 0x43: np.IGNORE, 0xF0: DOG_ATE_IT}
    result = np.PyArray_CreateArrayFromBufferWithMaskBuffer(
        arraybuf, maskbuf, missing_value_map, ...)
    return result

def call_lib_B():
    lib_B_function(arraybuf, ...)
    missing_value_patterns = {0xFFFFCACA: np.NA}
    result = np.PyArray_CreateArrayFromBufferWithBitPattern(
        arraybuf, missing_value_patterns, ...)
    return result
```

Hope that is clearer. Again, my intention is not to suggest even more work at the present stage, just to state some advantages with the general direction of Mark's proposal.

Dag Sverre
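For what it's worth, the Library-A case can already be approximated with today's numpy.ma. A rough sketch (arraybuf and maskbuf are simulated here; in practice they would come from the Fortran side, and numpy.ma gives you only a single IGNORE-like missing type, not a per-value missing_value_map):

```python
import numpy as np

# A data buffer plus a separate mask buffer, wrapped as a masked array.
arraybuf = np.array([1.0, 2.0, 3.0, 4.0])
maskbuf = np.array([0, 1, 0, 0], dtype=np.uint8)   # nonzero == missing

# view(bool) reinterprets the uint8 mask without copying the buffer
wrapped = np.ma.masked_array(arraybuf, mask=maskbuf.view(bool))
print(wrapped.sum())     # 8.0 -- the masked 2.0 is skipped
```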
Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen.
I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say "Mark's proposal is more flexible" - more flexible than what? I think we agree that:

* NA bitpatterns are good to have
* masks are good to have

and the discussion is about:

* should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE)?

Are you saying that making it not possible to distinguish - at the numpy level - is more flexible?

Cheers, Matthew
On 07/06/2011 04:47 PM, Matthew Brett wrote:
Hi,
I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say "Mark's proposal is more flexible" - more flexible than what? I think we agree that:
* NA bitpatterns are good to have
* masks are good to have
and the discussion is about:
* should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE).
I guess I just don't agree with these definitions. There's (NA, IGNORE), and there's (bitpatterns, masks); these are in principle orthogonal. It is possible (and perhaps reasonable) to hard-wire them the way you say -- that may be more obvious, user-friendly, etc., but it is not more flexible. Both Mark and Chuck have explicitly supported having many different NA types down the road (thread: "An NA compromise idea -- many-NA"). So the main difference to me seems to be that you want to hard-wire the NA type and the representation in a specific configuration. I may be missing something though.
Are you saying that making it not-possible to distinguish - at the numpy level, is more flexible?
I'm OK with the "common" ways of accessing data not distinguishing, as long as there's some power-user way around it. Just like strides -- you index a strided array just like a contiguous array, but you can peek inside the implementation if you want.

Dag Sverre
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. Upthread you also wrote:
At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.
It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to see.

According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory?

-- Nathaniel
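For reference, the in-place unmask described here is indeed a one-liner with today's numpy.ma (a sketch; it is the NEP's proposed masked arrays, not numpy.ma, that would forbid this without first copying the mask):

```python
import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
a.mask[1] = False        # unmask element 1 in place; no copy of the mask
print(a[1])              # 2.0 -- the concealed value reappears
```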
On 07/06/2011 08:10 PM, Nathaniel Smith wrote:
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops.
I'll admit that this whole thing was a hypothetical exercise. I've interfaced with Fortran code with NA values -- not a high-performance case, but not everything you interface with is high performance.
Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking.
I don't disagree with this.
It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for the some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make.
This worries me too. What I was thinking is that it could be sort of like indexing -- it works OK to have indexing be transparent in Python-land with respect to striding, and have a contiguous array be just a special case marked by an attribute. If you want, you can still check the strides or flags attributes.
According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory?
I'm with you on this one: I wouldn't adopt any NumPy feature widely unless I had totally transparent access to the underlying implementation details from C -- without relying on any NumPy headers (except in my Cython wrappers)! I don't believe in APIs, I believe in standardized binary data. But I always assumed that could be done down the road, once the internal details had stabilized.

As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. Perhaps that's a sign I should withdraw from the discussion.

Dag Sverre
On Wed, Jul 06, 2011 at 08:39:37PM +0200, Dag Sverre Seljebotn wrote:
As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use.
My gut feeling is that I am in the same case. G
On Wed, Jul 6, 2011 at 8:12 AM, Dag Sverre Seljebotn < d.s.seljebotn@astro.uio.no> wrote:
<snip> I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.
I think direct or nearly direct access needs to be in right away, unless we're fairly sure that we will change low level implementation details in the near future. I've added "Python API" and "C API" definitions for us to use to try and clear up this kind of potential confusion. -Mark
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (5)

- Dag Sverre Seljebotn
- Gael Varoquaux
- Mark Wiebe
- Matthew Brett
- Nathaniel Smith