Re: [Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary

July 6, 2011

On 07/06/2011 08:10 PM, Nathaniel Smith wrote:
...
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn
<d.s.seljebotn@astro.uio.no>  wrote:
...
What I'm saying is that Mark's proposal is more flexible. Say for the
sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask
array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Have you ever encountered any such codes? I'm not aware of any code
outside of R that implements the proposed NA semantics -- esp. in
high-performance code, people generally want to avoid lots of
conditionals, and the proposed NA semantics require a branch around
every operation inside your inner loops.
I'll admit that this whole thing was a hypothetical exercise.

I've interfaced with Fortran code with NA values -- not a high 
performance case, but not all you interface with is high performance.
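To make the two hypothetical layouts concrete, here is a minimal sketch of moving data between an explicit mask array (the "Library A" style) and an in-band sentinel (a stand-in for the "Library B" bit pattern) using only NumPy as it exists today. The array names and the convention that True means "missing" are assumptions for illustration, and NaN is used as the sentinel only because it is convenient; a real bit-pattern NA would be distinct from an ordinary NaN.

```python
import numpy as np

# Hypothetical "Library A" output: values plus a separate mask array.
# Assumed convention: True marks a missing (NA) entry.
values = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])

# Convert to an in-band sentinel layout (NaN stands in for an NA bit pattern).
as_nan = np.where(mask, np.nan, values)

# Convert back: recover the explicit mask from the sentinel.
# Caveat: any genuine NaN in the data would also be flagged as missing.
recovered_mask = np.isnan(as_nan)

# Wrap the sentinel array as a numpy.ma masked array for mask-aware reductions.
as_masked = np.ma.masked_invalid(as_nan)
total = as_masked.sum()  # sums only the unmasked entries
```

The round trip is lossless here only because the original data contain no NaNs of their own, which is exactly the kind of detail that has to stay visible when interfacing with external code.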
...
Certainly there is code out there that uses NaNs, and code that uses
masks (in various ways that might or might not match the way the NEP
uses them). And it's easy to work with both from numpy right now. The
question is whether and how the core should add some tricky and subtle
semantics for a few very specific ways of handling NaN-like objects
and masking.
I don't disagree with this.
...
It's exactly this transparency that worries Matthew and me -- we feel
that the alterNEP preserves it, and the NEP attempts to erase it. In
the NEP, there are two totally different underlying data structures,
but this difference is blurred at the Python level. The idea is that
you shouldn't have to think about which you have, but if you work with
C/Fortran, then of course you do have to be constantly aware of the
underlying implementation anyway. And operations which would obviously
make sense for some of the objects that you know you're working
with (e.g., unmasking elements from a masked array, or even accessing
the mask directly using numpy slicing) are disallowed, specifically in
order to make this distinction harder to make.
This worries me too.

What I was thinking is that it could be sort of like indexing -- it 
works OK to have indexing be transparent in Python-land with respect to 
striding, and have a contiguous array be just a special case marked by 
an attribute. If you want, you can still check the strides or flags 
attributes.
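The indexing analogy can be sketched as follows, assuming a float64 array (so each element is 8 bytes). Indexing behaves identically on a contiguous array and on a strided view, yet the layout stays inspectable through the `strides` and `flags` attributes for anyone who needs it.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-contiguous
b = a[:, ::2]                                      # strided view: every other column

# Indexing is transparent with respect to striding.
same_element = b[1, 1] == a[1, 2]

# The layout is still available on request.
a_contiguous = a.flags['C_CONTIGUOUS']   # True
b_contiguous = b.flags['C_CONTIGUOUS']   # False
a_strides = a.strides                    # (32, 8): bytes per step along each axis
b_strides = b.strides                    # (32, 16)

# b is a view, not a copy: it shares memory with a.
shared = np.shares_memory(a, b)
```

A masked array could in principle expose its representation the same way: transparent for everyday operations, but with the mask or bit pattern reachable through attributes rather than hidden.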
...
According to the NEP, C code that takes a masked array should never
ever unmask any element; unmasking should only be done by making a
full copy of the mask, and attaching it to a new view taken from the
original array. Would you honestly feel obliged to follow this
requirement in your C code? Or would you just unmask elements in place
when it made sense, in order to save memory?
I'm with you on this one: I wouldn't adopt any NumPy feature widely 
unless I had totally transparent access to the underlying implementation 
details from C -- without relying on any NumPy headers (except in my 
Cython wrappers)!

I don't believe in APIs, I believe in standardized binary data.

But I always assumed that could be done down the road, once the internal 
details had stabilized.

As for myself, I'll admit that I'll almost certainly continue with 
explicit masking without using any of the proposed NEPs -- I have to be 
extremely aware of the masks in the statistical methods I use.
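As an illustration of what explicit masking looks like in practice, here is a minimal sketch of a weighted mean that takes the mask as an ordinary argument. The function name, the data, and the convention that True means "missing" are all assumptions for the example, not a description of any particular method.

```python
import numpy as np

def masked_weighted_mean(values, weights, mask):
    """Weighted mean over the entries where mask is False (True = missing)."""
    valid = ~mask
    w = weights[valid]
    # Masked entries contribute neither to the numerator nor to the weight sum.
    return np.sum(w * values[valid]) / np.sum(w)

values = np.array([2.0, 100.0, 4.0])
weights = np.array([1.0, 5.0, 3.0])
mask = np.array([False, True, False])   # the second sample is missing

result = masked_weighted_mean(values, weights, mask)  # (1*2 + 3*4) / (1 + 3)
```

The point of writing it this way is that the treatment of the mask (here, dropping the masked weight from the normalization) is a visible modelling decision rather than something implied by the array type.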

Perhaps that's a sign I should withdraw from the discussion.

Dag Sverre