[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary

Wed Jul 6 14:39:37 EDT 2011

On 07/06/2011 08:10 PM, Nathaniel Smith wrote:
> On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> What I'm saying is that Mark's proposal is more flexible. Say for the
>> sake of the argument that I have two codes I need to interface with:
>>
>>   - Library A is written in Fortran and uses a separate (explicit) mask
>> array for NA
>>
>>   - Library B runs on a GPU and uses a bit pattern for NA
>
> Have you ever encountered any such codes? I'm not aware of any code
> outside of R that implements the proposed NA semantics -- esp. in
> high-performance code, people generally want to avoid lots of
> conditionals, and the proposed NA semantics require a branch around
> every operation inside your inner loops.

I'll admit that this whole thing was a hypothetical exercise.

I've interfaced with Fortran code with NA values -- not a high 
performance case, but not everything you interface with is high performance.

> Certainly there is code out there that uses NaNs, and code that uses
> masks (in various ways that might or might not match the way the NEP
> uses them). And it's easy to work with both from numpy right now. The
> question is whether and how the core should add some tricky and subtle
> semantics for a few very specific ways of handling NaN-like objects
> and masking.

I don't disagree with this.

> It's exactly this transparency that worries Matthew and me -- we feel
> that the alterNEP preserves it, and the NEP attempts to erase it. In
> the NEP, there are two totally different underlying data structures,
> but this difference is blurred at the Python level. The idea is that
> you shouldn't have to think about which you have, but if you work with
> C/Fortran, then of course you do have to be constantly aware of the
> underlying implementation anyway. And operations which would obviously
> make sense for some of the objects that you know you're working
> with (e.g., unmasking elements from a masked array, or even accessing
> the mask directly using numpy slicing) are disallowed, specifically in
> order to make this distinction harder to make.

This worries me too.

What I was thinking is that it could be sort of like indexing -- it 
works OK to have indexing be transparent in Python-land with respect to 
striding, and have a contiguous array be just a special case marked by 
an attribute. If you want, you can still check the strides or flags 
attributes.
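
A minimal sketch of the analogy, using current NumPy only (the array names are purely illustrative): indexing reads the same on a contiguous array and on a strided view, yet the memory layout stays inspectable through the strides and flags attributes whenever it matters.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C-contiguous array
v = a[:, ::2]                                    # strided view of a, no copy

# Indexing looks the same regardless of the underlying layout.
elem_a = a[1, 2]
elem_v = v[1, 1]          # refers to the same element as a[1, 2]

# The layout is still available for inspection when you need it.
a_contig = a.flags['C_CONTIGUOUS']
v_contig = v.flags['C_CONTIGUOUS']
a_strides = a.strides     # byte steps per dimension
v_strides = v.strides
shares = np.shares_memory(a, v)
```

Whether masked and bit-pattern NA arrays could be distinguished as cheaply as this is exactly the open question; the sketch only shows the striding side of the comparison.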

> According to the NEP, C code that takes a masked array should never
> ever unmask any element; unmasking should only be done by making a
> full copy of the mask, and attaching it to a new view taken from the
> original array. Would you honestly feel obliged to follow this
> requirement in your C code? Or would you just unmask elements in place
> when it made sense, in order to save memory?

I'm with you on this one: I wouldn't adopt any NumPy feature widely 
unless I had totally transparent access to the underlying implementation 
details from C -- without relying on any NumPy headers (except in my 
Cython wrappers)!

I don't believe in APIs, I believe in standardized binary data.

But I always assumed that could be done down the road, once the internal 
details had stabilized.

As for myself, I'll admit that I'll almost certainly continue with 
explicit masking without using any of the proposed NEPs -- I have to be 
extremely aware of the masks in the statistical methods I use.
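
A minimal sketch of what explicit masking can look like with plain arrays, assuming a boolean mask where True means "usable". The names data, valid and masked_mean, and the -999.0 filler value, are made up for illustration and not taken from any real code.

```python
import numpy as np

data = np.array([1.0, 2.0, -999.0, 4.0])      # -999.0 is an arbitrary filler
valid = np.array([True, True, False, True])   # explicit mask, True = usable

def masked_mean(values, valid):
    """Mean over the entries flagged valid; NaN if nothing is valid."""
    n = np.count_nonzero(valid)
    if n == 0:
        return np.nan
    return values[valid].sum() / n

mean_valid = masked_mean(data, valid)

# The same information carried in-band instead, with NaN as the marker.
as_nan = np.where(valid, data, np.nan)
mean_nan = np.nanmean(as_nan)
```

With the mask held as a separate array, the statistical code decides at each step how the missing entries enter the computation, and converting to an in-band marker stays a one-line operation at the boundary.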

Perhaps that's a sign I should withdraw from the discussion.

Dag Sverre