HPC missing data - was: NA/Missing Data Conference Call Summary
Hi, Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday. On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 07/06/2011 02:05 PM, Matthew Brett wrote:
Hi,
Just for reference, I am using this as the latest version of the NEP - I hope it's current:
https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b...
I'm mostly relaying stuff I said, although generally (please do correct me if I am wrong) I am just re-expressing points that Nathaniel has already made in the alterNEP text and the emails.
On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote: ...
Since Mark is only around Austin until early August, there's also broad agreement that we need to get something done quickly.
I think I might have missed that part of the discussion :)
I feel the need to emphasize the centrality of the assertion by Nathaniel, and agreement by (at least) me, that the NA case (there really is no data) and the IGNORE case (there is data but I'm concealing it from you) are conceptually different, and come from different use-cases.
The underlying disagreement returned many times to this fundamental difference between the NEP and alterNEP:
In the NEP - by design - it is impossible to distinguish between na.NA and na.IGNORE. The alterNEP insists you should be able to distinguish.
Mark says something like "it's all missing data, there's no reason you should want to distinguish". Nathaniel and I were saying "the two types of missing do have different use-cases, and it should be possible to distinguish. You might want to choose to treat them the same, but you should be able to see what they are."
I returned several times to this (original point by Nathaniel):
a[3] = np.NA
(what does this mean? I am altering the underlying array, or a mask? How would I explain this to someone?)
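For concreteness, the two readings of that assignment can be illustrated with today's tools as stand-ins (a sketch only -- np.NA is the NEP's proposal and does not exist in current NumPy; numpy.ma plays the IGNORE role, NaN the bit-pattern NA role):

```python
import numpy as np

# IGNORE-style missing: a mask conceals the value, but it is still there.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False] * 4)
a[3] = np.ma.masked      # only the mask changes...
print(a.data[3])         # 4.0 -- the underlying value survives
a.mask[3] = False        # ...and unmasking recovers it

# NA-style missing: a bit pattern overwrites the value for good.
b = np.array([1.0, 2.0, 3.0, 4.0])
b[3] = np.nan            # the original 4.0 is gone
```

Under the NEP, both of these would be spelled `a[3] = np.NA`, which is exactly the ambiguity the question above is getting at.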
We confirmed that, in order to make it difficult to know what your NA is (masked or bit-pattern), Mark has to a) hinder access to the data below the mask and b) prevent direct API access to the masking array. I described this as 'hobbling the API' and Mark thought of it as 'generic programming' (missing is always missing).
Here's an HPC perspective...:
If you, say, want to off-load array processing with a mask to some code running on a GPU, you really can't have the GPU go through some NumPy API. Or if you want to implement a masked array on a cluster with MPI, you similarly really, really want raw access.
At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.
I tend to look at NumPy this way: assume you have some data in memory (possibly loaded by a C or Fortran library). (Almost) no matter how it is allocated, ordered, packed, or aligned, there's a way to find strides and dtypes to put a nice NumPy wrapper around it and use the memory from Python.
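That view can be made concrete with a small sketch (the ctypes buffer here just stands in for memory allocated by a C or Fortran library):

```python
import ctypes
import numpy as np

# Memory allocated outside NumPy, wrapped zero-copy by picking dtype,
# shape and order to match the library's layout.
buf = (ctypes.c_double * 12)(*range(12))

# Say the library filled it as a 3x4 Fortran-ordered (column-major) array:
a = np.frombuffer(buf, dtype=np.float64).reshape((3, 4), order='F')

a[0, 0] = 99.0           # writes go straight to the library's memory
print(buf[0])            # 99.0 -- no copy was made
```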
So, my view on Mark's NEP was: with a reasonable amount of flexibility in how you decide to implement masking for your data, you can create a NumPy wrapper that will understand it. Whether your Fortran library exposes NAs in its 40GB buffer as bit patterns, or using a separate mask, both will work.
And IMO Mark's NEP comes rather close to this; you just need an additional NEP later to give raw access to the implementation details, once those are settled :-)
I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals. Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about "can / can't distinguish NA from IGNORE"? Sorry, thanks, Matthew
On 07/06/2011 02:46 PM, Matthew Brett wrote:
Hi,
Sorry, I hope you don't mind, I moved this to its own thread, trying to separate comments on the NA debate from the discussion yesterday.
I'm sorry.
I was a little puzzled as to what you were trying to say, but I suspect that's my ignorance about Numpy internals.
Superficially, I would have assumed that making masked and bit-pattern NAs behave the same in numpy would take you away from the raw data, in the sense that you not only need the dtype, you also need the mask machinery, in order to know if you have an NA. Later I realized that you probably weren't saying that. So, just for my unhappy ignorance - how does the HPC perspective relate to the debate about "can / can't distinguish NA from IGNORE"?
I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.

What I'm saying is that Mark's proposal is more flexible. Say, for the sake of argument, that I have two codes I need to interface with:

- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen.

As for NA vs. IGNORE, I still think 2 types is too little. One should allow for 255 different NA values, each with user-defined behaviour. Again, Mark's proposal makes a good start on that, even if more work would be needed to make it happen.

I.e., in my perfect world I'd do this to wrap library A (Cythonish pseudo-code):

```
def call_lib_A():
    ...
    lib_A_function(arraybuf, maskbuf, ...)
    # behaviour could also be "zero" or "invalid"
    DOG_ATE_IT = np.NA("DOG_ATE_IT", value=42, behaviour="raise")
    missing_value_map = {0xAF: np.NA, 0x43: np.IGNORE, 0xF0: DOG_ATE_IT}
    result = np.PyArray_CreateArrayFromBufferWithMaskBuffer(
        arraybuf, maskbuf, missing_value_map, ...)
    return result

def call_lib_B():
    lib_B_function(arraybuf, ...)
    missing_value_patterns = {0xFFFFCACA: np.NA}
    result = np.PyArray_CreateArrayFromBufferWithBitPattern(
        arraybuf, missing_value_patterns, ...)
    return result
```

Hope that is clearer. Again, my intention is not to suggest even more work at the present stage, just to state some advantages with the general direction of Mark's proposal.

Dag Sverre
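For what it's worth, the Library-A case can already be approximated with today's numpy.ma. A rough sketch (arraybuf and maskbuf are simulated here; in practice they would come from the Fortran side, and numpy.ma gives you only a single IGNORE-like missing type, not a per-value missing_value_map):

```python
import numpy as np

# A data buffer plus a separate mask buffer, wrapped as a masked array.
arraybuf = np.array([1.0, 2.0, 3.0, 4.0])
maskbuf = np.array([0, 1, 0, 0], dtype=np.uint8)   # nonzero == missing

# view(bool) reinterprets the uint8 mask without copying the buffer
wrapped = np.ma.masked_array(arraybuf, mask=maskbuf.view(bool))
print(wrapped.sum())     # 8.0 -- the masked 2.0 is skipped
```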
Hi, On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Mark's proposal then comes closer to allowing me to wrap both codes using NumPy, since it supports both implementation mechanisms. Sure, it would need a separate NEP down the road to extend it, but it goes in the right direction for this to happen.
I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say "Mark's proposal is more flexible" - more flexible than what? I think we agree that:

* NA bitpatterns are good to have
* masks are good to have

and the discussion is about:

* should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE)?

Are you saying that making it not possible to distinguish - at the numpy level - is more flexible?

Cheers, Matthew
On 07/06/2011 04:47 PM, Matthew Brett wrote:
Hi,
I'm sorry - honestly - maybe it's because I've just had lunch, but I think I am not understanding something. When you say "Mark's proposal is more flexible" - more flexible than what? I think we agree that:
* NA bitpatterns are good to have
* masks are good to have
and the discussion is about:
* should it be possible to distinguish between bitpatterns (NAs) and masks (IGNORE).
I guess I just don't agree with these definitions. There's (NA, IGNORE), and there's (bitpatterns, masks); these are in principle orthogonal. It is possible (and perhaps reasonable) to hard-wire them the way you say -- that may be more obvious, user-friendly, etc., but it is not more flexible. Both Mark and Chuck have explicitly supported having many different NA types down the road (thread: "An NA compromise idea -- many-NA"). So the main difference to me seems to be that you want to hard-wire the NA type and the representation in a specific configuration. I may be missing something though.
Are you saying that making it not-possible to distinguish - at the numpy level, is more flexible?
I'm OK with the "common" ways of accessing data not distinguishing, as long as there's some power-user way around it. Just like strides -- you index a strided array just like a contiguous array, but you can peek inside the implementation if you want.

Dag Sverre
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops. Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking. Upthread you also wrote:
At least I feel that the transparency of NumPy is a huge part of its current success. Many more than me spend half their time in C/Fortran and half their time in Python.
It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to see.

According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory?

-- Nathaniel
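For reference, the in-place unmask described here is indeed a one-liner with today's numpy.ma (a sketch; it is the NEP's proposed masked arrays, not numpy.ma, that would forbid this without first copying the mask):

```python
import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
a.mask[1] = False        # unmask element 1 in place; no copy of the mask
print(a[1])              # 2.0 -- the concealed value reappears
```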
On 07/06/2011 08:10 PM, Nathaniel Smith wrote:
On Wed, Jul 6, 2011 at 6:12 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
What I'm saying is that Mark's proposal is more flexible. Say for the sake of the argument that I have two codes I need to interface with:
- Library A is written in Fortran and uses a separate (explicit) mask array for NA
- Library B runs on a GPU and uses a bit pattern for NA
Have you ever encountered any such codes? I'm not aware of any code outside of R that implements the proposed NA semantics -- esp. in high-performance code, people generally want to avoid lots of conditionals, and the proposed NA semantics require a branch around every operation inside your inner loops.
I'll admit that this whole thing was a hypothetical exercise. I've interfaced with Fortran code with NA values -- not a high-performance case, but not everything you interface with is high performance.
Certainly there is code out there that uses NaNs, and code that uses masks (in various ways that might or might not match the way the NEP uses them). And it's easy to work with both from numpy right now. The question is whether and how the core should add some tricky and subtle semantics for a few very specific ways of handling NaN-like objects and masking.
I don't disagree with this.
It's exactly this transparency that worries Matthew and me -- we feel that the alterNEP preserves it, and the NEP attempts to erase it. In the NEP, there are two totally different underlying data structures, but this difference is blurred at the Python level. The idea is that you shouldn't have to think about which you have, but if you work with C/Fortran, then of course you do have to be constantly aware of the underlying implementation anyway. And operations which would obviously make sense for the some of the objects that you know you're working with (e.g., unmasking elements from a masked array, or even accessing the mask directly using numpy slicing) are disallowed, specifically in order to make this distinction harder to make.
This worries me too. What I was thinking is that it could be sort of like indexing -- it works OK to have indexing be transparent in Python-land with respect to striding, and have a contiguous array be just a special case marked by an attribute. If you want, you can still check the strides or flags attributes.
According to the NEP, C code that takes a masked array should never ever unmask any element; unmasking should only be done by making a full copy of the mask, and attaching it to a new view taken from the original array. Would you honestly feel obliged to follow this requirement in your C code? Or would you just unmask elements in place when it made sense, in order to save memory?
I'm with you on this one: I wouldn't adopt any NumPy feature widely unless I had totally transparent access to the underlying implementation details from C -- without relying on any NumPy headers (except in my Cython wrappers)! I don't believe in APIs, I believe in standardized binary data. But I always assumed that could be done down the road, once the internal details had stabilized.

As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use. Perhaps that's a sign I should withdraw from the discussion.

Dag Sverre
On Wed, Jul 06, 2011 at 08:39:37PM +0200, Dag Sverre Seljebotn wrote:
As for myself, I'll admit that I'll almost certainly continue with explicit masking without using any of the proposed NEPs -- I have to be extremely aware of the masks in the statistical methods I use.
My gut feeling is that I am in the same case. G
On Wed, Jul 6, 2011 at 8:12 AM, Dag Sverre Seljebotn < d.s.seljebotn@astro.uio.no> wrote:
<snip> I just commented on the "prevent direct API access to the masking array" part -- I'm hoping direct access by external code to the underlying implementation details will be allowed, at some point.
I think direct or nearly direct access needs to be in right away, unless we're fairly sure that we will change low level implementation details in the near future. I've added "Python API" and "C API" definitions for us to use to try and clear up this kind of potential confusion. -Mark
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (5)

- Dag Sverre Seljebotn
- Gael Varoquaux
- Mark Wiebe
- Matthew Brett
- Nathaniel Smith