
Hi,

On Fri, Oct 28, 2011 at 11:16 AM, Benjamin Root <ben.root@ou.edu> wrote:
On Fri, Oct 28, 2011 at 12:39 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root@ou.edu> wrote:
On Thursday, October 27, 2011, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <oliphant@enthought.com> wrote:
That is a pretty good explanation. I find myself convinced by Matthew's arguments. I think that being able to separate ABSENT from IGNORED is a good idea. I also like being able to control SKIP and PROPAGATE (but I think the current implementation allows this already).
What is the counter-argument to this proposal?
What exactly do you find convincing? The current masks propagate by default:
In [1]: a = ones(5, maskna=1)
In [2]: a[2] = NA
In [3]: a
Out[3]: array([ 1., 1., NA, 1., 1.])
In [4]: a + 1
Out[4]: array([ 2., 2., NA, 2., 2.])
In [5]: a[2] = 10
In [6]: a
Out[6]: array([ 1., 1., 10., 1., 1.], maskna=True)
I don't see an essential difference between the implementation using masks and one using bit patterns: the mask, when attached to the original array, just adds a bit pattern by extending all the types by one byte, an approach that easily extends to all existing and future types, which is why Mark went that way for the first implementation given the time available. The masks are hidden because folks wanted something that behaved more like R, and also because of the desire to combine missing, ignore, and possibly later bit patterns in a unified manner. Note that the pseudo-assignment was also meant to look like R. Adding true bit patterns to numpy isn't trivial, and I believe Mark was thinking of parametrized types for that.
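To make the storage point concrete, here is a rough sketch (illustrative only; it emulates an attached mask with a separate uint8 array rather than using the actual maskna machinery):

import numpy as np

# Illustrative only: pair the data with a one-byte-per-element validity mask.
data = np.ones(5)                     # float64 payload, 8 bytes per element
valid = np.ones(5, dtype=np.uint8)    # 1 extra byte per element; 0 marks NA

valid[2] = 0                          # "assign NA" to element 2

# Per-element cost is the dtype's itemsize plus one mask byte, whatever
# the dtype is -- which is how the mask approach extends to all types.
print(data.itemsize + valid.itemsize)   # 9 for float64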
The main problems I see with masks are unified storage and possibly memory use. The rest is just behavior and desired API, and that can be adjusted within the current implementation. There is nothing essentially masky about masks.
Chuck
I think Chuck sums it up quite nicely. The implementation detail of using masks versus bit patterns can still be discussed and addressed. Personally, I just don't see how parameterized dtypes would be easier to use than the pseudo-assignment.
The elegance of Mark's solution was to treat missing data in a unified manner. This puts missing data in a more prominent spot for extension builders, which should greatly improve support throughout the ecosystem.
Are extension builders then required to use the numpy C API to get their data? Speaking as an extension builder, I would rather you gave me the mask and the bitpattern information and let me do that myself.
Forgive me, I wasn't clear. What I am speaking of is more about a typical human failing. If the programmer of a module never encounters masked arrays, then when they code up a function to operate on numpy data, it is quite likely that they would never take them into consideration. Notice the prolific use of np.asarray(), which destroys masked arrays, even within the numpy codebase.
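For example, with plain numpy.ma (just to illustrate the failure mode):

import numpy as np

# np.asarray() hands back a plain ndarray and silently drops the mask.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(m.sum())        # 4.0 -- the masked element is skipped

a = np.asarray(m)     # plain ndarray, mask is gone
print(a.sum())        # 6.0 -- the "missing" value is back in the sum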
Hmm - that sounds like it could cause some surprises. So, what you were saying was just that it was good that masked arrays were now closer to the core? That's reasonable, but I don't think it's relevant to the current discussion. I think we all agree it is nice to have masked arrays in the core.
However, by making missing data support more integral to the core of numpy, it is far more likely that a programmer would take it into consideration when designing their algorithm, or at least explicitly document that their module does not support missing data. Both NEPs do this by making missing data front-and-center. However, my belief is that Mark's approach is easier to comprehend and is cleaner. A cleaner feature is more likely to be used.
The main motivation for the alterNEP was our strong feeling that separating ABSENT and IGNORE was easier to comprehend and cleaner. I think it would be hard to argue that the alterNEP idea is not more explicit.
By having a single missing data framework (instead of two), all that users need to figure out is whether they want nan-like behavior (propagate) or mask-like behavior (skip). Numpy takes care of the rest. There is a reason I like using masked arrays: I don't have to use nansum in my library functions to guard against the possibility of receiving NaNs. Duck-typing is a good thing.
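For instance, with numpy.ma in ordinary released numpy (not the NA branch):

import numpy as np

x = np.array([1.0, np.nan, 3.0])

# Without masked arrays, library code has to remember nansum:
print(np.sum(x))       # nan  -- propagates
print(np.nansum(x))    # 4.0  -- only if the author thought of it

# With a masked array, the ordinary sum already skips the bad value,
# so generic code that just calls sum() keeps working:
m = np.ma.masked_invalid(x)
print(m.sum())         # 4.0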
My argument against separating IGNORE and PROPAGATE is that it becomes too tempting to mix these in an array, but the desired behavior would likely become ambiguous.
There is one other problem I just thought of that I don't think has been outlined in either NEP. What if I perform an operation between an array set up with propagating NAs and an array set up with skipping NAs?
These are explicitly covered in the alterNEP:
Sort of. You speak of reduction operations for a single array with a mix of NAs and IGNOREs. I guess in that case it wouldn't make a difference for element-wise operations between two arrays (plus the "NAs propagate harder" rule). Although, what if skipna=True? I would feel better seeing explicit examples for different combinations of settings (plus, how would one set those for math operators?). In this case, I have a problem with the mixed situation. I would think that IGNORE + NA = IGNORE, because if you are skipping it, then it is skipped, regardless of the other side of the operator (precedent: a masked array summed against an array of NaNs).
I'm using IGNORED as a type of value. What happens to that value depends on what you said to do with that value: you might want to SKIP that type of value, or PROPAGATE it. If you said to 'skip' IGNORED but 'propagate' ABSENT, then IGNORED + ABSENT == ABSENT. I think it isn't ambiguous, but I'm happy to be corrected.
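To spell out the rule I mean, here is a toy sketch (ABSENT, IGNORED and na_sum are made-up names for illustration, not numpy API or either NEP's spelling):

# Toy sketch of the "skip IGNORED, propagate ABSENT" rule described above.
ABSENT = "ABSENT"     # you said 'propagate': any result touching it is ABSENT
IGNORED = "IGNORED"   # you said 'skip': it contributes nothing

def na_sum(values):
    total = 0.0
    for v in values:
        if v == ABSENT:       # propagate wins
            return ABSENT
        if v == IGNORED:      # skipped entirely
            continue
        total += v
    return total

print(na_sum([1.0, IGNORED, 2.0]))     # 3.0
print(na_sum([1.0, IGNORED, ABSENT]))  # ABSENT, i.e. IGNORED + ABSENT == ABSENT

Best,

Matthew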