Mailman 3 alterNEP - was: missing data discussion round 2 - NumPy-Discussion


alterNEP - was: missing data discussion round 2


Matthew Brett

June 30, 2011

1:31 p.m.

Hi,

On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs@pobox.com> wrote:
> Anyway, it's pretty clear that in this particular case, there are two distinct features that different people want: the missing data feature, and the masked array feature. The more I think about it, the less I see how they can be combined into one dessert topping + floor wax solution. Here are three particular points where they seem to contradict each other: ... [some proposals]

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood:

################################################
An alternative-NEP on masking and missing values
################################################

The principle of this aNEP is to separate the APIs for masking and for missing values, according to

* The current implementation of masked arrays
* Nathaniel Smith's proposal.

This discussion is only of the API, and not of the implementation.

**************
Initialization
**************

First, missing values can be set and be displayed as ``np.NA, NA``::

    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA dtype::

    >>> np.array([1.0, 2.0, np.NA, 7.0])
    array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set and be displayed as ``np.MASKED, MASKED``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
    array([1., 2., MASKED, 7.], masked=True)

As the initialization is not ambiguous, this can be written without ``masked=True``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0])
    array([1., 2., MASKED, 7.], masked=True)

******
Ufuncs
******

By default, NA values propagate::

    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
    >>> np.sum(na_arr)
    NA('float64')

unless the ``skipna`` flag is set::

    >>> np.sum(na_arr, skipna=True)
    10.0

By default, masking does not propagate::

    >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
    >>> np.sum(masked_arr)
    10.0

unless the ``propmsk`` flag is set::

    >>> np.sum(masked_arr, propmsk=True)
    MASKED

An array can be masked, and contain NA values::

    >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])

In the default case, the behavior is obvious::

    >>> np.sum(both_arr)
    NA('float64')

It's also obvious what to do with ``skipna=True``::

    >>> np.sum(both_arr, skipna=True)
    10.0
    >>> np.sum(both_arr, skipna=True, propmsk=True)
    MASKED

To break the tie between NA and MSK, NAs propagate harder::

    >>> np.sum(both_arr, propmsk=True)
    NA('float64')

**********
Assignment
**********

is obvious in the NA case::

    >>> arr = np.array([1.0, 2.0, 7.0])
    >>> arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
    >>> na_arr[2] = np.NA
    >>> na_arr
    array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only via the mask::

    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
    >>> masked_arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> masked_arr[2] = np.MASKED
    TypeError('float() argument must be a string or a number')
    >>> masked_arr.visible[2] = False
    >>> masked_arr
    array([1., 2., MASKED], masked=True)

See y'all,

Matthew
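The reduction rules in the Ufuncs section can be checked with a small pure-Python sketch. None of this is NumPy API: `NA`, `MASKED` and `sketch_sum` are hypothetical stand-ins, and plain lists stand in for arrays. It only encodes the rules as stated in the draft (NA propagates unless `skipna`, masks are skipped unless `propmsk`, NA wins a tie).

```python
# Hypothetical sentinels standing in for the proposed np.NA and np.MASKED.
class _Sentinel:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")
MASKED = _Sentinel("MASKED")


def sketch_sum(values, skipna=False, propmsk=False):
    """Sum a list following the draft's rules; NA outranks MASKED on a tie."""
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    if has_na and not skipna:
        return NA        # NAs propagate by default
    if has_masked and propmsk:
        return MASKED    # masks propagate only on request
    # Otherwise skip both kinds of special value and add up the rest.
    return sum(v for v in values if v is not NA and v is not MASKED)


na_arr = [1.0, 2.0, NA, 7.0]
masked_arr = [1.0, 2.0, MASKED, 7.0]
both_arr = [1.0, 2.0, MASKED, NA, 7.0]

print(sketch_sum(na_arr))                               # NA
print(sketch_sum(na_arr, skipna=True))                  # 10.0
print(sketch_sum(masked_arr))                           # 10.0
print(sketch_sum(masked_arr, propmsk=True))             # MASKED
print(sketch_sum(both_arr, propmsk=True))               # NA
print(sketch_sum(both_arr, skipna=True, propmsk=True))  # MASKED
```

Checking for NA before MASKED is what makes "NAs propagate harder" fall out without a special case.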


Pierre GM

June 2011

1:58 p.m.


On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote:

> ################################################
> An alternative-NEP on masking and missing values
> ################################################

I like the idea of two different special values, np.NA for missing values, np.IGNORE for masked values. np.NA values in an array define what was implemented in numpy.ma as a 'hard mask' (where you can't unmask data), while np.IGNOREs correspond to the .mask in numpy.ma. Looks fairly unambiguous that way.

> **************
> Initialization
> **************
>
> First, missing values can be set and be displayed as ``np.NA, NA``::
>
>     >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>     array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> As the initialization is not ambiguous, this can be written without the NA dtype::
>
>     >>> np.array([1.0, 2.0, np.NA, 7.0])
>     array([1., 2., NA, 7.], dtype='NA[<f8]')
>
> Masked values can be set and be displayed as ``np.MASKED, MASKED``::
>
>     >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
>     array([1., 2., MASKED, 7.], masked=True)
>
> As the initialization is not ambiguous, this can be written without ``masked=True``::
>
>     >>> np.array([1.0, 2.0, np.MASKED, 7.0])
>     array([1., 2., MASKED, 7.], masked=True)

I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here.

> ******
> Ufuncs
> ******

All fine.

> **********
> Assignment
> **********
>
> is obvious in the NA case::
>
>     >>> arr = np.array([1.0, 2.0, 7.0])
>     >>> arr[2] = np.NA
>     TypeError('dtype does not support NA')
>     >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>     >>> na_arr[2] = np.NA
>     >>> na_arr
>     array([1., 2., NA], dtype='NA[<f8]')
>
> Direct assignment in the masked case is magic and confusing, and so happens only via the mask::
>
>     >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>     >>> masked_arr[2] = np.NA
>     TypeError('dtype does not support NA')
>     >>> masked_arr[2] = np.MASKED
>     TypeError('float() argument must be a string or a number')
>     >>> masked_arr.visible[2] = False
>     >>> masked_arr
>     array([1., 2., MASKED], masked=True)

What about the reverse case? When you assign a regular value to an np.NA/np.IGNORE item?

Matthew Brett

3:38 p.m.


Hi, On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM <pgmdevlist@gmail.com> wrote:

> On Jun 30, 2011, at 3:31 PM, Matthew Brett wrote:
> > [...]
> > As the initialization is not ambiguous, this can be written without ``masked=True``::
> >
> >     >>> np.array([1.0, 2.0, np.MASKED, 7.0])
> >     array([1., 2., MASKED, 7.], masked=True)
>
> I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here.

If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then obviously I mean it should be masked, so the 'masked=True' here is completely redundant, yes, I agree. And in fact:

    np.array([1.0, 2.0, np.MASKED, 7.0], masked=False)

should raise an error. On the other hand, if I make a normal array:

    arr = np.array([1.0, 2.0, 7.0])

and then do this:

    arr.visible[2] = False

then either I should raise an error (it's not a masked array), or, more magically, construct a mask on the fly. This somewhat breaks expectations though, because you might just have made a largish mask array without having any clue that that had happened.

> > ******
> > Ufuncs
> > ******
>
> All fine.
>
> > **********
> > Assignment
> > **********
> >
> > is obvious in the NA case::
> >
> >     >>> arr = np.array([1.0, 2.0, 7.0])
> >     >>> arr[2] = np.NA
> >     TypeError('dtype does not support NA')
> >     >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
> >     >>> na_arr[2] = np.NA
> >     >>> na_arr
> >     array([1., 2., NA], dtype='NA[<f8]')

OK

> > Direct assignment in the masked case is magic and confusing, and so happens only via the mask::
> > [...]
>
> What about the reverse case? When you assign a regular value to an np.NA/np.IGNORE item?

Well, for the np.NA case, this is straightforward:

    na_arr[2] = 3

It's just assignment. For ``masked_arr[2] = 3`` - I don't know, I guess whatever we are used to. What do you think?

Best,

Matthew

Pierre GM

4:03 p.m.


On Jun 30, 2011, at 5:38 PM, Matthew Brett wrote:

> On Thu, Jun 30, 2011 at 2:58 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
> > [...]
> > I'm not happy with this 'masked' parameter, at all. What's the point? Either you have np.NAs and/or np.IGNOREs or you don't. I'm probably missing something here.
>
> If I put np.MASKED (I agree I prefer np.IGNORE) in the init, then obviously I mean it should be masked, so the 'masked=True' here is completely redundant, yes, I agree. And in fact:
>
>     np.array([1.0, 2.0, np.MASKED, 7.0], masked=False)
>
> should raise an error. On the other hand, if I make a normal array:
>
>     arr = np.array([1.0, 2.0, 7.0])
>
> and then do this:
>
>     arr.visible[2] = False
>
> then either I should raise an error (it's not a masked array), or, more magically, construct a mask on the fly. This somewhat breaks expectations though, because you might just have made a largish mask array without having any clue that that had happened.

Well, I'd expect an error to be raised when assigning a NA if the initial array is not NA friendly. The 'magical' creation of a mask would be nice, but is probably too magic and best left alone.

> > [...]
> > What about the reverse case? When you assign a regular value to an np.NA/np.IGNORE item?
>
> Well, for the np.NA case, this is straightforward:
>
>     na_arr[2] = 3
>
> It's just assignment. For ``masked_arr[2] = 3`` - I don't know, I guess whatever we are used to. What do you think?

Ahah, that depends. With a = np.array([1., np.NA, 3.]), then a[1]=2. should raise an error, as Mark suggests: you can't "unmask" a missing value, you need to create a view of the initial array then "unmask". It's the equivalent of a hard mask. With a = np.array([1., np.IGNORE, 3.]), then a[1]=2. should give np.array([1.,2.,3.]) and a.mask=[False,False,False]. That's a soft mask. At least, that's how I see it... P.
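For readers who don't have the hard/soft mask distinction in mind, here is how the existing numpy.ma behaves today. This uses only the current `np.ma` API and says nothing about how the proposed np.NA or np.IGNORE would be implemented. One difference from the behaviour suggested above: numpy.ma does not raise on assignment to a hard-masked element, it silently ignores the write.

```python
import numpy as np

# Soft mask (the numpy.ma default): assigning a value unmasks the element.
soft = np.ma.array([1.0, 999.0, 3.0], mask=[False, True, False])
soft[1] = 2.0
print(soft.mask.tolist())   # [False, False, False]

# Hard mask: the assignment is ignored and the element stays masked.
hard = np.ma.array([1.0, 999.0, 3.0], mask=[False, True, False], hard_mask=True)
hard[1] = 2.0
print(hard.mask.tolist())   # [False, True, False]
print(hard.data[1])         # 999.0, the underlying value is untouched
```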

Matthew Brett

4:30 p.m.


Hi, On Thu, Jun 30, 2011 at 5:03 PM, Pierre GM <pgmdevlist@gmail.com> wrote:

> On Jun 30, 2011, at 5:38 PM, Matthew Brett wrote:
> > [...]
> > then either I should raise an error (it's not a masked array), or, more magically, construct a mask on the fly. This somewhat breaks expectations though, because you might just have made a largish mask array without having any clue that that had happened.
>
> Well, I'd expect an error to be raised when assigning a NA if the initial array is not NA friendly. The 'magical' creation of a mask would be nice, but is probably too magic and best left alone.

I agree :)

> > Well, for the np.NA case, this is straightforward:
> >
> >     na_arr[2] = 3
> >
> > It's just assignment. For ``masked_arr[2] = 3`` - I don't know, I guess whatever we are used to. What do you think?
>
> Ahah, that depends. With a = np.array([1., np.NA, 3.]), then a[1]=2. should raise an error, as Mark suggests: you can't "unmask" a missing value, you need to create a view of the initial array then "unmask". It's the equivalent of a hard mask.

In this alterNEP, the NAs and the masked values are completely different. So, if you do this:

    a = np.array([1., np.NA, 3.])

then you've unambiguously asked for an array that can handle floats and NAs, and that will be the NA[<f8] dtype by default. You didn't ask for a masked array, you asked for an array that can carry NAs. You can't unmask an NA, because an NA isn't a masked value, it's an NA. So, if you do:

    a[1] = 2

you just mean 'change the NA in position [1] to the value 2'. Simple as that.

> With a = np.array([1., np.IGNORE, 3.]), then a[1]=2. should give np.array([1.,2.,3.]) and a.mask=[False,False,False]. That's a soft mask.

Sounds reasonable to me... Cheers, Matthew
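The distinction drawn above (an NA is a value in the array, a mask is a separate flag over a value) can be pictured with plain Python lists. This is only a toy model under that assumption: `NA` is a hypothetical stand-in object and `visible` borrows the attribute name from the draft, neither is NumPy API.

```python
NA = object()  # hypothetical stand-in for the proposed np.NA

# NA model: the missing marker lives in the data slot itself.
na_data = [1.0, NA, 3.0]
na_data[1] = 2.0     # plain assignment; the NA is simply replaced

# Mask model: the data slot keeps its value and a separate flag hides it.
data = [1.0, 2.0, 3.0]
visible = [True, False, True]
visible[1] = True    # unmasking reveals the value stored underneath

print(na_data)                                  # [1.0, 2.0, 3.0]
print([d for d, v in zip(data, visible) if v])  # [1.0, 2.0, 3.0]
```

In the first model there is nothing to "unmask" because no value was ever stored behind the NA; in the second the value was there all along.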

Christopher Barker

July 2011

3:29 p.m.


Matthew Brett wrote:

> [...] should raise an error. On the other hand, if I make a normal array:
>
>     arr = np.array([1.0, 2.0, 7.0])
>
> and then do this:
>
>     arr.visible[2] = False
>
> then either I should raise an error (it's not a masked array), or, more magically, construct a mask on the fly.

maybe it's too much Magic, but it seems reasonable to me that for an array without a mask, arr.visible[i] is simply True for all values of i -- no need to create a mask to determine that.

does arr[i] = np.IGNORE auto-create a mask if there is not one there already? I think it should.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
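For comparison, the existing numpy.ma already works roughly the way Chris describes: an array built without a mask carries only the `nomask` placeholder, reading the mask does not allocate one, and assigning `np.ma.masked` creates a full mask on demand. The sketch below uses the current `np.ma` API only; treating the inverted mask as the draft's `visible` attribute is an assumption made for illustration.

```python
import numpy as np

arr = np.ma.array([1.0, 2.0, 7.0])
print(np.ma.getmask(arr) is np.ma.nomask)   # True: no mask array allocated yet

# Reading an all-False mask (i.e. "visible" everywhere) needs no stored mask.
visible = ~np.ma.getmaskarray(arr)
print(visible.tolist())                     # [True, True, True]

arr[2] = np.ma.masked                       # a full mask is created on assignment
print(arr.mask.tolist())                    # [False, False, True]
```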

Charles R Harris

June 2011

2:17 p.m.


On Thu, Jun 30, 2011 at 7:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:


I honestly don't see the problem here. The difference isn't between masked_values/missing_values, it is between masked arrays and masked views of unmasked arrays. I think the view concept is central to what is going on. It may not be what folks are used to, but it strikes me as a clarifying advance rather than a mixed up confusion. Admittedly, it depends on the numpy centric ability to have views, but views are a wonderful thing. Chuck

Dag Sverre Seljebotn

2:26 p.m.


On 06/30/2011 04:17 PM, Charles R Harris wrote:


> I honestly don't see the problem here. The difference isn't between masked_values/missing_values, it is between masked arrays and masked views of unmasked arrays. I think the view concept is central to what is going on. It may not be what folks are used to, but it strikes me as a clarifying advance rather than a mixed up confusion. Admittedly, it depends on the numpy centric ability to have views, but views are a wonderful thing.

So a) how do you propose that reductions behave, and b) what semantics do you propose for the []= operator? That would clarify why you don't see a problem.

Dag Sverre

Charles R Harris

2:27 p.m.


On Thu, Jun 30, 2011 at 8:17 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:


> I honestly don't see the problem here. The difference isn't between masked_values/missing_values, it is between masked arrays and masked views of unmasked arrays. I think the view concept is central to what is going on. It may not be what folks are used to, but it strikes me as a clarifying advance rather than a mixed up confusion. Admittedly, it depends on the numpy centric ability to have views, but views are a wonderful thing.

OK, I can see a problem in that currently the only way to unmask a value is by assignment of a valid value to the underlying data array, that is the missing data idea. For masked data, it might be convenient to have something that only affected the mask instead of having to take another view of the unmasked data and reconstructing the mask with some modifications. So that could maybe be done with a "soft" np.CLEAR that only worked on views of unmasked arrays. Chuck
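The two routes contrasted above (unmask by assigning a value, or touch only the mask) both exist in today's numpy.ma, which gives a feel for what a mask-only operation would do. `np.CLEAR` itself is only a suggestion in this thread and does not exist; the sketch below uses the current `np.ma` API and rebuilds the mask by hand, which is the step a `np.CLEAR` would presumably shortcut.

```python
import numpy as np

data = np.array([1.0, 2.0, 7.0])
view = np.ma.array(data, mask=[False, True, True])

# Route 1: assigning a valid value unmasks the element (soft mask).
view[1] = 2.5
print(view.mask.tolist())     # [False, False, True]

# Route 2: change only the mask; the stored value is left as it was.
new_mask = np.ma.getmaskarray(view).copy()
new_mask[2] = False
view.mask = new_mask
print(view.mask.tolist())     # [False, False, False]
print(float(view[2]))         # 7.0
```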

Nathaniel Smith

5:51 p.m.


On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/

This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

One thing I wonder about the design is whether having an np.MASKED/np.IGNORE value at all helps or hurts. (Occam tells us never to multiply entities without necessity! And it's a bit of an odd fit to the masking concept, since the whole idea is that masking is a property of the array, not the individual datums.)

Currently, I see the following uses for it:
-- As a return value when someone tries to scalar-index a masked value
-- As a placeholder to specify masked values when creating an array from a list (but not when assigning to an array later)
-- As a return value when using propmask=True
-- As something to display when printing a masked array

Another way of doing things would be:
-- Scalar-indexing a masked value returns an error, like trying to index past the end of an array. (Slicing etc. would still return a new masked array.)
-- Having some sort of placeholder does seem nice, but I'm not sure how often you need to type out a masked array. And I notice that numpy.ma does support this (like so: ma.array([1, ma.masked, 3])) but the examples in the docs never use it. The replacement idiom would be something like: my_data = np.array([1, 999, 3], masked=True); my_data.visible = (my_data != 999). So maybe just leave out the placeholder value, at least for version 1?
-- I don't really see the logic for supporting 'propmask' at all. AFAICT no-one has ever even considered this as a useful feature for numpy.ma, never mind implemented it?
-- When printing, the numpy.ma approach of using "--" seems much more readable to me than having "IGNORE" all over my screen.

So overall, making these changes would let us simplify the design. But maybe propmask is really critical for some use case, or there's some good reason to want to scalar-index missing values without getting an error?

-- Nathaniel
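To make the placeholder-free alternative above concrete, here is a minimal pure-Python sketch. It is not part of either proposal: the class name MaskedList and the choice of IndexError are assumptions made for illustration, and a plain list stands in for an array. It shows the three behaviours suggested above: the mask is set from a sentinel comparison rather than a placeholder value, scalar-indexing a masked element raises, and masked slots print as "--".

```python
class MaskedList:
    """Toy 1-D container: the mask belongs to the container, not to any element."""

    def __init__(self, data, visible=None):
        self.data = list(data)
        # visible[i] is True when element i takes part in computations.
        self.visible = [True] * len(self.data) if visible is None else list(visible)

    def __getitem__(self, index):
        # Assumption: a masked element behaves like an out-of-range index.
        if not self.visible[index]:
            raise IndexError(f"element {index} is masked")
        return self.data[index]

    def __repr__(self):
        # Print masked slots as "--" instead of a named placeholder.
        shown = [str(v) if ok else "--" for v, ok in zip(self.data, self.visible)]
        return "[" + ", ".join(shown) + "]"


raw = [1, 999, 3]  # 999 is a sentinel meaning "ignore this entry"
my_data = MaskedList(raw, visible=[v != 999 for v in raw])
print(my_data)     # [1, --, 3]
print(my_data[0])  # 1
```

Note that the underlying value 999 is still stored; only the visibility flag decides whether it can be read.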

Matthew Brett

5:58 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Thu, Jun 30, 2011 at 6:51 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

Thanks for doing that.

...

One thing I wonder about the design is whether having an np.MASKED/np.IGNORE value at all helps or hurts. (Occam tells us never to multiply entities without necessity! And it's a bit of an odd fit to the masking concept, since the whole idea is that masking is a property of the array, not the individual datums.)

Currently, I see the following uses for it:
-- As a return value when someone tries to scalar-index a masked value
-- As a placeholder to specify masked values when creating an array from a list (but not when assigning to an array later)
-- As a return value when using propmask=True
-- As something to display when printing a masked array

Another way of doing things would be:
-- Scalar-indexing a masked value returns an error, like trying to index past the end of an array. (Slicing etc. would still return a new masked array.)
-- Having some sort of placeholder does seem nice, but I'm not sure how often you need to type out a masked array. And I notice that numpy.ma does support this (like so: ma.array([1, ma.masked, 3])) but the examples in the docs never use it. The replacement idiom would be something like: my_data = np.array([1, 999, 3], masked=True); my_data.visible = (my_data != 999). So maybe just leave out the placeholder value, at least for version 1?
-- I don't really see the logic for supporting 'propmask' at all. AFAICT no-one has ever even considered this as a useful feature for numpy.ma, never mind implemented it?
-- When printing, the numpy.ma approach of using "--" seems much more readable to me than having "IGNORE" all over my screen.

So overall, making these changes would let us simplify the design. But maybe propmask is really critical for some use case, or there's some good reason to want to scalar-index missing values without getting an error?

I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs? See you, Matthew

Lluís

6:27 p.m.

Matthew Brett writes: [...]

...

I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs?

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

or

* Tell the other routine whether it should use np.NA or np.IGNORE *and* whether it should use "skipna" and/or "propmask".

To me, that's the whole point about a unified API:

* Avoid making array copies.
* Do not add more arguments to *all* routines (to tell them which kind of missing data they should produce, and which kind of missing data they should ignore/propagate).

Lluis

-- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth

Nathaniel Smith

6:49 p.m.

On Thu, Jun 30, 2011 at 11:27 AM, Lluís <xscript@gmx.net> wrote:

...

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

To help me understand, do you have an example in mind of a routine that would do that? I can't think of any cases where I had some original data that some routine wanted to throw out and replace with NAs; it just seems... weird. Maybe I'm missing something though... (I can imagine that it would make sense for what we're calling a masked array, where you have some routine which computes which values should be ignored for a particular purpose. But if it only makes sense for masked arrays then you can just write your routine to work with masked arrays only, and it doesn't matter how similar the masking and missing APIs are.) -- Nathaniel

Lluís

7:06 p.m.

Nathaniel Smith writes:

...

On Thu, Jun 30, 2011 at 11:27 AM, Lluís <xscript@gmx.net> wrote:

...
As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

...

To help me understand, do you have an example in mind of a routine that would do that? I can't think of any cases where I had some original data that some routine wanted to throw out and replace with NAs; it just seems... weird. Maybe I'm missing something though...

Well, I had some silly example on another thread: a function that computes the mean of all non-NA values, and assigns NA to all cells that are beyond a certain threshold of that mean value.

...

(I can imagine that it would make sense for what we're calling a masked array, where you have some routine which computes which values should be ignored for a particular purpose. But if it only makes sense for masked arrays then you can just write your routine to work with masked arrays only, and it doesn't matter how similar the masking and missing APIs are.)

The routine makes sense by itself as a beyond-mean detector. The routine must not care whether your NAs are transient or not (in your aNEP, whether you want it to assign np.NA or np.IGNORE, which must be indicated by the caller through yet another function argument).

Note that callers will not only have to indicate which "type" of missing data the callee should use (np.NA or np.IGNORE), but they also have to indicate whether np.NAs must be ignored (i.e., skipna=bool), as well as np.IGNORE (i.e., propmask=bool).

Of course it is doable, but adding 3 more arguments to *all* functions (including ufuncs, and higher-level functions) does not seem desirable to me.

Lluis

Matthew Brett

7:03 p.m.

Hi, On Thu, Jun 30, 2011 at 7:27 PM, Lluís <xscript@gmx.net> wrote:

...

Matthew Brett writes: [...]

...
I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs?

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

You have an array 'arr'. The array does support NAs, but it doesn't have a mask. You want to pass ``arr`` to another routine ``func``. You expect ``func`` to set NAs into the data but you don't want ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. You are saying the following:

"with the fused API, I can make ``arr`` be a masked array, and pass it into ``func``, and know that, when func sets elements of arr to NA, it will only modify the mask and not the underlying data in ``arr``."

It does seem to me this is a very obscure case. First, ``func`` is modifying the array but you want an unmodified array back. Second, you'll have to do some view trick to recover the not-NA case to arr, when it comes back.

It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least.

Best, Matthew
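The pattern described above, where the routine leaves the caller's data alone and hands back a view that carries its own mask, can be sketched in pure Python. This is only an illustration under stated assumptions: MaskedView, flag_far_from_mean and the hidden attribute are hypothetical names, lists stand in for arrays, and the input is assumed to contain no NAs so that the mean is simple to compute.

```python
class MaskedView:
    """Shares `data` with its owner and keeps the mask as separate state."""

    def __init__(self, data):
        self.data = data     # same list object as the caller's, never copied
        self.hidden = set()  # indices currently masked out

    def visible_values(self):
        return [v for i, v in enumerate(self.data) if i not in self.hidden]


def flag_far_from_mean(data, threshold):
    """Return a view hiding every value farther than `threshold` from the mean."""
    mean = sum(data) / len(data)
    view = MaskedView(data)
    for i, value in enumerate(data):
        if abs(value - mean) > threshold:
            view.hidden.add(i)  # only the mask changes, the data does not
    return view


arr = [1.0, 2.0, 3.0, 50.0]
view = flag_far_from_mean(arr, threshold=20.0)
print(view.visible_values())  # [1.0, 2.0, 3.0]
print(arr)                    # [1.0, 2.0, 3.0, 50.0]
```

Because the mask lives in the view, the caller can drop it (for example with view.hidden.clear()) and get the original values back without having made a copy beforehand.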

Lluís

8:01 p.m.

Matthew Brett writes:

...

Hi, On Thu, Jun 30, 2011 at 7:27 PM, Lluís <xscript@gmx.net> wrote:

...
Matthew Brett writes: [...]

...
I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs?

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

...

You have an array 'arr'. The array does support NAs, but it doesn't have a mask. You want to pass ``arr`` to another routine ``func``. You expect ``func`` to set NAs into the data but you don't want ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. You are saying the following:

...

"with the fused API, I can make ``arr`` be a masked array, and pass it into ``func``, and know that, when func sets elements of arr to NA, it will only modify the mask and not the underlying data in ``arr``."

Yes.

...

It does seem to me this is a very obscure case. First, ``func`` is modifying the array but you want an unmodified array back. Second, you'll have to do some view trick to recover the not-NA case to arr, when it comes back.

I know, the example is just silly and convoluted.

...

It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least.

I agree on this example. My only concern is on the API's ability to foresee as many future use-cases as possible, without impacting performance.

1) On one hand, we have that functions must be specially crafted to handle transient NA (i.e., create a masked array to store the output, which will be possibly optional, so it needs another function argument). And not everybody will foresee such usage, resulting in an inconsistent API w.r.t. np.NA vs np.IGNORE. We could alternatively see this as a knob to say, whenever you store np.NA, please use np.IGNORE. It all needs collaboration from the callee.

2) On the other hand, we have that it can all be controlled by the caller, who is really the only one that knows its needs. This, at the risk of confusing the user (I still believe the user should not be confused because the mask must be explicitly activated).

If you're telling me "2 is not necessary because functions written as 1 are few and clearly identified", then I'll just say I don't know.

Lluis

Matthew Brett

July 2011

12:02 a.m.

Hi, On Thu, Jun 30, 2011 at 9:01 PM, Lluís <xscript@gmx.net> wrote:

...

Matthew Brett writes:

...
Hi, On Thu, Jun 30, 2011 at 7:27 PM, Lluís <xscript@gmx.net> wrote:

...
Matthew Brett writes: [...]

...
I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs?

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

...
You have an array 'arr'. The array does support NAs, but it doesn't have a mask. You want to pass ``arr`` to another routine ``func``. You expect ``func`` to set NAs into the data but you don't want ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. You are saying the following:

...
"with the fused API, I can make ``arr`` be a masked array, and pass it into ``func``, and know that, when func sets elements of arr to NA, it will only modify the mask and not the underlying data in ``arr``."

Yes.

...
It does seem to me this is a very obscure case. First, ``func`` is modifying the array but you want an unmodified array back. Second, you'll have to do some view trick to recover the not-NA case to arr, when it comes back.

I know, the example is just silly and convoluted.

...
It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least.

I agree on this example. My only concern is on the API's ability to foresee as many future use-cases as possible, without impacting performance.

But, of course, there's a great danger in trying to cover every possible use-case. My argument is that the kind of cases that you describe are - I believe - very rare and are even a little difficult to make up. Is that fair?

To my mind, the separate NA and IGNORE API is easier to understand and explain. If that isn't true, please do say, and say why - because that point is key.

If it is true that the separate API is clearer, then the benefit in terms of power and extensibility has to be large, in order to go for the fused API.

Cheers, Matthew

Gary Strangman

12:38 a.m.

...

...
...
It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least.

I agree on this example. My only concern is on the API's ability to foresee as many future use-cases as possible, without impacting performance.

But, of course, there's a great danger in trying to cover every possible use-case.

My argument is that the kind of cases that you describe are - I believe - very rare and are even a little difficult to make up. Is that fair?

To my mind, the separate NA and IGNORE API is easier to understand and explain. If that isn't true, please do say, and say why - because that point is key.

If it is true that the separate API is clearer, then the benefit in terms of power and extensibility has to be large, in order to go for the fused API.

For what it's worth, I wholeheartedly agree with Matthew here. Being able to designate NA separately from IGNORE has tremendous conceptual clarity, at least for me. Not only are these completely separate mental constructs in my head, but they even arise from completely different sources: NAs arise from my subjects' whims, my experimental procedures, my research personnel, or bad equipment days, whereas IGNORE generally comes from me and my analysis or visualization needs. While I bet it's possible for an exceedingly clever person to fuse the two (I doubt my brain could pull that off), I fear that in the end I would have to go to the documentation every time in order to use either one.

Thus, I agree that fusing into a single API needs to have a very large benefit. I admit I haven't followed all steps here, but I sense there is indeed numpy-coder-level benefit to fusing. However I, like Matthew (I believe), don't see appreciable benefits at the user level, /plus/ the risk of user confusion ...

-best
Gary

Charles R Harris

1:10 a.m.

On Thu, Jun 30, 2011 at 6:02 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Thu, Jun 30, 2011 at 9:01 PM, Lluís <xscript@gmx.net> wrote:

...
Matthew Brett writes:

...
Hi, On Thu, Jun 30, 2011 at 7:27 PM, Lluís <xscript@gmx.net> wrote:

...
Matthew Brett writes: [...]

...
I'm afraid, like you, I'm a little lost in the world of masking, because I only need the NAs. I was trying to see if I could come up with an API that picked up some of the syntactic convenience of NAs, without conflating NAs with IGNOREs. I guess we need some feedback from the 'NA & IGNORE Share the API' (NISA?) proponents to get an idea of what we've missed. @Mark, @Chuck, guys - what have we lost here by separating the APIs?

As I tried to convey on my other mail, separating both will force you to either:

* Make a copy of the array before passing it to another routine (because the routine will assign np.NA but you still want the original data)

...
You have an array 'arr'. The array does support NAs, but it doesn't have a mask. You want to pass ``arr`` to another routine ``func``. You expect ``func`` to set NAs into the data but you don't want ``func`` to modify ``arr`` and you don't want to copy ``arr`` either. You are saying the following:

...
"with the fused API, I can make ``arr`` be a masked array, and pass it into ``func``, and know that, when func sets elements of arr to NA, it will only modify the mask and not the underlying data in ``arr``."

Yes.

...
It does seem to me this is a very obscure case. First, ``func`` is modifying the array but you want an unmodified array back. Second, you'll have to do some view trick to recover the not-NA case to arr, when it comes back.

I know, the example is just silly and convoluted.

...
It seems to me, that what ``func`` should do, if it wants you to be able to unmask the NAs, is to make a masked array view of ``arr``, and return that. And indeed the simplicity of the separated API immediately makes that clear - in my view at least.

I agree on this example. My only concern is on the API's ability to foresee as many future use-cases as possible, without impacting performance.

But, of course, there's a great danger in trying to cover every possible use-case.

My argument is that the kind of cases that you describe are - I believe - very rare and are even a little difficult to make up. Is that fair?

To my mind, the separate NA and IGNORE API is easier to understand and explain. If that isn't true, please do say, and say why - because that point is key.

I think the main problem is that they aren't separate, one takes place in a view of an unmasked array, the other starts with a masked array. These aren't 'different' in mechanism, they are just different in work flow. And I think they fit in well with the view idea.

...

If it is true that the separate API is clearer, then the benefit in terms of power and extensibility has to be large, in order to go for the fused API.

Chuck

Keith Goodman

1:36 a.m.

On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP. I only had a few minutes, so I only took it this far (1d arrays only):

    >> from nary import nary, NA, IGNORE
    >> arr = np.array([1,2,3,4,5,6])
    >> nar = nary(arr)
    >> nar
    1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000,
    >> nar[2] = NA
    >> nar
    1.0000, 2.0000, NA, 4.0000, 5.0000, 6.0000,
    >> nar[4] = IGNORE
    >> nar
    1.0000, 2.0000, NA, 4.0000, IGNORE, 6.0000,
    >> nar[4]
    IGNORE
    >> nar[3]
    4
    >> nar[2]
    NA

The gist is here: https://gist.github.com/1057686

It probably just needs an __add__ and a reducing function such as sum, but I'm out of time, or so my family tells me. Implementation? Yes, with masks.
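As a rough idea of what the missing reducing function could look like, here is a minimal pure-Python sketch of a sum that follows the defaults given in the alterNEP draft: NA propagates unless skipna is set, and IGNORE is skipped unless propmask is set. This is not the code from the gist above. The names toy_sum and _Sentinel are hypothetical, plain lists stand in for arrays, and letting NA take precedence when both NA and IGNORE would propagate is an assumption made only for this sketch.

```python
class _Sentinel:
    """Named singleton used as a placeholder value."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")          # the value is genuinely missing
IGNORE = _Sentinel("IGNORE")  # the value exists but is masked out


def toy_sum(values, skipna=False, propmask=False):
    """Sum a 1-D list using the alterNEP default propagation rules."""
    has_na = any(v is NA for v in values)
    has_ignore = any(v is IGNORE for v in values)
    if has_na and not skipna:
        return NA      # NA propagates by default
    if has_ignore and propmask:
        return IGNORE  # IGNORE propagates only on request
    # Assumption: whatever does not propagate is simply left out of the sum.
    return sum(v for v in values if v is not NA and v is not IGNORE)


print(toy_sum([1.0, 2.0, NA, 7.0]))                    # NA
print(toy_sum([1.0, 2.0, NA, 7.0], skipna=True))       # 10.0
print(toy_sum([1.0, 2.0, IGNORE, 7.0]))                # 10.0
print(toy_sum([1.0, 2.0, IGNORE, 7.0], propmask=True)) # IGNORE
```

The four calls mirror the np.sum examples in the Ufuncs section of the draft, with the two flags kept independent of each other.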

Matthew Brett

11:58 a.m.

Hi, On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...

On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this. I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing. @Mark @Chuck @anyone Do you see problems with the alterNEP proposal? If so, what are they? Do you agree that the alterNEP proposal is easier to understand? If not, can you explain why? What do you see as the important points of difference between the NEP and the alterNEP? @Pierre - what do you think? Best, Matthew

Mark Wiebe

2:09 p.m.

On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version:

https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

...
And I made a few changes:

https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

...
Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...

If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

...

Do you agree that the alterNEP proposal is easier to understand?

No.

If not, can you explain why?

...

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

...

What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

Thanks, -Mark

@Pierre - what do you think?

...

Best,

Matthew _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Matthew Brett

2:50 p.m.

Hi, On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version:

https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 And I made a few changes:

https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

...

Mainly: Reduced interoperability

Meaning?

...

more complex implementation (leading to more bugs),

OK - but the discussion did not seem to be about the complexity of the implementation, but about the API.

...

and an unclear theoretical model for the masked part of it.

What's unclear? Or even different?

...

...
Do you agree that the alterNEP proposal is easier to understand?

No.

Do you agree that there are several people on the list who do think that the alterNEP proposal is easier to understand?

...

...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I can't see any reference to the alterNEP or the idea of the separate API in the NEP. Can you point me to it?

...

...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

Lluis gave a particular somewhat obscure case where it is convenient that the NA and IGNORE are the same. Are there any others? It seems to me the API you propose is a classic example of implicit rather than explicit, and that it would be very easy, at this stage, to fix that.

...

The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

OK - unless you tell me differently I'll take that as 'the discussion of the separate API for NA and IGNORE is over as far as I am concerned'. I would say, for future reference, that if there is a substantial and reasonable discussion of the API, that is not well resolved, then it does harm to go ahead and implement regardless. Specifically, it demoralizes those of us who put energy into trying to have a substantial reasoned discussion. I think that's bad for the list and bad for the community. See you, Matthew

Mark Wiebe

3:34 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version:

https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

...
And I made a few changes:

https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

...
Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation? I made a pull request implementing a non-controversial part of the NEP to get started, and I've not seen any feedback on it except from Chuck and Derek. (Many thanks to Chuck and Derek!) Implementation and design are tied together in a feedback loop, and separate designs that aren't informed by the implementation details, for example information gained by going through the proposed code changes and reviewing them, are counterproductive. I appreciate the effort you're putting in, and I've been trying to guide you towards a more holistic path of contribution by pointing out the pull request.

...

Mainly: Reduced interoperability

Meaning?

You can't switch between the two approaches without big changes in your code.

...

...
more complex implementation (leading to more bugs),

OK - but the discussion did not seem to be about the complexity of the implementation, but about the API.

The implementation always plays a role in the design of anything. Making an API design abstractly, then testing it against implementation constraints is good, making an API completely divorced from considerations of implementation is really really bad.

...

...
and an unclear theoretical model for the masked part of it

What's unclear? Or even different?

After thinking about the missing data model some more, I've come up with more rationale for why the R approach is good, and adopting both the R default and skipna option is appropriate. It's in the pull request up for code review.

...

...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

Do you agree that there are several people on the list who do think that the alterNEP proposal is easier to understand?

Feedback on the clarity of my writing in the NEP is welcome, if something is unclear to someone, please point out the specific part so I can continue to improve it. I don't think the clarity of the writing is a good reason for choosing one design or another, the quality of the design is what should decide that.

...

...
...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I can't see any reference to the alterNEP or the idea of the separate API in the NEP. Can you point me to it?

I'm referring to positive arguments for why the design decisions are as they are. I don't see the alterNEP referencing specific things that are wrong with the NEP either, it just assumes sharing the API is a bad idea without making clearly stated arguments for or against it.

...

...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

Lluis gave a particular somewhat obscure case where it is convenient that the NA and IGNORE are the same. Are there any others? It seems to me the API you propose is a classic example of implicit rather than explicit, and that it would be very easy, at this stage, to fix that.

And I came up with a nice way to deal with this situation through a subclass of ndarray changing the default 'skipna=' parameter value. The "implicit vs explicit" quote is overused, but even so I've applied the idea very carefully. In the NEP, you never get missing value support unless you explicitly request it.
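The message does not spell out how such a subclass would work, so the following is only one possible reading, sketched in pure Python with a stand-in class instead of a real ``ndarray``. `ToyArray`, `SkippingArray` and the `default_skipna` attribute are hypothetical names, and `None` plays the role of NA. The idea shown is that a reduction reads its ``skipna`` default from the class, so a subclass can flip the default while callers remain free to override it explicitly.

```python
class ToyArray:
    """Pure-Python stand-in for an array with missing-value support."""

    # Class-level default: missing values propagate, as in R.
    default_skipna = False

    def __init__(self, values):
        self.values = list(values)

    def sum(self, skipna=None):
        """Sum the values; None marks a missing entry."""
        if skipna is None:
            # No explicit request, so fall back to the class default.
            skipna = self.default_skipna
        present = [v for v in self.values if v is not None]
        has_missing = len(present) < len(self.values)
        if has_missing and not skipna:
            return None  # the missing value propagates
        return sum(present)


class SkippingArray(ToyArray):
    """Same data model, but reductions skip missing values by default."""

    default_skipna = True


data = [1.0, 2.0, None, 7.0]
print(ToyArray(data).sum())                   # None
print(SkippingArray(data).sum())              # 10.0
print(SkippingArray(data).sum(skipna=False))  # None
```

Under this reading the two behaviours share one API and differ only in a default, which is the kind of switch the reply describes.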

...

The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

OK - unless you tell me differently I'll take that as 'the discussion of the separate API for NA and IGNORE is over as far as I am concerned'.

Yes, because I'm not seeing arguments responding with specific examples or use cases showing why a separate API is better, in particular which deal with the arguments I've given indicating why sharing the API is useful.

...

I would say, for future reference, that if there is a substantial and reasonable discussion of the API, that is not well resolved, then it does harm to go ahead and implement regardless. Specifically, it demoralizes those of us who put energy into trying to have a substantial reasoned discussion. I think that's bad for the list and bad for the community.

You might have consideration for morale of those who are putting substantial effort into designing and implementing it as well. The ecosystem is not just this mailing list, it also is the code and documentation review process on github, and when people who only participate on the mailing list are tearing apart carefully constructed designs based in part on some mischaracterizations of those designs, then expecting to be corrected each time instead of studying the proposed design to understand and compare it to their competing ideas, it's harder and harder to keep responding with corrections. I appreciate your feedback, the design for the NA bit pattern approach that is in the NEP is inspired by your feedback for wanting that style of NA functionality. Thanks, Mark

...

See you,

Matthew

Charles R Harris

3:48 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 9:34 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/ This is your initial version:

https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

...
And I made a few changes:

https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

...
Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation? I made a pull request implementing a non-controversial part of the NEP to get started, and I've not seen any feedback on it except from Chuck and Derek. (Many thanks to Chuck and Derek!) Implementation and design are tied together in a feedback loop, and separate designs that aren't informed by the implementation details, for example information gained by going through the proposed code changes and reviewing them, are counterproductive. I appreciate the effort you're putting in, and I've been trying to guide you towards a more holistic path of contribution by pointing out the pull request.

...
Mainly: Reduced interoperability

Meaning?

You can't switch between the two approaches without big changes in your code.

...
...
more complex implementation (leading to more bugs),

OK - but the discussion did not seem to be about the complexity of the implementation, but about the API.

The implementation always plays a role in the design of anything. Making an API design abstractly, then testing it against implementation constraints is good, making an API completely divorced from considerations of implementation is really really bad.

...
...
and an unclear theoretical model for the masked part of it

What's unclear? Or even different?

After thinking about the missing data model some more, I've come up with more rationale for why the R approach is good, and adopting both the R default and skipna option is appropriate. It's in the pull request up for code review.

...
...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

Do you agree that there are several people on the list who do think that the alterNEP proposal is easier to understand?

Feedback on the clarity of my writing in the NEP is welcome, if something is unclear to someone, please point out the specific part so I can continue to improve it. I don't think the clarity of the writing is a good reason for choosing one design or another, the quality of the design is what should decide that.

...
...
...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I can't see any reference to the alterNEP or the idea of the separate API in the NEP. Can you point me to it?

I'm referring to positive arguments for why the design decisions are as they are. I don't see the alterNEP referencing specific things that are wrong with the NEP either, it just assumes sharing the API is a bad idea without making clearly stated arguments for or against it.

...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

Lluis gave a particular somewhat obscure case where it is convenient that the NA and IGNORE are the same. Are there any others? It seems to me the API you propose is a classic example of implicit rather than explicit, and that it would be very easy, at this stage, to fix that.

And I came up with a nice way to deal with this situation through a subclass of ndarray changing the default 'skipna=' parameter value. The "implicit vs explicit" quote is overused, but even so I've applied the idea very carefully. In the NEP, you never get missing value support unless you explicitly request it.

...
...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

OK - unless you tell me differently I'll take that as 'the discussion of the separate API for NA and IGNORE is over as far as I am concerned'.

Yes, because I'm not seeing arguments responding with specific examples or use cases showing why a separate API is better, in particular which deal with the arguments I've given indicating why sharing the API is useful.

...
I would say, for future reference, that if there is a substantial and reasonable discussion of the API, that is not well resolved, then it does harm to go ahead and implement regardless. Specifically, it demoralizes those of us who put energy into trying to have a substantial reasoned discussion. I think that's bad for the list and bad for the community.

You might have consideration for morale of those who are putting substantial effort into designing and implementing it as well. The ecosystem is not just this mailing list, it also is the code and documentation review process on github, and when people who only participate on the mailing list are tearing apart carefully constructed designs based in part on some mischaracterizations of those designs, then expecting to be corrected each time instead of studying the proposed design to understand and compare it to their competing ideas, it's harder and harder to keep responding with corrections.

I appreciate your feedback, the design for the NA bit pattern approach that is in the NEP is inspired by your feedback for wanting that style of NA functionality.

Speaking for myself, at this point I'd rather have Mark writing code than getting sucked into a long thread about alternative designs. I think the point about getting more involved with the implementation review process is a good one. When we have a prototype to play with, then we can see if it is adequate to the needs of the various users and at that point feedback is essential. I expect Mark will be begging for people to try out the code at that point, both to find bugs and to improve the API. I hope you all rise to the occasion. Chuck

Matthew Brett

4:08 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Fri, Jul 1, 2011 at 4:48 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 9:34 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/

This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation? I made a pull request implementing a non-controversial part of the NEP to get started, and I've not seen any feedback on it except from Chuck and Derek. (Many thanks to Chuck and Derek!) Implementation and design are tied together in a feedback loop, and separate designs that aren't informed by the implementation details, for example information gained by going through the proposed code changes and reviewing them, are counterproductive. I appreciate the effort you're putting in, and I've been trying to guide you towards a more holistic path of contribution by pointing out the pull request.

...
...
Mainly: Reduced interoperability

Meaning?

You can't switch between the two approaches without big changes in your code.

...
...
more complex implementation (leading to more bugs),

OK - but the discussion did not seem to be about the complexity of the implementation, but about the API.

The implementation always plays a role in the design of anything. Making an API design abstractly, then testing it against implementation constraints is good, making an API completely divorced from considerations of implementation is really really bad.

...
...
and an unclear theoretical model for the masked part of it

What's unclear? Or even different?

After thinking about the missing data model some more, I've come up with more rationale for why the R approach is good, and adopting both the R default and skipna option is appropriate. It's in the pull request up for code review.

...
...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

Do you agree that there are several people on the list who do think that the alterNEP proposal is easier to understand?

Feedback on the clarity of my writing in the NEP is welcome, if something is unclear to someone, please point out the specific part so I can continue to improve it. I don't think the clarity of the writing is a good reason for choosing one design or another, the quality of the design is what should decide that.

...
...
...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I can't see any reference to the alterNEP or the idea of the separate API in the NEP. Can you point me to it?

I'm referring to positive arguments for why the design decisions are as they are. I don't see the alterNEP referencing specific things that are wrong with the NEP either, it just assumes sharing the API is a bad idea without making clearly stated arguments for or against it.

...
...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

Lluis gave a particular somewhat obscure case where it is convenient that the NA and IGNORE are the same. Are there any others? It seems to me the API you propose is a classic example of implicit rather than explicit, and that it would be very easy, at this stage, to fix that.

And I came up with a nice way to deal with this situation through a subclass of ndarray changing the default 'skipna=' parameter value. The "implicit vs explicit" quote is overused, but even so I've applied the idea very carefully. In the NEP, you never get missing value support unless you explicitly request it.

...
...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

OK - unless you tell me differently I'll take that as 'the discussion of the separate API for NA and IGNORE is over as far as I am concerned'.

Yes, because I'm not seeing arguments responding with specific examples or use cases showing why a separate API is better, in particular which deal with the arguments I've given indicating why sharing the API is useful.

...
I would say, for future reference, that if there is a substantial and reasonable discussion of the API, that is not well resolved, then it does harm to go ahead and implement regardless. Specifically, it demoralizes those of us who put energy into trying to have a substantial reasoned discussion. I think that's bad for the list and bad for the community.

You might have consideration for morale of those who are putting substantial effort into designing and implementing it as well. The ecosystem is not just this mailing list, it also is the code and documentation review process on github, and when people who only participate on the mailing list are tearing apart carefully constructed designs based in part on some mischaracterizations of those designs, then expecting to be corrected each time instead of studying the proposed design to understand and compare it to their competing ideas, it's harder and harder to keep responding with corrections. I appreciate your feedback, the design for the NA bit pattern approach that is in the NEP is inspired by your feedback for wanting that style of NA functionality.

Speaking for myself, at this point I'd rather have Mark writing code than getting sucked into a long thread about alternative designs. I think the point about getting more involved with the implementation review process is a good one. When we have a prototype to play with, then we can see if it is adequate to the needs of the various users and at that point feedback is essential. I expect Mark will be begging for people to try out the code at that point, both to find bugs and to improve the API. I hope you all rise to the occasion.

Continuing a discussion that started off-list - I would humbly ask that we avoid the more corporate 'get behind the team' mentality here. It's not the open-source way, and no-one enjoys it. I don't think anyone is in any doubt that Mark's work has the potential to be extremely useful and important here. That makes it all the more important that we discuss it fully without recourse to 'too much talking not enough typing'. Discussion is the root of good decisions [1]. We should value discussion and place it highly on our priorities. We all of us write real code here, and know when a draft is the next best step. Some of us believe we did not get there in this case. Best, Matthew [1] http://en.wikipedia.org/wiki/Alan_Brooke,_1st_Viscount_Alanbrooke#Relationsh...

Matthew Brett

4 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Fri, Jul 1, 2011 at 4:34 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood: [...]

Thanks for writing this up! I stuck it up as a gist so we can edit it more easily: https://gist.github.com/1056379/

This is your initial version: https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191

And I made a few changes: https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583

Specifically, I added a rationale section, changed np.MASKED to np.IGNORE (as per comments in this thread), and added a vowel to "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation?

No, but, this is an open source project. Hence it matters not only what gets written but how the decisions are made and the quality of the discussion. Here what I see is that you lost interest in the discussion some time ago and stopped responding in any specific way. This unfortunately conveys a lack of interest in our views. That might not be true, in which case I'm sure you can convey the opposite with some substantial discussion now. Or it might be for good reason, heaven knows I've been wrong enough times. But the community cost is high for the sake of an extra few days implementation time. Frankly I think the API will also suffer, but I'm less certain about that.

...

I made a pull request implementing a non-controversial part of the NEP to get started, and I've not seen any feedback on it except from Chuck and Derek. (Many thanks to Chuck and Derek!) Implementation and design are tied together in a feedback loop, and separate designs that aren't informed by the implementation details, for example information gained by going through the proposed code changes and reviewing them, are counterproductive. I appreciate the effort you're putting in, and I've been trying to guide you towards a more holistic path of contribution by pointing out the pull request.

Holistic? You surely accept that code review is not the mechanism for high-level API decisions?

...

...
...
Mainly: Reduced interoperability

Meaning?

You can't switch between the two approaches without big changes in your code.

Lluis provided a case, and it was obscure. That switch seems like a rare or non-existent use-case that should not guide the API.

...

...
...
more complex implementation (leading to more bugs),

OK - but the discussion did not seem to be about the complexity of the implementation, but about the API.

The implementation always plays a role in the design of anything. Making an API design abstractly, then testing it against implementation constraints is good, making an API completely divorced from considerations of implementation is really really bad.

Making major API decisions on the basis of implementation ease is also bad because it leads to a bad API and a bad API leads to confusion, and makes people use the feature less. You spent considerable energy trying to persuade us that we should not worry about the implementation, and that it was a detail. Now you are telling us that you chose the API for the implementation. All that is fine, but it is not fine to imply that the discussion of the API is a waste of your time.

...

...
...
and an unclear theoretical model for the masked part of it.

What's unclear? Or even different?

After thinking about the missing data model some more, I've come up with more rationale for why the R approach is good, and adopting both the R default and skipna option is appropriate. It's in the pull request up for code review.

...
...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

Do you agree that there are several people on the list who do think that the alterNEP proposal is easier to understand?

Feedback on the clarity of my writing in the NEP is welcome, if something is unclear to someone, please point out the specific part so I can continue to improve it. I don't think the clarity of the writing is a good reason for choosing one design or another, the quality of the design is what should decide that.

It's difficult for me not to feel you are deliberately misunderstanding me here. I don't mean the writing, I mean the API.

...

...
...
...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I can't see any reference to the alterNEP or the idea of the separate API in the NEP. Can you point me to it?

I'm referring to positive arguments for why the design decisions are as they are. I don't see the alterNEP referencing specific things that are wrong with the NEP either, it just assumes sharing the API is a bad idea without making clearly stated arguments for or against it.

We've made that argument many times - that the masking use-case and the missing data use-case are separate, and imply different ufunc semantics, and different assignment semantics. You'll see the two ideas set out at the top of the aNEP, and Nathaniel has stated them clearly in his emails.

...

...
...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate. The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

Lluis gave a particular somewhat obscure case where it is convenient that the NA and IGNORE are the same. Are there any others? It seems to me the API you propose is a classic example of implicit rather than explicit, and that it would be very easy, at this stage, to fix that.

And I came up with a nice way to deal with this situation through a subclass of ndarray changing the default 'skipna=' parameter value. The "implicit vs explicit" quote is overused, but even so I've applied the idea very carefully. In the NEP, you never get missing value support unless you explicitly request it.

Explicit about NA rather than IGNORE

...

...
...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

OK - unless you tell me differently I'll take that as 'the discussion of the separate API for NA and IGNORE is over as far as I am concerned'.

Yes, because I'm not seeing arguments responding with specific examples or use cases showing why a separate API is better, in particular which deal with the arguments I've given indicating why sharing the API is useful.

What are those arguments? Are they really restricted to Lluis' case?

...

...
I would say, for future reference, that if there is a substantial and reasonable discussion of the API, that is not well resolved, then it does harm to go ahead and implement regardless. Specifically, it demoralizes those of us who put energy into trying to have a substantial reasoned discussion. I think that's bad for the list and bad for the community.

You might have consideration for morale of those who are putting substantial effort into designing and implementing it as well. The ecosystem is not just this mailing list, it also is the code and documentation review process on github, and when people who only participate on the mailing list are tearing apart carefully constructed designs based in part on some mischaracterizations of those designs,

What are the mischaracterizations?

...

then expecting to be corrected each time instead of studying the proposed design to understand and compare it to their competing ideas, it's harder and harder to keep responding with corrections.

In what sense have we failed to compare your design to ours? Are you really saying that our proposal was a poorly done piece of work and hence not worth delaying for?

...

I appreciate your feedback, the design for the NA bit pattern approach that is in the NEP is inspired by your feedback for wanting that style of NA functionality.

I'm glad it was useful, and sorry it was not more useful. Best, Matthew

Charles R Harris

4:15 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 10:00 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

...
On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 4:34 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> wrote: > On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett > <matthew.brett@gmail.com> wrote: >> In the interest of making the discussion as concrete as possible, >> here >> is my draft of an alternative proposal for NAs and masking, based >> on >> Nathaniel's comments. Writing it, it seemed to me that Nathaniel >> is >> right, that the ideas become much clearer when the NA idea and >> the >> MASK idea are separate. Please do pitch in for things I may have >> missed or misunderstood: > [...] > > Thanks for writing this up! I stuck it up as a gist so we can edit > it > more easily: > https://gist.github.com/1056379/ > This is your initial version: > > > https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 > And I made a few changes: > > > https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 > Specifically, I added a rationale section, changed np.MASKED to > np.IGNORE (as per comments in this thread), and added a vowel to > "propmsk".

It might be helpful to make a small toy class in python so that people can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation?

No, but, this is an open source project. Hence it matters not only what gets written but how the decisions are made and the quality of the discussion. Here what I see is that you lost interest in the discussion some time ago and stopped responding in any specific way. This unfortunately conveys a lack of interest in our views. That might not be true, in which case I'm sure you can convey the opposite with some substantial discussion now. Or it might be for good reason, heaven knows I've been wrong enough times. But the community cost is high for the sake of an extra few days implementation time. Frankly I think the API will also suffer, but I'm less certain about that.

What open source has trouble with isn't discussion, it's attracting active and competent developers. You should treat them as gifts from the $deity when they show up. If they are open and responsive to discussion, and I think Mark is, so much the better. Mind, you don't need to bow down and kiss their feet, but you should at least take the time to understand what they are doing so your criticisms and feedback are informed. Chuck

Matthew Brett

4:18 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Fri, Jul 1, 2011 at 5:15 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 10:00 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 4:34 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 9:50 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 3:09 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Fri, Jul 1, 2011 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote: > On Thu, Jun 30, 2011 at 10:51 AM, Nathaniel Smith <njs@pobox.com> > wrote: >> On Thu, Jun 30, 2011 at 6:31 AM, Matthew Brett >> <matthew.brett@gmail.com> wrote: >>> In the interest of making the discussion as concrete as >>> possible, >>> here >>> is my draft of an alternative proposal for NAs and masking, >>> based >>> on >>> Nathaniel's comments. Writing it, it seemed to me that >>> Nathaniel >>> is >>> right, that the ideas become much clearer when the NA idea and >>> the >>> MASK idea are separate. Please do pitch in for things I may >>> have >>> missed or misunderstood: >> [...] >> >> Thanks for writing this up! I stuck it up as a gist so we can >> edit >> it >> more easily: >> https://gist.github.com/1056379/ >> This is your initial version: >> >> >> >> https://gist.github.com/1056379/c809715f4e9765db72908c605468304ea1eb2191 >> And I made a few changes: >> >> >> >> https://gist.github.com/1056379/33ba20300e1b72156c8fb655bd1ceef03f8a6583 >> Specifically, I added a rationale section, changed np.MASKED to >> np.IGNORE (as per comments in this thread), and added a vowel to >> "propmsk". > > It might be helpful to make a small toy class in python so that > people > can play around with NA and IGNORE from the alterNEP.

Thanks for doing this.

I don't know about you, but I don't know where to work on the discussion or draft implementation, because I am not sure where the disagreement is. Lluis has helpfully pointed out a specific case of interest. Pierre has fed back with some points of clarification. However, other than that, I'm not sure what we should be discussing.

@Mark @Chuck @anyone

Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

Ah - I think what you are saying is - too late I've started writing it.

Do you want me to spend my whole summer designing something before starting the implementation?

No, but, this is an open source project. Hence it matters not only what gets written but how the decisions are made and the quality of the discussion. Here what I see is that you lost interest in the discussion some time ago and stopped responding in any specific way. This unfortunately conveys a lack of interest in our views. That might not be true, in which case I'm sure you can convey the opposite with some substantial discussion now. Or it might be for good reason, heaven knows I've been wrong enough times. But the community cost is high for the sake of an extra few days implementation time. Frankly I think the API will also suffer, but I'm less certain about that.

What open source has trouble with isn't discussion, it's attracting active and competent developers. You should treat them as gifts from the $deity when they show up. If they are open and responsive to discussion, and I think Mark is, so much the better. Mind, you don't need to bow down and kiss their feet, but you should at least take the time to understand what they are doing so your criticisms and feedback are informed.

Are you now going to explain why you believe our criticisms and feedback are not well informed? See you, Matthew

Benjamin Root

4:17 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 11:00 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

...
You can't switch between the two approaches without big changes in your code.

...
Lluis provided a case, and it was obscure. That switch seems like a rare or non-existent use-case that should not guide the API.

Just to respond to this specific issue. In matplotlib, there are often constructs like the following: plot_something(X, Y, V)

...

From a module perspective, we have no clue about the nature of the input data. We often have to do things like np.asanyarray, np.atleast_2d and such to establish some base-level assumptions about the input data. Numpy currently makes this fairly cheap by not performing a copy if it is not needed. So far, so good.

Next, some plotting functions need to broadcast the arrays together (again, numpy makes that fairly cheap).

Then, we need to figure out the common elements to plot. With something simple like plot(), this is straight-forward or-ing of any masks. Of course, right now, this is not cheap because we can't assume that the array supports masking semantics. This is where we either cast the arrays as masked arrays, or perform our own masking semantics. But, essentially, a point that was masked in X, may not be masked in Y and/or V, and we can not change the original data (or else we would be a bad tool).

For more complicated functions like pcolor() and contour(), each array needs to know the status of its own neighboring points and of those in the other arrays. Again, either we use numpy.ma to share a common mask across the data arrays, or we implement our own semantics to deal with this. And again, we can not change any of the original data.

This is not an obscure case. This is existing code in matplotlib. I will be evaluating the current missingdata branch later today to assess its suitability for use in matplotlib.

Ben Root
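Editorial sketch of the shared-mask step described above, using only numpy.ma as it exists today. The helper name with_shared_mask is hypothetical and is not matplotlib code; the sketch only assumes that inputs are plain arrays or numpy.ma masked arrays. It broadcasts the inputs, ORs their masks together, and returns new masked arrays that all carry the combined mask, without modifying the caller's arrays.

```python
import numpy as np


def with_shared_mask(*arrays):
    """Return broadcast masked arrays that all carry one combined mask.

    Hypothetical helper for illustration only; not taken from matplotlib.
    """
    # asanyarray keeps MaskedArray inputs as they are and avoids copies.
    arrays = [np.asanyarray(a) for a in arrays]
    # getmaskarray gives a full boolean array, all False for plain ndarrays.
    masks = np.broadcast_arrays(*[np.ma.getmaskarray(a) for a in arrays])
    # A point is hidden in the result if it is hidden in any input.
    shared = np.logical_or.reduce(masks)
    # getdata strips the mask; broadcasting produces views, not copies.
    datas = np.broadcast_arrays(*[np.ma.getdata(a) for a in arrays])
    # New MaskedArray objects are created; the inputs are left untouched.
    return [np.ma.array(d, mask=shared) for d in datas]


X = np.ma.array([1.0, 2.0, 3.0], mask=[True, False, False])
Y = np.array([1.0, 2.0, 3.0])
V = np.ma.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                mask=[[False, False, True], [False, False, False]])
Xs, Ys, Vs = with_shared_mask(X, Y, V)
```

Here Xs, Ys and Vs all have shape (2, 3) and the same mask, while X, Y and V keep their original shapes and masks.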

Matthew Brett

4:20 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Fri, Jul 1, 2011 at 5:17 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Fri, Jul 1, 2011 at 11:00 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
...
You can't switch between the two approaches without big changes in your code.

...
Lluis provided a case, and it was obscure. That switch seems like a rare or non-existent use-case that should not guide the API.

Just to respond to this specific issue.

In matplotlib, there are often constructs like the following:

plot_something(X, Y, V)

From a module perspective, we have no clue about the nature of the input data. We often have to do things like np.asanyarray, np.atleast_2d and such to establish some base-level assumptions about the input data. Numpy currently makes this fairly cheap by not performing a copy if it is not needed. So far, so good.

Next, some plotting functions need to broadcast the arrays together (again, numpy makes that fairly cheap).

Then, we need to figure out the common elements to plot. With something simple like plot(), this is straight-forward or-ing of any masks. Of course, right now, this is not cheap because we can't assume that the array supports masking semantics. This is where we either cast the arrays as masked arrays, or perform our own masking semantics. But, essentially, a point that was masked in X, may not be masked in Y and/or V, and we can not change the original data (or else we would be a bad tool).

For more complicated functions like pcolor() and contour(), each array needs to know the status of its own neighboring points and of those in the other arrays. Again, either we use numpy.ma to share a common mask across the data arrays, or we implement our own semantics to deal with this. And again, we can not change any of the original data.

This is not an obscure case. This is existing code in matplotlib. I will be evaluating the current missingdata branch later today to assess its suitability for use in matplotlib.

I think I missed why your case needs NA and IGNORE to use the same API. Why can't you just use masks and IGNORE here? Best, Matthew

Benjamin Root

4:29 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 11:20 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Fri, Jul 1, 2011 at 5:17 PM, Benjamin Root <ben.root@ou.edu> wrote:

...
On Fri, Jul 1, 2011 at 11:00 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
...
You can't switch between the two approaches without big changes in your code.

...
Lluis provided a case, and it was obscure. That switch seems like a rare or non-existent use-case that should not guide the API.

Just to respond to this specific issue.

In matplotlib, there are often constructs like the following:

plot_something(X, Y, V)

From a module perspective, we have no clue about the nature of the input data. We often have to do things like np.asanyarray, np.atleast_2d and such to establish some base-level assumptions about the input data. Numpy currently makes this fairly cheap by not performing a copy if it is not needed. So far, so good.

Next, some plotting functions need to broadcast the arrays together (again, numpy makes that fairly cheap).

Then, we need to figure out the common elements to plot. With something simple like plot(), this is straight-forward or-ing of any masks. Of course, right now, this is not cheap because we can't assume that the array supports masking semantics. This is where we either cast the arrays as masked arrays, or perform our own masking semantics. But, essentially, a point that was masked in X, may not be masked in Y and/or V, and we can not change the original data (or else we would be a bad tool).

For more complicated functions like pcolor() and contour(), each array needs to know the status of its own neighboring points and of those in the other arrays. Again, either we use numpy.ma to share a common mask across the data arrays, or we implement our own semantics to deal with this. And again, we can not change any of the original data.

This is not an obscure case. This is existing code in matplotlib. I will be evaluating the current missingdata branch later today to assess its suitability for use in matplotlib.

I think I missed why your case needs NA and IGNORE to use the same API. Why can't you just use masks and IGNORE here?

Best,

Matthew

The point is that matplotlib can not make assumptions about the nature of the input data. From matplotlib's perspective, NA's and IGNORE's are the same thing and should be treated the same way (i.e. - skipped). Right now, matplotlib's code is messy and inconsistent with its treatment of masked arrays and NaNs (some functions treat them the same, some only apply to NaNs and vice versa). This is because of code cruft over the years. If we had one interface to rule them all, we can bring *all* plotting functions to have similar handling code and be more consistent across the board. However, I think Mark's NEP provides a good way to distinguish between the cases when needed (but I have not examined it from that perspective yet). Ben Root
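Editorial sketch of what "one interface" can look like with current NumPy, where NaN stands in for a missing value and numpy.ma provides the mask. The helper name as_skippable is hypothetical. It relies on np.ma.masked_invalid, which masks NaN and inf entries and keeps any mask the input already has, so both kinds of input end up being skipped the same way.

```python
import numpy as np


def as_skippable(a):
    """Return a masked array in which NaN, inf and already-masked points
    are all masked. Hypothetical helper for illustration only."""
    # masked_invalid copies by default, so the caller's array is not changed.
    return np.ma.masked_invalid(np.asanyarray(a))


with_nan = np.array([1.0, np.nan, 3.0])
with_mask = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Both inputs now behave the same way in a reduction: the middle point
# is skipped regardless of how it was marked.
sum_nan = as_skippable(with_nan).sum()
sum_mask = as_skippable(with_mask).sum()
```

This does not distinguish between the two cases afterwards, which matches the "treated the same way" requirement stated above but not the cases where the distinction matters.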

Nathaniel Smith

4:03 a.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 9:29 AM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Fri, Jul 1, 2011 at 11:20 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 5:17 PM, Benjamin Root <ben.root@ou.edu> wrote:

...
For more complicated functions like pcolor() and contour(), each array needs to know the status of its own neighboring points and of those in the other arrays. Again, either we use numpy.ma to share a common mask across the data arrays, or we implement our own semantics to deal with this. And again, we can not change any of the original data.

This is not an obscure case. This is existing code in matplotlib. I will be evaluating the current missingdata branch later today to assess its suitability for use in matplotlib.

I think I missed why your case needs NA and IGNORE to use the same API. Why can't you just use masks and IGNORE here?

The point is that matplotlib can not make assumptions about the nature of the input data. From matplotlib's perspective, NA's and IGNORE's are the same thing and should be treated the same way (i.e. - skipped). Right now, matplotlib's code is messy and inconsistent with its treatment of masked arrays and NaNs (some functions treat them the same, some only apply to NaNs and vice versa). This is because of code cruft over the years. If we had one interface to rule them all, we can bring *all* plotting functions to have similar handling code and be more consistent across the board.

Maybe I'm missing something, but it seems like no matter how the NA handling thing plays out, what you need is something like

    # For current numpy:
    def usable_points(a):
        a = np.asanyarray(a)
        usable = ~np.isnan(a)
        usable &= ~np.isinf(a)
        if isinstance(a, np.ma.masked_array):
            usable &= ~a.mask
        return usable

    def all_usable(a, *rest):
        usable = usable_points(a)
        for other in rest:
            usable &= usable_points(other)
        return usable

And then you need to call all_usable from each of your plotting functions and away you go, yes?

AFAICT, under the NEP proposal, in usable_points() you need to add a line like:

    usable &= ~np.isna(a) # NEP

Under the alterNEP proposal, you need to add two lines, like

    usable &= ~np.isna(a) # alterNEP
    usable &= a.visible # alterNEP

And either way, once you get your mask, you pretty much do the same thing: either use it directly, or use it to set up a masked array (of whatever flavor, and they all seem to work the same as far as this is concerned).

You seem to see some way in which the alterNEP's separation of masks and NA handling makes a big difference to your architecture, but I'm not getting it :-(.

-- Nathaniel
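Editorial sketch: a runnable variant of the two helpers above for current NumPy, followed by a small usage example. It is not the code from the email. It swaps in np.isfinite for the separate isnan/isinf tests and uses np.ma.getdata and np.ma.getmaskarray so that the result is always a plain boolean ndarray; it also combines arrays with `&` rather than `&=` so that inputs of different but broadcastable shapes work. np.isna and a.visible from the two proposals do not exist in NumPy and are left out.

```python
import numpy as np


def usable_points(a):
    """Boolean array, True where `a` is finite and not masked."""
    a = np.asanyarray(a)
    # Look at the raw data so the result is a plain ndarray of bools.
    usable = np.isfinite(np.ma.getdata(a))
    # getmaskarray is all False for arrays that carry no mask.
    usable &= ~np.ma.getmaskarray(a)
    return usable


def all_usable(a, *rest):
    """True where every input is usable; inputs may broadcast."""
    usable = usable_points(a)
    for other in rest:
        usable = usable & usable_points(other)
    return usable


x = np.array([0.0, 1.0, np.nan, 3.0])
y = np.ma.array([0.0, np.inf, 2.0, 3.0], mask=[True, False, False, False])

ok = all_usable(x, y)
# Select only the points that can be plotted; x and y are not modified.
x_plot = x[ok]
y_plot = np.ma.getdata(y)[ok]
```

In this example only the last point survives: the first is masked in y, the second is inf in y, and the third is NaN in x.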

Lluís

5:39 p.m.

New subject: alterNEP - was: missing data discussion round 2

Matthew Brett writes:

...

...
...
...
Mainly: Reduced interoperability

Meaning?

You can't switch between the two approaches without big changes in your code.

...

Lluis provided a case, and it was obscure. That switch seems like a rare or non-existent use-case that should not guide the API.

The example was for an outlier detection *in-place*. I see the merged API as beneficial in cases where:

* There are arguments used both as input *and* output (w.r.t. missing data information), and it is up to the *caller* to decide whether to also maintain the original data. That is, with a merged API, the caller can retain a "copy" - a view in fact - of its original data more efficiently.

  In the matplotlib case, the outlier detection caller might decide to pass a brand new array copy, so the outlier detection is implemented using np.NA (as they are both developed inside the same framework). But it may also be the case that later on, the developer decides to rewrite the caller function (for whatever reason, like avoiding a full copy of the array) as passing an array with masking activated. With the merged API the outlier detection will still work perfectly. With np.IGNORE the outlier detection code should also be changed.

  This is what Mark talks about when saying "interoperability", and it is a good choice from the point of view of code maintenance.

* Propagation of np.NA and np.IGNORE is controlled with a single argument (thus simpler and less error-prone code), as opposed to two separate arguments and two possible outcomes (np.NA and np.IGNORE) with aNEP.

I have been repeating these 2 points again and again, and I still feel they have not yet been addressed by the aNEP.

Still, the only clear statement I've seen in favour of the aNEP is minimizing "surprises". And I will repeat it again. You have to *explicitly* "activate" masks, just as well as you *explicitly* use np.IGNORE, so it should not surprise you when you see a mask-like behaviour, precisely because you have asked for it. If you don't want that behaviour, you simply don't activate masks.

Lluis

-- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
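Editorial sketch: a pure-Python toy of the two-flag reduction semantics set out in the aNEP examples at the top of the thread (NA propagates unless skipna is set; masked values are skipped unless propagation is requested). Nothing here is NumPy API. The names NA, IGNORE, toy_sum and the flag propagate_ignore are hypothetical, and the precedence when both flags interact is a choice made for this sketch, not something the aNEP text quoted here specifies.

```python
class _Sentinel:
    """Minimal stand-in for a special value such as NA or IGNORE."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")          # a value that is missing
IGNORE = _Sentinel("IGNORE")  # a value that is present but hidden


def toy_sum(values, skipna=False, propagate_ignore=False):
    """Sum a list under the toy aNEP-style rules described above."""
    values = list(values)
    # Hidden values are skipped by default; propagation is opt-in.
    # Assumption of this sketch: IGNORE propagation is checked first.
    if propagate_ignore and any(v is IGNORE for v in values):
        return IGNORE
    kept = [v for v in values if v is not IGNORE]
    # Missing values propagate by default; skipping is opt-in.
    if any(v is NA for v in kept):
        if not skipna:
            return NA
        kept = [v for v in kept if v is not NA]
    return float(sum(kept))
```

With this toy, toy_sum([1.0, 2.0, NA, 7.0]) returns NA and toy_sum([1.0, 2.0, IGNORE, 7.0]) returns 10.0, mirroring the aNEP examples; the two keyword flags are the "two separate arguments" that the second bullet above objects to.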

Nathaniel Smith

3:15 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...
If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.* Can you give any examples of how the implementation would be more complicated? As far as I can tell there are no elements in the alterNEP that are not in your NEP, they mostly just expose the functionality differently at the top level. Do you have a clearer theoretical model for the masked part of your proposal? The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing! It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP. Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I

...

...
Do you agree that the alterNEP proposal is easier to understand?

No.

...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I understand the desire not to get caught up in spending all your time writing emails explaining things that you feel like you've already explained. Maybe there's an email I missed somewhere where you explain the conceptual model behind your NEP's semantics in a short, easy-to-understand way (comparable to, say, the Rationale section of the alterNEP). But I haven't seen it and I can't reconstruct a rationale for it myself (the alterNEP comes out of my attempts to do so!).

...

...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate.

But the two implementations in your proposal are not interchangeable! The whole justification for starting with a masked-based implementation in your proposal is that it supports unmasking via views; if that requirement were removed, then there would be no reason to bother with the masking-based implementation at all. Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.

...

The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

You can't switch between Python and C without a lot of work too, but that doesn't mean that they should be merged into one design... but they do complement each other beautifully. Just like missing data and masked arrays :-).

...

The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

I know I'm being grumpy in this email, and I apologize for that. But, no. I've given extensive feedback, read the list carefully, and thought hard about these issues, and so far you've basically just dismissed my concerns. (See, e.g., [1], where your response to "we have to choose whether it's possible to recover data after it has been masked/NAed/whatever" is "no we don't, it should be both possible and impossible", which, I mean, what?) I've done my best to express them clearly, in the best way I know how -- and that way is *not* line by line comments on your NEP, because my concerns are more fundamental than that.

I am of course happy to answer questions and such if there are places where I've been unclear.

And of course it's your prerogative to decide how you want to spend your time (well, yours and your employer's, I guess), which forums you want to participate in, what code you want to write, etc. If you have decided that you are tired of talking about this and want to just go off and implement something, then good luck (and I do mean that, it isn't sarcasm).

But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look. So I guess my question is, once you've spent your limited time on writing this code -- how confident are you that it will be merged? This isn't a threat or anything, I have no power over what gets merged, but -- it seems to me that there's a real chance that you'll do this work and then it will go down in flames, or that it will be merged and then the people you're trying to target will ignore it anyway. This is why we try to build consensus first, right? I would love to find some way to make everyone happy (and have been doing what I can on that front), but right now I am not happy, other people are not happy, and you're communicating that you don't think that matters. I'd love for that to change.

-- Nathaniel

[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html

Bruce Southey

4:18 p.m.

On 07/01/2011 10:15 AM, Nathaniel Smith wrote:

...

On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe<mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett<matthew.brett@gmail.com> wrote:

...
Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...
If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.*

Can you give any examples of how the implementation would be more complicated? As far as I can tell there are no elements in the alterNEP that are not in your NEP, they mostly just expose the functionality differently at the top level.

Do you have a clearer theoretical model for the masked part of your proposal? The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing! It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP. Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I

...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I understand the desire not to get caught up in spending all your time writing emails explaining things that you feel like you've already explained.

Maybe there's an email I missed somewhere where you explain the conceptual model behind your NEP's semantics in a short, easy-to-understand way (comparable to, say, the Rationale section of the alterNEP). But I haven't seen it and I can't reconstruct a rationale for it myself (the alterNEP comes out of my attempts to do so!).

...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate.

But the two implementations in your proposal are not interchangeable! The whole justification for starting with a masked-based implementation in your proposal is that it supports unmasking via views; if that requirement were removed, then there would be no reason to bother with the masking-based implementation at all.

Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.

...
The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

You can't switch between Python and C without a lot of work too, but that doesn't mean that they should be merged into one design... but they do complement each other beautifully. Just like missing data and masked arrays :-).

...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

I know I'm being grumpy in this email, and I apologize for that. But, no. I've given extensive feedback, read the list carefully, and thought hard about these issues, and so far you've basically just dismissed my concerns. (See, e.g., [1], where your response to "we have to choose whether it's possible to recover data after it has been masked/NAed/whatever" is "no we don't, it should be both possible and impossible", which, I mean, what?) I've done my best to express them clearly, in the best way I know how -- and that way is *not* line by line comments on your NEP, because my concerns are more fundamental than that.

I am of course happy to answer questions and such if there are places where I've been unclear.

And of course it's your prerogative to decide how you want to spend your time (well, yours and your employer's, I guess), which forums you want to participate in, what code you want to write, etc. If you have decided that you are tired of talking about this and want to just go off and implement something, then good luck (and I do mean that, it isn't sarcasm).

But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look. So I guess my question is, once you've spent your limited time on writing this code -- how confident are you that it will be merged? This isn't a threat or anything, I have no power over what gets merged, but -- it seems to me that there's a real chance that you'll do this work and then it will go down in flames, or that it will be merged and then the people you're trying to target will ignore it anyway. This is why we try to build consensus first, right? I would love to find some way to make everyone happy (and have been doing what I can on that front), but right now I am not happy, other people are not happy, and you're communicating that you don't think that matters. I'd love for that to change.

-- Nathaniel

[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

I am sorry that that is NOT true - DON'T just lump every one into this when they have clearly stated the opposite! Missing values are nothing special to me, just reality. There are many statistical applications where masking is extremely common like outlier detection and flagging unusual observations (missing values is also masking). Just that you as a user have to do that yourself by creating and maintaining working variables.

I really find that you are 'splitting hairs' in your arguments as it really has to be up to the application on how missing values and NaN have to be handled. I see no difference between a missing value and a NaN because in virtually all statistical applications, both of these are dropped. This is what SAS typically does, although certain procedures like FREQ allow you to treat missing values as 'valid'. R has slightly more flexibility since it differentiates missing values and NaN. R allows you to decide how missing values are handled using arguments like na.rm or using the na.fail, na.omit, na.exclude, na.pass functions. But I think for the majority of cases (I'm not an R guru) R acts the same way: by default (which is how most people use R), R excludes missing values and NaNs.

One of the problems I see here is that numpy has to work with a wide range of situations that neither R nor SAS nor any other statistics-based language/application has to deal with. So what you suggest has to work for string, integer and date/time arrays.

I generally agree with what Chuck has said. But I know that while we have little say in some of numpy, we can file tickets that actually get some action. It also shows how times change, as this missing value topic has way more interest than the previous times it has been raised. So I think we are gradually getting some positive awareness.

Bruce

Matthew Brett

4:24 p.m.

New subject: alterNEP - was: missing data discussion round 2

Hi, On Fri, Jul 1, 2011 at 5:18 PM, Bruce Southey <bsouthey@gmail.com> wrote:

...

On 07/01/2011 10:15 AM, Nathaniel Smith wrote:

...

I really find that you are 'splitting hairs' in your arguments as it really has to be up to the application on how missing values and NaN have to be handled. I see no difference between a missing value and a NaN because in virtually all statistical applications, both of these are dropped.

The argument is that NA and IGNORE are conceptually different and should have separate APIs, and that if they don't, it will be confusing. By default, in alterNEP, NAs propagate and masked values are ignored. If you want to treat them just the same, then that's an argument to your ufunc. Or use an 'isvalid' utility function.

Do you have a concrete case where making NA and IGNORE the same thing in the API gives some benefit?

Best, Matthew
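The two defaults described here can be sketched in plain Python. This is only an illustration, not NumPy code: the `NA` and `MASKED` sentinels and the helpers `sum_na` and `sum_masked` are hypothetical stand-ins, and only the flag names `skipna` and `propmsk` are taken from the alterNEP draft.

```python
# Hypothetical stand-ins for illustration only; none of this is NumPy API.
NA = object()      # "value not available"
MASKED = object()  # "value present but hidden"


def sum_na(values, skipna=False):
    """Sum a sequence that may contain NA.

    NA propagates by default; skipna=True drops NAs before summing.
    """
    if not skipna and any(v is NA for v in values):
        return NA
    return sum(v for v in values if v is not NA)


def sum_masked(values, propmsk=False):
    """Sum a sequence that may contain MASKED.

    MASKED entries are ignored by default; propmsk=True propagates them.
    """
    if propmsk and any(v is MASKED for v in values):
        return MASKED
    return sum(v for v in values if v is not MASKED)


na_arr = [1.0, 2.0, NA, 7.0]
masked_arr = [1.0, 2.0, MASKED, 7.0]

print(sum_na(na_arr) is NA)                            # True
print(sum_na(na_arr, skipna=True))                     # 10.0
print(sum_masked(masked_arr))                          # 10.0
print(sum_masked(masked_arr, propmsk=True) is MASKED)  # True
```

The same reduction gives opposite results by default for the two kinds of value, and in each case a flag is enough to get the other behavior.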

Nathaniel Smith

3:47 a.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 9:18 AM, Bruce Southey <bsouthey@gmail.com> wrote:

...

I am sorry that that is NOT true - DON'T just lump every one into this when they have clearly stated the opposite! Missing values are nothing special to me, just reality. There are many statistical applications where masking is extremely common like outlier detection and flagging unusual observations (missing values is also masking). Just that you as a user have to do that yourself by creating and maintaining working variables.

Thanks for speaking up -- we all definitely want something that will work as well as possible for everyone! I'm a little confused about what you're saying, though -- I assume that you mean that you're happy with the NEP proposal for handling NA values[1], and so I misrepresented you when I said that everyone doing statistics with missing values had concerns about the NEP? If so, then my apologies. [1] https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e...

...

I really find that you are 'splitting hairs' in your arguments as it really has to be up to the application on how missing values and NaN have to be handled. I see no difference between a missing value and a NaN because in virtually all statistical applications, both of these are dropped. This is what SAS typically does, although certain procedures like FREQ allow you to treat missing values as 'valid'. R has slightly more flexibility since it differentiates missing values and NaN. R allows you to decide how missing values are handled using arguments like na.rm or using the na.fail, na.omit, na.exclude, na.pass functions. But I think for the majority of cases (I'm not an R guru) R acts the same way: by default (which is how most people use R), R excludes missing values and NaNs.

Is your point here that NA and NaN are pretty similar, so it's splitting hairs to differentiate them? They are pretty similar, but this is the justification I wrote for having both in the alterNEP (https://gist.github.com/1056379):

"For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency."

Does that seem reasonable?

In any case, my arguments haven't really been about NA versus NaN -- everyone seems to agree that we want something like NA. In the NEP proposal, there are two different versions of NAs, one that's implemented using special values (e.g., a special NaN that means NA) and one that's implemented by using a secondary mask array. My argument has been that for people who just want NAs, this secondary mask version is redundant and confusing; but the mask version doesn't really help the people who want "masked arrays" either, because it's working too hard to be compatible with NAs, and the masked array people want different behavior (unmasking, automatic skipping of NAs, etc.). So it doesn't really work well for anybody.

-- Nathaniel
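The imputation example in the quoted rationale can be sketched as follows. This is a plain-Python illustration under stated assumptions: `NA` is a hypothetical sentinel and `impute_mean` a hypothetical helper, and mean imputation is just one simple choice of procedure.

```python
import math

NA = object()  # hypothetical sentinel for "not available"; not NumPy API


def impute_mean(values):
    """Replace each NA with the mean of the ordinary values.

    NaNs are left alone: they record an invalid computation, not a gap.
    Assumes at least one ordinary (non-NA, non-NaN) value is present.
    """
    observed = [v for v in values if v is not NA and not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if v is NA else v for v in values]


data = [1.0, NA, math.nan, 3.0]
result = impute_mean(data)
print(result)  # [1.0, 2.0, nan, 3.0]
```

If NA and NaN shared one representation, the procedure could not tell the gap from the invalid result and would have to treat both the same way.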

Christopher Jordan-Squire

4:29 p.m.

New subject: alterNEP - was: missing data discussion round 2

This is kind of late to be jumping into the 'long thread of doom', but I've been following most of the posts, so I figured I'd throw in my 2 cents. I'm Mark's officemate over the summer, and we've been talking daily about his design. I was skeptical of various details at first, but by now Mark's largely sold me on his design. Though, FWIW, my background is largely statistical uses of arrays rather than scientific uses, so I grok missing data usage more naturally than masking.

On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...
If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.*

Can you give any examples of how the implementation would be more complicated? As far as I can tell there are no elements in the alterNEP that are not in your NEP, they mostly just expose the functionality differently at the top level.

Do you have a clearer theoretical model for the masked part of your proposal? The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing! It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP. Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I

I looked over the theoretical model in the aNEP, and I disagree with it. I think a masked array is just that: an array with a mask. Do whatever with the mask, but it's up to the user to decide how they want to use it. It doesn't seem like it has to come with a theoretical model. (Unlike missing data, which does have a nice theoretical model.)

The theoretical model in the aNEP seems to assume too much. I'm thinking in particular of this idea: "a length-4 array in which the last value has been masked out behaves just like an ordinary length-3 array, so long as you don't change the mask." That's forcing a notion of column/position independence on the masked array, in that any function operating on the rows must treat each column the same. And I don't think that's part of the contract that should come from creating a masked array.

...

...
Do you agree that the alterNEP proposal is easier to understand?

No.

...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I understand the desire not to get caught up in spending all your time writing emails explaining things that you feel like you've already explained.

Maybe there's an email I missed somewhere where you explain the conceptual model behind your NEP's semantics in a short, easy-to-understand way (comparable to, say, the Rationale section of the alterNEP). But I haven't seen it and I can't reconstruct a rationale for it myself (the alterNEP comes out of my attempts to do so!).

...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate.

But the two implementations in your proposal are not interchangeable! The whole justification for starting with a masked-based implementation in your proposal is that it supports unmasking via views; if that requirement were removed, then there would be no reason to bother with the masking-based implementation at all.

Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.

...
The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

You can't switch between Python and C without a lot of work too, but that doesn't mean that they should be merged into one design... but they do complement each other beautifully. Just like missing data and masked arrays :-).

...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

I know I'm being grumpy in this email, and I apologize for that. But, no. I've given extensive feedback, read the list carefully, and thought hard about these issues, and so far you've basically just dismissed my concerns. (See, e.g., [1], where your response to "we have to choose whether it's possible to recover data after it has been masked/NAed/whatever" is "no we don't, it should be both possible and impossible", which, I mean, what?) I've done my best to express them clearly, in the best way I know how -- and that way is *not* line by line comments on your NEP, because my concerns are more fundamental than that.

I am of course happy to answer questions and such if there are places where I've been unclear.

And of course it's your prerogative to decide how you want to spend your time (well, yours and your employer's, I guess), which forums you want to participate in, what code you want to write, etc. If you have decided that you are tired of talking about this and want to just go off and implement something, then good luck (and I do mean that, it isn't sarcasm).

But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look. So I guess my question is, once you've spent your limited time on writing this code -- how confident are you that it will be merged? This isn't a threat or anything, I have no power over what gets merged, but -- it seems to me that there's a real chance that you'll do this work and then it will go down in flames, or that it will be merged and then the people you're trying to target will ignore it anyway. This is why we try to build consensus first, right? I would love to find some way to make everyone happy (and have been doing what I can on that front), but right now I am not happy, other people are not happy, and you're communicating that you don't think that matters. I'd love for that to change.

I'm a statistics grad student and an R user, and I'm mostly ok with what Mark is doing. Currently, as I understand it, Mark is working on a structure that will make missing data into a first-class citizen in the numpy world. This is great! Before it had been more of a second-class citizen. And Mark is even trying to copy R semantics as much as possible.

It's true that Mark's making it so the masked part of these new arrays won't be as front and center. The functionality will be there and it will be easy to use. But it will be based more on an explicit contract that the data memory contents of a masked array will not be overwritten when the data is masked. So I don't think Mark is making anything implicit--he's making a very explicit contract about how the data memory is handled when the mask is changed.

If I understand correctly, it seems like the main objection to Mark's current API is that the explicit contract about data memory isn't somehow immediately visible in the API. It's true this is a trade-off, but it leads to a simpler API with easier ability to use all features at once at the pretty small cost of the user just having to read enough to realize that there's an explicit contract about what happens to the memory of a masked value, and they can access it by taking a view. That's easy enough to add at the very beginning of the documentation.

-Chris JS

...

-- Nathaniel

[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html

Nathaniel Smith

4:40 a.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 9:29 AM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

This is kind of late to be jumping into the 'long thread of doom', but I've been following most of the posts, so I figured I'd throw in my 2 cents. I'm Mark's officemate over the summer, and we've been talking daily about his design. I was skeptical of various details at first, but by now Mark's largely sold me on his design. Though, FWIW, my background is largely statistical uses of arrays rather than scientific uses, so I grok missing data usage more naturally than masking.

Always good to hear more perspectives! Thanks for speaking up.

...

I looked over the theoretical model in the aNEP, and I disagree with it. I think a masked array is just that: an array with a mask. Do whatever with the mask, but it's up to the user to decide how they want to use it. It doesn't seem like it has to come with a theoretical model. (Unlike missing data, which does have a nice theoretical model.)

I'm not sure what you mean here. If we have masked array support at all (and some people seem to want it), then we have to say more than "it's an array with a mask". Indexing such a beast has to do *something*, so we need some kind of theory to say what, ufuncs have to do *something*, ditto. I mean, I guess we could just say that a masked array is literally an np.ndarray where you have attached a field named "mask" that doesn't do anything, but I don't think that would really satisfy most users :-).

...

The theoretical model in the aNEP seems to assume too much. I'm thinking in particular of this idea: "a length-4 array in which the last value has been masked out behaves just like an ordinary length-3 array, so long as you don't change the mask." That's forcing a notion of column/position independence on the masked array, in that any function operating on the rows must treat each column the same. And I don't think that's part of the contract that should come from creating a masked array.

I'm really lost on what you mean by columns versus rows here. In that sentence I'm literally saying that these two 1-d arrays should behave the same:

    [1, 2, 3]
    [1, 2, 3, --]

For example, we have to decide what np.sum should do on the second array. Well, this says that it should work like this:

    >>> np.sum(np.array([1, 2, 3, np.IGNORE]))
    6

Why? Because that's what happens when we do this:

    >>> np.sum(np.array([1, 2, 3]))
    6

There are other ways to think about how masked arrays should act, but this seemed like one plausible heuristic to put out there as a starting point.

...If you still have an objection, could you rephrase it? And any thoughts on how I could phrase that better?
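As a rough check of that heuristic, the plain-Python sketch below compares several reductions on a masked length-4 sequence and an ordinary length-3 one. The helper `visible` and the convention that `True` in the mask means "hidden" are assumptions made for the example, not part of either proposal.

```python
def visible(values, mask):
    """Return the values whose mask entry is False (True means hidden)."""
    return [v for v, hidden in zip(values, mask) if not hidden]


plain = [1, 2, 3]
values = [1, 2, 3, 99]               # 99 is still stored, just hidden
mask = [False, False, False, True]

# Under the heuristic, every reduction sees only the visible part.
for reduction in (sum, max, min, len):
    assert reduction(visible(values, mask)) == reduction(plain)

print(sum(visible(values, mask)))  # 6
```

The hidden value never reaches the reduction, so its payload (99 here) cannot affect any result until the mask changes.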

...

I'm a statistics grad student and an R user, and I'm mostly ok with what Mark is doing. Currently, as I understand it, Mark is working on a structure that will make missing data into a first-class citizen in the numpy world. This is great! Before it had been more of a second-class citizen. And Mark is even trying to copy R semantics as much as possible.

Yes, it's wonderful!

...

It's true that Mark's making it so the masked part of these new arrays won't be as front and center. The functionality will be there and it will be easy to use. But it will be based more on an explicit contract that the data memory contents of a masked array will not be overwritten when the data is masked. So I don't think Mark is making anything implicit--he's making a very explicit contract about how the data memory is handled when the mask is changed. If I understand correctly, it seems like the main objection to Mark's current API is that the explicit contract about data memory isn't somehow immediately visible in the API. It's true this is a trade-off, but it leads to a simpler API with easier ability to use all features at once at the pretty small cost of the user just having to read enough to realize that there's an explicit contract about what happens to the memory of a masked value, and they can access it by taking a view. That's easy enough to add at the very beginning of the documentation.

I don't know about others, but my main objection is this: He's proposing two different implementations for NA. I only need one, so having two is redundant and confusing. Of these two, the bit-pattern one has lower memory overhead (which many people have spoken up to say matters to them), and really obvious semantics (assignment is implemented as assignment, etc.). So why force people to make this confusing choice? What does the mask implementation add? AFAICT, its only purpose is to satisfy a rather different set of use cases. (See Gary Strangman's email here for a good description of these use cases: http://www.mail-archive.com/numpy-discussion@scipy.org/msg32385.html) But AFAICT again, it's been crippled for those use cases in order to give it the NA semantics. So I just don't see who the masking part is supposed to help.

BTW, you can't access the memory of a masked value by taking a view, at least if I'm reading this version of the NEP correctly, and it seems to be the latest: https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e... The only way to access the memory of a masked value is to take a view *before* you mask it. And if the array has a mask at all when you take the view, you also have to set a.flags.ownmask = True, before you mask the value.

-- Nathaniel
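The difference between the two storage strategies, as far as recovering a value goes, can be sketched in plain Python. This is an illustration under assumptions: two parallel lists stand in for a data buffer plus mask, NaN stands in for a reserved NA bit pattern, and the view and `ownmask` mechanics of the NEP are not modeled at all.

```python
import math

# Mask-based storage: the payload and the mask live side by side.
data = [1.0, 2.0, 3.0]
hidden = [False, False, False]

hidden[1] = True        # mask element 1; data[1] itself is not touched
assert data[1] == 2.0   # the original value is still in memory
hidden[1] = False       # unmasking makes it visible again
recovered = data[1]

# Bit-pattern storage: NA is written into the element itself.
# NaN is only a stand-in here for a reserved NA bit pattern.
values = [1.0, 2.0, 3.0]
values[1] = math.nan           # assignment really is assignment
lost = math.isnan(values[1])   # the previous 2.0 has been overwritten

print(recovered, lost)  # 2.0 True
```

The mask version pays for a second buffer and in exchange keeps the payload; the bit-pattern version needs no extra storage but cannot undo the assignment.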

Eric Firing

5:07 a.m.

New subject: alterNEP - was: missing data discussion round 2

On 07/01/2011 06:40 PM, Nathaniel Smith wrote:

...

On Fri, Jul 1, 2011 at 9:29 AM, Christopher Jordan-Squire

...

BTW, you can't access the memory of a masked value by taking a view, at least if I'm reading this version of the NEP correctly, and it seems to be the latest: https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e...

No, to see the latest you need to go to pull request #99, I believe: https://github.com/numpy/numpy/pull/99 From there click the diff button, then select doc/neps/missing-data.rst, then "view file" to get to a formatted view of the whole file in its most recent form. You can also look at the history of the file there. c-masked-array.rst was renamed to missing-data.rst and editing continued. Eric

Nathaniel Smith

2:34 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 10:07 PM, Eric Firing <efiring@hawaii.edu> wrote:

...

On 07/01/2011 06:40 PM, Nathaniel Smith wrote:

...
On Fri, Jul 1, 2011 at 9:29 AM, Christopher Jordan-Squire

...
BTW, you can't access the memory of a masked value by taking a view, at least if I'm reading this version of the NEP correctly, and it seems to be the latest: https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e...

No, to see the latest you need to go to pull request #99, I believe: https://github.com/numpy/numpy/pull/99 From there click the diff button, then select doc/neps/missing-data.rst, then "view file" to get to a formatted view of the whole file in its most recent form. You can also look at the history of the file there. c-masked-array.rst was renamed to missing-data.rst and editing continued.

Oh. Thanks for the link! Fortunately, I'm not seeing any changes that invalidate anything I've said here. The disappearance of .validitymask changes the details of my response earlier to Pierre, but not the content, I think. But sorry for the confusion. -- Nathaniel

Benjamin Root

8:10 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 11:40 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

I'm not sure what you mean here. If we have masked array support at all (and some people seem to want it), then we have to say more than "it's an array with a mask". Indexing such a beast has to do *something*, so we need some kind of theory to say what, ufuncs have to do *something*, ditto. I mean, I guess we could just say that a masked array is literally an np.ndarray where you have attached a field named "mask" that doesn't do anything, but I don't think that would really satisfy most users :-).

Indexing a masked array just returns an array with np.NA in the appropriate elements. This is no different than with regular ndarray objects or in numpy.ma. As for ufuncs, the NEP already addresses this in multiple ways. For element-wise ufuncs, a "where" parameter is available for indicating which elements to skip. For reduction ufuncs, a "skipna" parameter will indicate whether or not to skip the values. On top of that, subclassed ndarrays (such as numpy.ma, I guess) can create a __ufunc_wrap__ function that can set a default value for those parameters to make things easier for masked array users.

...

I don't know about others, but my main objection is this: He's proposing two different implementations for NA. I only need one, so having two is redundant and confusing. Of these two, the bit-pattern one has lower memory overhead (which many people have spoken up to say matters to them), and really obvious semantics (assignment is implemented as assignment, etc.). So why force people to make this confusing choice? What does the mask implementation add? AFAICT, its only purpose is to satisfy a rather different set of use cases. (See Gary Strangman's email here for a good description of these use cases: http://www.mail-archive.com/numpy-discussion@scipy.org/msg32385.html) But AFAICT again, it's been crippled for those use cases in order to give it the NA semantics. So I just don't see who the masking part is supposed to help.

As a user of numpy.ma, masked arrays have always been a second-class citizen to me. Developing new code with it always brought about new surprises and discoveries of strange behavior from various functions. In this sense, numpy.ma has always been crippled. By sacrificing *some* of the existing semantics (which would likely be taken care of by a re-implemented numpy.ma to preserve backwards-compatibility), the masked array community gains a first-class citizen in numpy, and numpy developers will have the masked/missing data issue in the forefront whenever developing new functions and libraries. I am more than happy with that trade-off. I am willing to learn the semantics so long as I have a guarantee that the functions I use behave the way I expect them to.

...

BTW, you can't access the memory of a masked value by taking a view, at least if I'm reading this version of the NEP correctly, and it seems to be the latest:

https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e... The only way to access the memory of a masked value is to take a view *before* you mask it. And if the array has a mask at all when you take the view, you also have to set a.flags.ownmask = True, before you mask the value.

This isn't actually as bad as it sounds. From a function's perspective, it should only know the values that it has been given access to. If I -- as a user of said function -- decide that certain values should be unknown to the function, I wouldn't want the function to be able to override that decision. Remember, it is possible that the masked element never was initialized. Therefore, we wouldn't want the function to use that element. (Note, this is one of those "fun" surprises that a numpy.ma user sometimes encounters when a function uses np.asarray instead of np.asanyarray). Ben Root
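The np.asarray versus np.asanyarray surprise mentioned above can be reproduced with the existing numpy.ma. A minimal sketch; the two helper functions are hypothetical and only stand for library code written in either style:

```python
import numpy as np

m = np.ma.masked_array([1.0, 2.0, 999.0, 7.0],
                       mask=[False, False, True, False])

def total_base(x):
    # np.asarray strips the MaskedArray subclass, so the mask is lost
    return np.asarray(x).sum()

def total_subclass_aware(x):
    # np.asanyarray passes subclasses through, so the mask is honoured
    return np.asanyarray(x).sum()

leaky = total_base(m)            # includes the hidden 999.0
safe = total_subclass_aware(m)   # skips the masked element
```

Here the first helper silently reads the value the caller meant to hide, which is the kind of override of the user's decision described above.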

josef.pktd＠gmail.com

2:35 a.m.

New subject: alterNEP - was: missing data discussion round 2

On Sat, Jul 2, 2011 at 4:10 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Fri, Jul 1, 2011 at 11:40 PM, Nathaniel Smith <njs@pobox.com> wrote:

...
I'm not sure what you mean here. If we have masked array support at all (and some people seem to want it), then we have to say more than "it's an array with a mask". Indexing such a beast has to do *something*, so we need some kind of theory to say what, ufuncs have to do *something*, ditto. I mean, I guess we could just say that a masked array is literally an np.ndarray where you have attached a field named "mask" that doesn't do anything, but I don't think that would really satisfy most users :-).

Indexing a masked array just returns an array with np.NA in the appropriate elements. This is no different than with regular ndarray objects or in numpy.ma. As for ufuncs, the NEP already addresses this in multiple ways. For element-wise ufuncs, a "where" parameter is available for indicating which elements to skip. For reduction ufuncs, a "skipna" parameter will indicate whether or not to skip the values. On top of that, subclassed ndarrays (such as numpy.ma, I guess) can create a __ufunc_wrap__ function that can set a default value for those parameters to make things easier for masked array users.
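As a rough illustration of the two mechanisms named above: the "where" argument does exist on NumPy ufuncs, while the "skipna" flag is the NEP's proposal and is not assumed to exist here. In this sketch NaN stands in for NA and np.nansum stands in for a skipna-style reduction; both substitutions are assumptions made for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 7.0])   # NaN used as a stand-in for NA
valid = ~np.isnan(x)

# Element-wise ufunc: slots where `valid` is False are not computed,
# so they keep whatever `out` already held (0.0 here).
doubled = np.zeros_like(x)
np.multiply(x, 2.0, out=doubled, where=valid)

# Reduction: np.nansum plays the role of the proposed skipna=True.
total_skip = np.nansum(x)

# Default reduction: the missing value propagates.
total_default = np.sum(x)
```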

...
I don't know about others, but my main objection is this: He's proposing two different implementations for NA. I only need one, so having two is redundant and confusing. Of these two, the bit-pattern one has lower memory overhead (which many people have spoken up to say matters to them), and really obvious semantics (assignment is implemented as assignment, etc.). So why force people to make this confusing choice? What does the mask implementation add? AFAICT, its only purpose is to satisfy a rather different set of use cases. (See Gary Strangman's email here for a good description of these use cases: http://www.mail-archive.com/numpy-discussion@scipy.org/msg32385.html) But AFAICT again, it's been crippled for those use cases in order to give it the NA semantics. So I just don't see who the masking part is supposed to help.

As a user of numpy.ma, masked arrays have always been a second-class citizen to me. Developing new code with it always brought about new surprises and discoveries of strange behavior from various functions. In this sense, numpy.ma has always been crippled. By sacrificing *some* of the existing semantics (which would likely be taken care of by a re-implemented numpy.ma to preserve backwards-compatibility), the masked array community gains a first-class citizen in numpy, and numpy developers will have the masked/missing data issue in the forefront whenever developing new functions and libraries. I am more than happy with that trade-off. I am willing to learn the semantics so long as I have a guarantee that the functions I use behave the way I expect them to.

...
BTW, you can't access the memory of a masked value by taking a view, at least if I'm reading this version of the NEP correctly, and it seems to be the latest:

https://github.com/m-paradox/numpy/blob/4afdb2768c4bb8cfe47c21154c4c8ca5f85e... The only way to access the memory of a masked value is to take a view *before* you mask it. And if the array has a mask at all when you take the view, you also have to set a.flags.ownmask = True, before you mask the value.

This isn't actually as bad as it sounds. From a function's perspective, it should only know the values that it has been given access to. If I -- as a user of said function -- decide that certain values should be unknown to the function, I wouldn't want the function to be able to override that decision. Remember, it is possible that the masked element never was initialized. Therefore, we wouldn't want the function to use that element. (Note, this is one of those "fun" surprises that a numpy.ma user sometimes encounters when a function uses np.asarray instead of np.asanyarray).

But as far as I understand this takes away the ability to temporarily fill in the masked values with values that are neutral for a calculation, e.g. zero when taking a sum or dot product. Instead it looks like a copy of the array has to be made in the new version. (I'm thinking more correlate, convolution, linalg, scipy.signal, not simple ufuncs. In many cases new arrays might be created anyway so the loss from getting a copy of the non-NA data might not be so severe.) I guess the "fun" surprises will remain fun since most functions in scipy or other libraries won't suddenly learn how to handle masked arrays or NAs. What happens if you feed the new animals to linalg.svd, or linalg.inv or fft ... that are all designed for asarray and not for asanyarray? Josef
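The "fill with a neutral value" pattern described above looks roughly like this with the existing numpy.ma. It is a sketch of the copy-based variant only, since how the NEP's arrays would support it is the open question:

```python
import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.array([4.0, 5.0, 6.0])

# filled() returns a new ndarray with the masked slot replaced by 0.0,
# which is neutral for a sum or dot product.
a_filled = a.filled(0.0)
dot = np.dot(a_filled, b)          # 1*4 + 0*5 + 3*6

# The filled array does not share memory with the original data.
copied = not np.shares_memory(a_filled, a.data)
```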

...

Ben Root


Mark Wiebe

5:14 p.m.

New subject: alterNEP - was: missing data discussion round 2

On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...
If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.*

I've given examples before, but here are a few: 1) You're using NA dtypes. You realize you want multiple views of the same data with different choices of NA. You switch to masked arrays with a few lines of code changes. 2) You're using masks. You realize that you will save memory/disk space if you switch to NA dtypes, and it's possible because it turned out that while you thought you would need masking, you came up with a new algorithm that didn't require it. 3) You're writing matplotlib, and you want to support all forms of NA-style data. You write it once instead of twice. Repeat for all other open source libraries that want to do this.
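Example 1 above (several views of the same data with different choices of what is hidden) can be approximated with the existing numpy.ma. This is a sketch under that assumption and does not use the NEP's proposed API:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked views over one buffer, each hiding a different element.
view_a = np.ma.masked_array(data, mask=[False, True, False, False])
view_b = np.ma.masked_array(data, mask=[False, False, False, True])

shared = np.shares_memory(view_a.data, view_b.data)
sum_a = view_a.sum()   # 1 + 3 + 4
sum_b = view_b.sum()   # 1 + 2 + 3
```

A bit-pattern NA cannot do this, because the marker lives in the data buffer itself and is therefore the same for every view.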

...

Can you give any examples of how the implementation would be more complicated? As far as I can tell there are no elements in the alterNEP that are not in your NEP, they mostly just expose the functionality differently at the top level.

If that is the case, then it should be easy to change to your model after the implementation is complete. I'm happy with that; this style of design choice is easier to make when you're comparing actual usage than hypotheticals.

...

Do you have a clearer theoretical model for the masked part of your proposal?

Yes, exactly the same model used for NA dtypes.

...

The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing!

Ok, missing data-like functionality, which is provided by the solid theory behind the missing data.

...

It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP.

Once you've hidden it, isn't it now missing?

...

Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I

...

...
...
Do you agree that the alterNEP proposal is easier to understand?

No.

...
If not, can you explain why?

My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.

I understand the desire not to get caught up in spending all your time writing emails explaining things that you feel like you've already explained.

Maybe there's an email I missed somewhere where you explain the conceptual model behind your NEP's semantics in a short, easy-to-understand way (comparable to, say, the Rationale section of the alterNEP). But I haven't seen it and I can't reconstruct a rationale for it myself (the alterNEP comes out of my attempts to do so!).

I've been repeatedly updating the NEP. In particular this "round 2" email was an attempt to clarify between the two missing data models (what's being called NA and IGNORE), and the two implementation techniques (NA bit patterns and masks). I've argued that these are completely independent from each other.
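The claimed independence of semantics (NA versus IGNORE) from storage (bit pattern versus mask) can be sketched as follows. All function names are hypothetical, NaN stands in for an NA bit pattern, and nothing here is taken from the NEP's implementation:

```python
import numpy as np

def missing_from_bitpattern(values):
    # Storage technique 1: a reserved bit pattern (NaN here) marks missing slots.
    return np.isnan(values)

def missing_from_mask(mask):
    # Storage technique 2: a separate boolean array marks missing slots.
    return np.asarray(mask, dtype=bool)

def na_sum(values, missing, skip=False):
    # Semantics are chosen here, independently of how `missing` was obtained:
    # skip=False propagates (NA-style), skip=True ignores (IGNORE-style).
    if skip:
        return float(np.sum(values[~missing]))
    if missing.any():
        return float("nan")
    return float(np.sum(values))

bits = np.array([1.0, 2.0, np.nan, 7.0])
stored = np.array([1.0, 2.0, 99.0, 7.0])
mask = [False, False, True, False]

skip_bits = na_sum(bits, missing_from_bitpattern(bits), skip=True)
skip_mask = na_sum(stored, missing_from_mask(mask), skip=True)
prop_bits = na_sum(bits, missing_from_bitpattern(bits))
prop_mask = na_sum(stored, missing_from_mask(mask))
```

Both storage techniques give the same answer under either semantics; the counter-argument in this thread is about what else each technique allows, such as unmasking.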

...

...
...
What do you see as the important points of difference between the NEP and the alterNEP?

The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate.

But the two implementations in your proposal are not interchangeable! The whole justification for starting with a masked-based implementation in your proposal is that it supports unmasking via views; if that requirement were removed, then there would be no reason to bother with the masking-based implementation at all.

They are interchangeable 100% with regard to the missing data semantics. Views are an orthogonal feature, and it is through composition of these two features that the masks gain this power.

...

Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.

...
The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.

You can't switch between Python and C without a lot of work too, but that doesn't mean that they should be merged into one design... but they do complement each other beautifully. Just like missing data and masked arrays :-).

This last statement is why I feel like you haven't been reading my emails. I've clearly positioned masks as an implementation technique, not implying any specific semantics.

...

...
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.

I know I'm being grumpy in this email, and I apologize for that. But, no. I've given extensive feedback, read the list carefully, and thought hard about these issues, and so far you've basically just dismissed my concerns. (See, e.g., [1], where your response to "we have to choose whether it's possible to recover data after it has been masked/NAed/whatever" is "no we don't, it should be both possible and impossible", which, I mean, what?) I've done my best to express them clearly, in the best way I know how -- and that way is *not* line by line comments on your NEP, because my concerns are more fundamental than that.

I've likewise read your emails carefully, and really appreciated that you jumped in right at the beginning with a good explanation of R's missing value semantics. I think line by line comments on the NEP expressing where the fundamental problems are would help us communicate better. I've tried to tease apart the distinction between the missing value abstractions and the implementation techniques, and I haven't seen the fact that you read that reflected in your emails. If you have a good reason why implementing something with masks implies certain semantics, please explain, dealing with the points that I've laid out arguing for this design choice in the latest NEP, accessible via the pull request.

...

I am of course happy to answer questions and such if there are places where I've been unclear.

And of course it's your prerogative to decide how you want to spend your time (well, yours and your employer's, I guess), which forums you want to participate in, what code you want to write, etc. If you have decided that you are tired of talking about this and want to just go off and implement something, then good luck (and I do mean that, it isn't sarcasm).

I do want to constructively engage the community at the same time as I do the implementation, and I have a track record of producing good interfaces even when the underlying functionality is complex. I've had very positive feedback about einsum from people who deal with multiple arrays of multidimensional data and were missing an easy way to do that kind of operation.

...

But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look. So I guess my question is, once you've spent your limited time on writing this code -- how confident are you that it will be merged? This isn't a threat or anything, I have no power over what gets merged, but -- it seems to me that there's a real chance that you'll do this work and then it will go down in flames, or that it will be merged and then the people you're trying to target will ignore it anyway. This is why we try to build consensus first, right? I would love to find some way to make everyone happy (and have been doing what I can on that front), but right now I am not happy, other people are not happy, and you're communicating that you don't think that matters. I'd love for that to change.

Building consensus is in general virtually impossible; I'm, for example, very impressed with the C++ standards committee's success in achieving it where they have. My development process is different from what you're describing. Like with datetime, I am merging periodically, not doing one big merge at the end. There's a reason why design by committee is frowned upon. The feedback is great, but still needs to go through a very strict software design quality filter. -Mark

...

-- Nathaniel

[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html

Lluís

5:47 p.m.

New subject: alterNEP - was: missing data discussion round 2

Nathaniel Smith writes:

...

On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Do you see problems with the alterNEP proposal?

Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.

...
If so, what are they?

Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.

...

Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.*

Interoperability improves code maintenance, see my other mail. [...]

...

Do you have a clearer theoretical model for the masked part of your proposal? The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing! It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP. Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I

Come on, let's not jump into each other's throats, I think we've long ago arrived at a point where we all know what masked means. If you agree on the interoperability point, then I don't see how the aNEP improves on that, having in mind that masks must be *explicitly* activated (again, see the other mail). [...]

...

Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.

I for one would love that, instead of having to explicitly set dtypes when using genfromtxt. [...]
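For context on the integers+NAs case: with current NumPy, an integer column with gaps can be read by genfromtxt either as floats with NaN or, as sketched below, as integers plus a mask. The input text is made up for the example, and the explicit dtype is the step a native integer NA would make unnecessary.

```python
import io
import numpy as np

text = "1,2,,4\n5,,7,8\n"   # made-up data with two empty fields

# usemask=True returns a MaskedArray; empty fields are flagged in the mask
# and the integer dtype is kept because it is requested explicitly.
table = np.genfromtxt(io.StringIO(text), delimiter=",",
                      dtype=int, usemask=True)

total = table.sum()          # masked entries are skipped
```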

...

But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look.

What I have seen is that people used to R see the mask concept as an alien, and said "I don't want to use it, so please make it more explicit so that I will know what to avoid". What I say is that you simply don't have to make np.IGNORE explicit to avoid masks. Simply do not create arrays with masks. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth

45 comments

15 participants

participants (15)

Benjamin Root
Bruce Southey
Charles R Harris
Christopher Barker
Christopher Jordan-Squire
Dag Sverre Seljebotn
Eric Firing
Gary Strangman
josef.pktd＠gmail.com
Keith Goodman
Lluís
Mark Wiebe
Matthew Brett
Nathaniel Smith
Pierre GM
