Mailman 3 using the same vocabulary for missing value ideas - NumPy-Discussion

using the same vocabulary for missing value ideas

Mark Wiebe

6 Jul 2011

3:40 p.m.

It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills.

In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

IGNORE (Skip/Ignore)
    A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

bitpattern
    A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask
    A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma
    The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

Thanks,
Mark
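To make the NA and IGNORE definitions concrete, here is a minimal sketch that uses only features NumPy already has. It assumes that NaN can stand in for an NA bitpattern and that a numpy.ma mask can stand in for IGNORE; neither is the np.NA proposed in the NEP, and the variable names are illustrative only.

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0])

# NA-like (propagating) semantics, with NaN as an assumed stand-in for NA:
# any unknown input makes the result unknown.
with_nan = data.copy()
with_nan[1] = np.nan
na_sum = with_nan.sum()
na_prod = with_nan.prod()

# IGNORE-like semantics, using a numpy.ma mask: the masked element is
# treated as if it did not exist.
masked = np.ma.masked_array(data, mask=[False, True, False])
ignore_sum = masked.sum()
ignore_prod = masked.prod()

# The glossary's description of IGNORE: act as if the value were zero for
# sums and one for products, which matches compressing the element away.
mask = np.ma.getmaskarray(masked)
sum_with_zero = np.where(mask, 0.0, data).sum()
prod_with_one = np.where(mask, 1.0, data).prod()
compressed_sum = masked.compressed().sum()
compressed_prod = masked.compressed().prod()
```

Here the choice of NaN for NA and of a mask for IGNORE is only a convenience of the sketch; the glossary treats the semantics and the storage technique as independent.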

Peter

6 Jul

4:33 p.m.

New subject: using the same vocabulary for missing value ideas

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

That sounds good - I've only been scanning these discussions and it is confusing.

...

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case)?

Thanks,

Peter

Mark Wiebe

6:42 p.m.

On Wed, Jul 6, 2011 at 11:33 AM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

That sounds good - I've only been scanning these discussions and it is confusing.

...
NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Could you expand that to say how sums and products act with NA (since you do so for the IGNORE case).

I've added that, here's the new version:

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. For sums and products this means to produce NA if any of the inputs are NA. This is the same as NA in the R project.

Thanks,
-Mark

...

Thanks,

Peter

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Matthew Brett

4:38 p.m.

Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...

It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

...

IGNORE (Skip/Ignore)
    A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

bitpattern
    A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask
    A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma
    The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

I agree that there has been some confusion due to the terms.

However, I continue to believe that the discussion is substantial and not due to confusion.

Let us then characterize the substantial discussion as this:

NEP: bitpattern and masked out values should be made nearly impossible to distinguish in the API

alterNEP: bitpattern and masked out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented).

Do you agree that this is the discussion?

See you,

Matthew

Peter

4:48 p.m.

On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

...

...
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

The point, as I understood it, is that the semantics of the special values (not available vs ignore) and the implementation (bitpattern vs mask) are independent.

Peter

Matthew Brett

5:01 p.m.

Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...

On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

...

...
...
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

The point, as I understood it, is that the semantics of the special values (not available vs ignore) and the implementation (bitpattern vs mask) are independent.

Yes.

Although, we can see from the implementations that we have to hand that

a) bitpatterns -> propagation (NaN-like) semantics by default (R)
b) masks -> ignore semantics by default (masked arrays)

I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP.

I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics.

Cheers,

Matthew
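The two "non-default" pairings mentioned here (a mask with propagation semantics, a bitpattern with ignore semantics) can be sketched with existing NumPy. This is only an illustration under assumptions: NaN stands in for an NA bitpattern, a plain boolean array stands in for the mask, and the helper names masked_sum_propagate and masked_sum_ignore are hypothetical, not part of NumPy or of either NEP.

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0])
missing = np.array([False, True, False])  # True marks a missing element


def masked_sum_propagate(values, mask):
    """Mask implementation with propagation semantics (hypothetical helper)."""
    return np.nan if mask.any() else values.sum()


def masked_sum_ignore(values, mask):
    """Mask implementation with ignore semantics (hypothetical helper)."""
    return values[~mask].sum()


# Bitpattern implementation: the missing marker is stored in the data itself.
bitpattern = np.where(missing, np.nan, data)
bit_propagate = bitpattern.sum()    # plain sum propagates the NaN
bit_ignore = np.nansum(bitpattern)  # nansum skips the NaN

mask_propagate = masked_sum_propagate(data, missing)
mask_ignore = masked_sum_ignore(data, missing)
```

All four combinations give one of two answers, which is the sense in which the storage technique does not by itself fix the semantics.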

Benjamin Root

5:11 p.m.

On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP.

...

...
...
...
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

The point, as I understood it, is that the semantics of the special values (not available vs ignore) and the implementation (bitpattern vs mask) are independent.

Yes.

Good, that's all Mark's definition guide is trying to do.

...

Although, we can see from the implementations that we have to hand that

a) bitpatterns -> propagation (NaN-like) semantics by default (R)
b) masks -> ignore semantics by default (masked arrays)

The above is extraneous and out of the scope of Mark's definitions. We are taking this little-by-little.

...

I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP.

Then that is what we will debate *later*, once we establish definitions.

...

I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics.

Good! I think that is what Mark wanted to get across in this set of definitions.

It kinda seems like you are champing at the bit here to continue the debate, but I agree with Mark that after yesterday's discussion, we need to make sure that we have a solid foundation for understanding each other.

Ben Root

Matthew Brett

5:44 p.m.

Hi,

On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP.

I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on.

I don't think anyone disputes (or has ever disputed) that:

There can be missing data implemented with bitpatterns
There can be missing data implemented with masks
Missing data can have propagate semantics
Missing data can have ignore semantics.
The implementation does not in itself constrain the semantics.

Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement?

Or are you saying that there should be no disagreement at this stage?

Cheers,

Matthew

Christopher Jordan-Squire

6:10 p.m.

On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root <ben.root@ou.edu> wrote:

...
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP.

I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on.

I don't think anyone disputes (or has ever disputed) that:

There can be missing data implemented with bitpatterns
There can be missing data implemented with masks
Missing data can have propagate semantics
Missing data can have ignore semantics.
The implementation does not in itself constrain the semantics.

So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction.

Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.)

A related question is if the IGNORE values could just be another NA value? I don't understand what the specific problem would be with having several NA values, say NA(1), NA(2), ..., and then letting the user decide that NA(1) means NA in the sense discussed above and NA(2) means IGNORE. Then the ufuncs could be told whether to ignore or propagate each type of NA value.

Could you explain to me if this would resolve your concerns about NA/IGNORE, or possibly give a few examples if it doesn't? Because I am still rather confused.

...

Let's not discuss that any more; we all agree. So what do you think is the source of the disagreement?

Or are you saying that there should be no disagreement at this stage?

Cheers,

Matthew
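The NA(1), NA(2), ... idea in the message above, with the reduction told per kind whether to propagate or ignore, could look roughly like the following sketch. Everything here is an assumption for illustration: the integer `kinds` array (0 for a present value, k for NA(k)), the function name reduce_sum, and its `propagate` parameter are hypothetical and do not correspond to any NumPy API or to the NEP.

```python
import numpy as np


def reduce_sum(values, kinds, propagate=()):
    """Sum `values`, treating each NA kind according to a per-kind policy.

    `kinds` is an integer array parallel to `values`: 0 means the value is
    present, k > 0 means the element is NA(k). Kinds listed in `propagate`
    make the whole result NaN; every other NA kind is skipped.
    """
    values = np.asarray(values, dtype=float)
    kinds = np.asarray(kinds)
    if any((kinds == k).any() for k in propagate):
        return np.nan
    return values[kinds == 0].sum()


values = np.array([10.0, 20.0, 30.0, 40.0])
kinds = np.array([0, 1, 0, 2])  # NA(1) at index 1, NA(2) at index 3

# The user decides that NA(1) propagates and NA(2) is ignored.
result_propagating = reduce_sum(values, kinds, propagate={1})

# With no propagating kinds, both NA(1) and NA(2) are skipped.
result_ignoring = reduce_sum(values, kinds)

# Only an NA(2) present: it is skipped even though NA(1) would propagate.
result_only_na2 = reduce_sum(values, np.array([0, 0, 0, 2]), propagate={1})
```

In this sketch the policy lives in the call rather than in the array, which is one possible reading of "the ufuncs could be told"; other designs are equally compatible with the message.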

Matthew Brett

7 Jul

12:01 a.m.

Hi,

On Wed, Jul 6, 2011 at 7:10 PM, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root <ben.root@ou.edu> wrote:

...
On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

...
On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...
Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reasons some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available)
    A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

Of course R is not an example of that. Nothing is. This is merely conceptual. Separate NA from np.NA in Mark's NEP, and you will see his point. Consider it the logical intersection of NA in Mark's NEP and the aNEP.

I am trying to work out what you feel the points of discussion are. There's surely no point in continuing to debate things we agree on.

I don't think anyone disputes (or has ever disputed) that:

There can be missing data implemented with bitpatterns
There can be missing data implemented with masks
Missing data can have propagate semantics
Missing data can have ignore semantics.
The implementation does not in itself constrain the semantics.

So, to be clear, is your concern that you want to be able to tell the difference between whether an np.NA comes from the bit pattern or the mask in its implementation? But why would you have both the parameterized dtype and the mask implementation at the same time? They implement the same abstraction.

In Mark's mind they implement the same abstraction.

In my mind, and Nathaniel's, and I think, Pierre's, and others, they are not the same abstraction.

You can treat them the same if you want, even by default, but they are two different ideas, with two different implementations.

A bitmask NA value is absolutely completely missing. It's a value that says 'missing'.

A masked-out value is temporarily or provisionally missing. When you take away the mask, the previous value is there.

These are two different things. They are each very easy to explain.

...

Is your desire that the np.NA's are implemented solely through bit patterns and np.IGNORE is implemented solely through masks? So that you can think of the masks as being IGNORE flags? What if you want multiple types of IGNORE? (To ignore certain values because they're outliers, others because the data wouldn't make sense, and others because you're just focusing on a particular subgroup, for instance.)

Forgive me, I have been at dinner and had several glasses of wine. So, what I'm about to say might be dumber than usual. With that rider:

I agree with Mark, we should avoid np.IGNORE because it conflates ignore semantics with the masking implementation.

The idea of several different missings seems to me orthogonal. There can be different missings with bitmasks and different missings with masks.

My fundamental point, that I accept I am not getting across with much success, is the following:

In general, as Dag has pointed out elsewhere, numpy is close to the metal - you can almost feel the C array underneath the python numpy object. This is its strength. It doesn't try and hide the C array from you, it gives you the whole machinery, open kimono.

I can see an open kimono way of dealing with missing values. There's the bitpattern way. If I do a[3] = np.NA, what I mean is 'store an NA in the array memory'. Exactly the same as when I do a[3] = 2, I mean 'store a 2 in the array memory'. It's obvious and transparent, easy to explain.

I can see an open kimono way of doing masking. I make a masked array. The masked array has a 'mask'. I can set the mask values to "True" or "False". I can get the array from underneath the mask. It's obvious and transparent, easy to explain.

I can see that you might want, for practical purposes, to treat these two 'missing' signals as being equivalent. I can even see that you might not expose machinery to distinguish between them. But, it seems ugly and confusing to me, and to others, to try and make the bitpattern and the masked missing value appear to be exactly the same.

If I do this:

a[3] = np.NA

I want an NA in a[3]. I don't want you to make it look as if there's an NA in a[3], I want there to be an NA in a[3]. I want to know what I did.

So, maybe I want to 'mask' a[3]. Well then I make a masked array, and then I do

a.mask[3] = False # or True.

It's obvious. It's explicit. It does what I want. I can feel the C array and the mask array underneath. I know what I did.

On the other hand, to try and conceal these implementation differences, seems to me to break my feeling for numpy arrays, and make me feel I have an object that is rather magic, that I don't fully understand, and for which clever stuff is going on, under the hood, that I worry about but have to trust.

I think this is not the numpy way. I think I fully understand why it's attractive, but I continue to think that it's a mistake, and one that may take some time to become clear. It will become clear only after a few years of trying to teach people, and noticing that when they get to this stuff, they start switching off, and getting a bit confused, and concluding it's all too hard for them.

I can see that we're starting to go round in circles again, and that writing when drunk is unlikely to help that, so at this point, I will drop out of the conversation and let y'all get on with it.

Thanks for the substantial question by the way, it was helpful,

Cheers,

Matthew
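The distinction drawn in this message, between a value stored in the array memory and a value provisionally hidden by a mask, can be tried out with the existing numpy.ma. This is a sketch under one loud assumption: np.NA did not exist in NumPy, so NaN is used here as a stand-in for "store an NA in the array memory".

```python
import numpy as np

# Masking: the value is hidden, but the data underneath is untouched.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0])
a[3] = np.ma.masked        # mask element 3
hidden_sum = a.sum()       # element 3 is skipped
underneath = a.data[3]     # the original value is still in the data array

a.mask = np.ma.nomask      # take the mask away again
restored_sum = a.sum()     # the previous value is back

# Bitpattern-style missing value (NaN as an assumed stand-in for np.NA):
# the marker is written into the array memory and replaces the old value.
b = np.array([1.0, 2.0, 3.0, 4.0])
b[3] = np.nan
overwritten = b[3]
```

The masked element can be recovered because the mask is stored alongside the data; the overwritten element cannot, because the marker is the data.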

Gary Strangman

1:09 a.m.

(snip discussion of open kimono)

...

On the other hand, to try and conceal these implementation differences, seems to me to break my feeling for numpy arrays, and make me feel I have an object that is rather magic, that I don't fully understand, and for which clever stuff is going on, under the hood, that I worry about but have to trust.

To weigh in as someone less tipsy, I totally agree with this concern. In fact, in trying to understand the proposal myself--and I use numpy R NAs all the time--it was difficult to understand, and I don't think I have fully gotten it yet. That makes it seem like magic, and magic makes me seriously nervous ... specifically, that I won't get what I intended, which will lead to nearly-impossible-to-find bugs.

...

I think this is not the numpy way. I think I fully understand why it's attractive, but I continue to think that it's a mistake, and one that may take some time to become clear. It will become clear only after a few years of trying to teach people, and noticing that when they get to this stuff, they start switching off, and getting a bit confused, and concluding it's all too hard for them.

Agreed. For ultra simplicity, I'd be perfectly happy with a np.NA element (bitpattern?) that I could use to represent points that will forevermore be missing, as well as a masking capability that allows multiple masking values (not just true/false) such as:

a.mask[3] = 0 # unmasked
a.mask[3] = 1 # masked "type 1" (eg, missing?)
a.mask[3] = 2 # masked "type 2" (eg, data from different source)
a.mask[3] = 3 # masked "type 3" (eg, ignore in complete-case analysis)
etc.

Regardless of whether a mask is boolean or more, though, the simplicity of explaining masking separate from NA cases is, I think, a huge win.

-best
Gary
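A multi-valued mask of the kind suggested here can be approximated today by keeping an integer code array next to the data and deriving a boolean numpy.ma mask from it on demand. The sketch below is illustrative only: the code constants, the `codes` array, and the helper view_excluding are hypothetical names, and numpy.ma itself only stores boolean masks.

```python
import numpy as np

# Hypothetical mask codes, following the numbering in the message above.
UNMASKED, MISSING, OTHER_SOURCE, IGNORE_COMPLETE_CASE = 0, 1, 2, 3

data = np.array([5.0, 6.0, 7.0, 8.0])
codes = np.array([UNMASKED, MISSING, OTHER_SOURCE, UNMASKED])


def view_excluding(values, mask_codes, exclude):
    """Return a masked array hiding every element whose code is in `exclude`."""
    hide = np.isin(mask_codes, list(exclude))
    return np.ma.masked_array(values, mask=hide)


# Hide every kind of masked value.
strict = view_excluding(data, codes, {MISSING, OTHER_SOURCE, IGNORE_COMPLETE_CASE})
strict_sum = strict.sum()

# Hide only the truly missing values; data from the other source is kept.
lenient = view_excluding(data, codes, {MISSING})
lenient_sum = lenient.sum()
```

Keeping the codes separate from the boolean mask means the same data can be viewed under different exclusion rules without modifying it.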

Mark Wiebe

6 Jul

6:36 p.m.

New subject: using the same vocabulary for missing value ideas

On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Wed, Jul 6, 2011 at 5:48 PM, Peter <numpy-discussion@maubp.freeserve.co.uk> wrote:

On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett <matthew.brett@gmail.com> wrote:

Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

I don't think that was what Mark was saying, see this bit later in this email:

I think it would make a difference if there was an implementation that had conflated masking with bitpatterns in terms of API. I don't think R is an example.

This reminds me of another confusion I've seen on the list. I'd like to suggest that we ban the word API by itself from the present discussion, and always specify Python API or C API for clarity's sake. Here are my suggested definitions for these two terms:

Python API All the interface mechanisms that are exposed to Python code for using missing values in NumPy. This API is designed to be Pythonic and fit into the way NumPy works as much as possible.

C API All the implementation mechanisms exposed for CPython extensions written in C that want to support NumPy missing values. This API is designed to be as natural as possible in C, and usually prioritizes flexibility and high performance.

Before we proceed to any discussion of what are good/bad choices, I really want to nail this down from just the definition perspective. I don't want arbitrary choices baked into the terms we use, because that implies already having made a design decision.

-Mark

...

...
...
...
The most important distinctions I'm trying to draw are: 1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

This point as I understood it is there is the semantics of the special values (not available vs ignore), and there is the implementation (bitpattern vs mask), and they are independent.

Yes. Although, we can see from the implementations that we have to hand that

a) bitpatterns -> propagation (NaN-like) semantics by default (R)
b) masks -> ignore semantics by default (masked arrays)

I don't think Mark accepts that there is any reason for this tendency of implementations to semantics, but Nathaniel was arguing otherwise in the alterNEP.

I think we all accept that it's possible to imagine masking having propagation semantics and bitpatterns having ignore semantics.
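The two default tendencies mentioned above can be seen with tools NumPy already ships. This is only a sketch: floating-point NaN is used as a stand-in for a bitpattern-style NA (R's NA is a distinct value, not plain NaN), and `numpy.ma` stands in for the mask approach.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 4.0])

# Bitpattern-style marker: NaN lives in the data and propagates by default.
nan_total = values.sum()

# Mask-style marker: numpy.ma skips masked elements in reductions by default.
masked = np.ma.masked_invalid(values)
masked_total = masked.sum()

# The defaults are not forced by the storage choice: np.nansum gives
# ignore semantics on top of the same bitpattern-style marker.
nansum_total = np.nansum(values)
```

Here `nan_total` is NaN, while `masked_total` and `nansum_total` both come out as the sum of the three known values.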

Cheers,

Matthew

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Mark Wiebe

6:48 p.m.


On Wed, Jul 6, 2011 at 11:38 AM, Matthew Brett <matthew.brett@gmail.com> wrote:

...

Hi,

On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

...
It appears to me that one of the biggest reason some of us have been talking past each other in the discussions is that different people have different definitions for the terms being used. Until this is thoroughly cleared up, I feel the design process is tilting at windmills. In the interests of clarity in our discussions, here is a starting point which is consistent with the NEP. These definitions have been added in a glossary within the NEP. If there are any ideas for amendments to these definitions that we can agree on, I will update the NEP with those amendments. Also, if I missed any important terms which need to be added, please propose definitions for them.

NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

Really? Can one implement NA with a mask in R? I thought an NA was always bitpattern in R?

...
IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.
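To make the bitpattern and mask definitions concrete, here is a small sketch of the two storage techniques for an integer array. It assumes the smallest int32 is reserved as the marker (the convention R uses for integer NA); the name `NA_INT32` is hypothetical and not a NumPy constant.

```python
import numpy as np

# Assumption: reserve the smallest int32 value as the marker bit pattern.
NA_INT32 = np.iinfo(np.int32).min

# Bitpattern technique: the marker is stored inside the data array itself.
data = np.array([10, NA_INT32, 30], dtype=np.int32)
is_marked_bitpattern = data == NA_INT32

# Mask technique: a parallel boolean array flags the same element,
# and the data array keeps whatever payload was stored there.
payload = np.array([10, 20, 30], dtype=np.int32)
is_marked_mask = np.array([False, True, False])
hidden_payload = payload[is_marked_mask]
```

Both variants flag the same element. Neither says anything yet about whether that element should behave as NA or as IGNORE in a computation.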

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

I agree that there has been some confusion due to the terms.

However, I continue to believe that the discussion is substantial and not due to confusion.

I believe this is true as well, but the confusion due to the terms appears to be one of the root causes preventing the ideas from getting across. Without first clearing up this aspect of the discussion, things will stay confusing.

...

Let us then characterize the substantial discussion as this:

NEP: bitpattern and masked out values should be made nearly impossible to distinguish in the API

alterNEP: bitpattern and masked out values should be distinct in the API so that it can be made clear which is meant (and therefore, implicitly, how they are implemented).

Do you agree that this is the discussion?

I'd like to get agreement on the definitions before moving to any of the points of contention that are being raised.

Thanks,
-Mark

...

See you,

Matthew

Pierre GM

5:41 p.m.


Ah, semantics...

On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote:

...

NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of perennation. The data is just not available, just as a NaN is just not a number.

...

IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

A data temporarily hidden by a mask becomes np.IGNORE.

...

bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

OK with that.

...

The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

OK with that.

...

2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

Indeed.

Mark Wiebe

6:56 p.m.


On Wed, Jul 6, 2011 at 12:41 PM, Pierre GM <pgmdevlist@gmail.com> wrote:

...

Ah, semantics...

On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote:

...
NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of perennation. The data is just not available, just as a NaN is just not a number.

Yes, this gets directly to what I've been meaning when I say NA vs IGNORE is independent of mask vs bitpattern. The way I'm trying to structure things, NA vs IGNORE only affects the semantic meaning, i.e. the outputs produced by computations. This is precisely why I put 'temporarily hidden with a mask' first, to make that more clear.
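The claim that NA vs IGNORE only changes the outputs of computations can be sketched with a single storage choice and two reduction rules. The helper names `sum_na` and `sum_ignore` are hypothetical, and NaN is used merely as a stand-in for an "unknown" result.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
missing = np.array([False, True, False])  # one storage choice: a boolean mask


def sum_na(values, mask):
    """NA semantics: any unknown input makes the result unknown."""
    return np.nan if mask.any() else values.sum()


def sum_ignore(values, mask):
    """IGNORE semantics: act as if the flagged elements were not there."""
    return values[~mask].sum()


na_result = sum_na(data, missing)          # unknown
ignore_result = sum_ignore(data, missing)  # 1.0 + 3.0
```

The mask is identical in both calls; only the rule applied by the computation differs.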

...

...
IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

A data temporarily hidden by a mask becomes np.IGNORE.

Are you willing to suspend the idea of that implication for the purposes of the present discussion? If not, do you see a way to amend things so that masked NAs and bitpattern-based IGNOREs make sense? Would renaming IGNORE to SKIP be more clear, perhaps?

Thanks,
Mark

...

...
bitpattern A technique for implementing either NA or IGNORE, where a particular set of bit patterns are chosen from all the possible bit patterns of the value's data type to signal that the element is NA or IGNORE.

mask A technique for implementing either NA or IGNORE, where a boolean or enum array parallel to the data array is used to signal which elements are NA or IGNORE.

numpy.ma The existing implementation of a particular form of masked arrays, which is part of the NumPy codebase.

OK with that.

...
The most important distinctions I'm trying to draw are:

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

OK with that.

...
2) The idea of masking and the numpy.ma implementation are different. The numpy.ma object makes particular choices about how to interpret the mask, but while backwards compatibility is important, a fresh evaluation of all the design choices going into a mask implementation is worthwhile.

Indeed.

Lluís

7 Jul

1:09 a.m.


Mark Wiebe writes:

...

On Wed, Jul 6, 2011 at 12:41 PM, Pierre GM <pgmdevlist@gmail.com> wrote:

Ah, semantics...

...

On Jul 6, 2011, at 5:40 PM, Mark Wiebe wrote:

...
NA (Not Available) A placeholder for a value which is unknown to computations. That value may be temporarily hidden with a mask, may have been lost due to hard drive corruption, or gone for any number of reasons. This is the same as NA in the R project.

...

I have a problem with 'temporarily hidden with a mask'. In my mind, the concept of NA carries a notion of perennation. The data is just not available, just as a NaN is just not a number.

...

Yes, this gets directly to what I've been meaning when I say NA vs IGNORE is independent of mask vs bitpattern. The way I'm trying to structure things, NA vs IGNORE only affects the semantic meaning, i.e. the outputs produced by computations. This is precisely why I put 'temporarily hidden with a mask' first, to make that more clear.

...

...
IGNORE (Skip/Ignore) A placeholder which should be treated by computations as if no value does or could exist there. For sums, this means act as if the value were zero, and for products, this means act as if the value were one. It's as if the array were compressed in some fashion to not include that element.

...

A data temporarily hidden by a mask becomes np.IGNORE.

...

Are you willing to suspend the idea of that implication for the purposes of the present discussion? If not, do you see a way to amend things so that masked NAs and bitpattern-based IGNOREs make sense? Would renaming IGNORE to SKIP be more clear, perhaps?

Yes, I was going to propose something similar. The NA/IGNORE distinction is about the propagation mechanism, and this is not as explicit in NA as it is in IGNORE. So maybe, and avoiding too much concept renaming:

NA (Propagate) ...

IGNORE (Skip) ...

Lluis

--
"And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth

Christopher Barker

6 Jul

6:25 p.m.

Mark Wiebe wrote:

...

1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
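The concern above, that a bitpattern marker overwrites the value it replaces, can be shown with a short sketch. NaN is used here only as a stand-in for a hypothetical IGNORE bit pattern, and `numpy.ma` as a stand-in for a mask-based implementation.

```python
import numpy as np

readings = np.array([3.0, 250.0, 4.0])

# Bitpattern-style marker: writing it replaces the stored value.
overwritten = readings.copy()
overwritten[1] = np.nan
still_has_value = not np.isnan(overwritten[1])  # the 250.0 is gone

# Mask-style marker: the payload stays in place behind the mask.
masked = np.ma.masked_array(readings.copy(), mask=[False, True, False])
recovered = masked.data[1]
```

With the bitpattern variant, getting 250.0 back would need a separate copy of the data; with the mask variant it is still stored in the array.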

Mark Wiebe

6:57 p.m.


On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:

...

Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

What do you think of renaming IGNORE to SKIP? -Mark

...

-Chris


Chris Barker

7 Jul

5:51 a.m.

On 7/6/11 11:57 AM, Mark Wiebe wrote:

...

On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker

...

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

What do you think of renaming IGNORE to SKIP?

This isn't a semantics issue -- IGNORE is fine. What I'm getting at is that we need a word (and code) for:

"ignore for now, but I might want to use it later"

- Chris

Eric Firing

6:46 a.m.

On 07/06/2011 07:51 PM, Chris Barker wrote:

...

On 7/6/11 11:57 AM, Mark Wiebe wrote:

...
On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker

...
Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

What do you think of renaming IGNORE to SKIP?

This isn't a semantics issue -- IGNORE is fine.

What I'm getting at is that we need a word (and code) for:

"ignore for now, but I might want to use it later"

HIDE? That implies there is still something there, potentially recoverable.

Eric

...

- Chris

Pierre GM

9:14 a.m.


On Jul 7, 2011, at 8:46 AM, Eric Firing wrote:

...

On 07/06/2011 07:51 PM, Chris Barker wrote:

...
On 7/6/11 11:57 AM, Mark Wiebe wrote:

...
On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker

...
Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

What do you think of renaming IGNORE to SKIP?

This isn't a semantics issue -- IGNORE is fine.

What I'm getting at is that we need a word (and code) for:

"ignore for now, but I might want to use it later"

HIDE? That implies there is still something there, potentially recoverable.

Eric

Dag Sverre Seljebotn

9:15 a.m.

On 07/07/2011 07:51 AM, Chris Barker wrote:

...

On 7/6/11 11:57 AM, Mark Wiebe wrote:

...
On Wed, Jul 6, 2011 at 1:25 PM, Christopher Barker

...
Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

What do you think of renaming IGNORE to SKIP?

This isn't a semantics issue -- IGNORE is fine.

What I'm getting at is that we need a word (and code) for:

"ignore for now, but I might want to use it later"

Wouldn't that be IGNORE+MASK?

There's (IGNORE, NA), and (MASK, BITPATTERN), with four combinations:

IGNORE+MASK: "ignore for now, but I might want to use it later"
NA+MASK: "treat as NA for now, but I might change my mind about that later" [1]
IGNORE+BITPATTERN: Simply insert a value in an array that is 0 for addition and 1 for multiplication.
NA+BITPATTERN: R's NA.

[1] Example on NA+MASK: Temporarily flag something as an invalid outlier to check what effect that has on final estimates. The statistical method one is using may do something different with NA data (beyond what IGNORE does), you may not know exactly what it does, just that the docs say "support NA's gracefully" and that you temporarily want to flag some outliers as such when calling that function.

Dag Sverre
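The "flag it for now, change my mind later" workflow can be tried with the existing `numpy.ma` module. Since `numpy.ma` reductions skip masked elements, this sketch corresponds to the IGNORE+MASK combination rather than NA+MASK; the sample values are made up for illustration.

```python
import numpy as np

samples = np.ma.masked_array([2.0, 3.0, 100.0, 4.0], mask=False)

mean_all = samples.mean()             # suspected outlier included

samples[2] = np.ma.masked             # flag the outlier for now
mean_without_outlier = samples.mean()

samples.mask[2] = False               # change of mind: the value is still stored
mean_restored = samples.mean()
```

Because the mask only hides the element, clearing the flag brings back the original estimate without reloading any data.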

Dag Sverre Seljebotn

6 Jul

7:09 p.m.

On 07/06/2011 08:25 PM, Christopher Barker wrote:

...

Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

There's the question of how reductions treat the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not.

Dag Sverre

Benjamin Root

10:03 p.m.


On Wednesday, July 6, 2011, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:

...

On 07/06/2011 08:25 PM, Christopher Barker wrote:

...
Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

There's the question of how reductions treats the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not.

Dag Sverre

Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1.

Ben Root
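The identity-element view can be checked against what NumPy exposes today. This sketch uses the `identity` attribute of ufuncs and `numpy.ma` reductions as a stand-in for IGNORE semantics; it is an illustration of the idea, not the NEP's implementation.

```python
import numpy as np

# NumPy ufuncs expose their identity element directly.
add_identity = np.add.identity            # 0
multiply_identity = np.multiply.identity  # 1

x = np.ma.masked_array([2.0, 5.0, 3.0], mask=[False, True, False])

# Skipping the masked element...
masked_sum = x.sum()
masked_prod = x.prod()

# ...matches filling it with the identity of the operation being reduced.
filled_sum = x.filled(add_identity).sum()
filled_prod = x.filled(multiply_identity).prod()
```

Filling with 0 would give the wrong product, and filling with 1 the wrong sum, which is why the fill value has to follow the operation.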

Christopher Jordan-Squire

10:21 p.m.


On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Wednesday, July 6, 2011, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:

...
On 07/06/2011 08:25 PM, Christopher Barker wrote:

...
Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

There's the question of how reductions treats the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not.

Dag Sverre

Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1.

Ben Root

Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations.

...


Benjamin Root

10:38 p.m.

On Wednesday, July 6, 2011, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...

On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root <ben.root@ou.edu> wrote:

On Wednesday, July 6, 2011, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:

...
On 07/06/2011 08:25 PM, Christopher Barker wrote:

...
Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

There's the question of how reductions treats the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not.

Dag Sverre

Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1.

Ben Root

Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations.

If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements.

As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand...

Ben Root
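The adjusted count described above is what `numpy.ma` already does for means and standard deviations. A short sketch, with made-up values, contrasts it with naively zero-filling the skipped element:

```python
import numpy as np

x = np.ma.masked_array([2.0, 4.0, 9.0, 6.0],
                       mask=[False, False, True, False])

unmasked_count = x.count()               # elements that take part: 3
skipped_mean = x.mean()                  # (2 + 4 + 6) / 3
zero_filled_mean = x.filled(0.0).mean()  # (2 + 4 + 0 + 6) / 4, a different answer
skipped_std = x.std()                    # spread of [2, 4, 6] only
```

Treating the skipped element as the additive identity is therefore only safe inside the sum itself; the divisor has to shrink along with it.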

Christopher Jordan-Squire

11:08 p.m.


On Wed, Jul 6, 2011 at 5:38 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Wednesday, July 6, 2011, Christopher Jordan-Squire <cjordan1@uw.edu> wrote:

...
On Wed, Jul 6, 2011 at 5:03 PM, Benjamin Root <ben.root@ou.edu> wrote:

On Wednesday, July 6, 2011, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:

...
On 07/06/2011 08:25 PM, Christopher Barker wrote:

...
Mark Wiebe wrote:

...
1) NA vs IGNORE and bitpattern vs mask are completely independent. Any combination of NA as bitpattern, NA as mask, IGNORE as bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you just lost the ability to get the original value back if you want to stop ignoring it? Maybe that's not inherent to what an IGNORE means, but it seems pretty key to me.

There's the question of how reductions treats the value. IIUC, IGNORE as bitpattern would imply that reductions treat the value as 0, which is a question orthogonal to whether the value can possibly be unmasked or not.

Dag Sverre

Just because we are trying to be exact here, the reductions would treat IGNORE as the operation's identity. Therefore, for addition, it would be treated like 0, but for multiplication, it is treated like a 1.

Ben Root

Yes. But, as discussed on another thread, that can lead to unexpected results when it's propagated through several operations.

If you are talking about means, for example, then the count is adjusted before dividing. It is like they never existed. Same with standard deviation. Of course, there are issues with having fewer samples, but that isn't a problem caused by the underlying concept of skipping elements.

As long as the underlying mathematical support for array math is still valid, I am not certain what the issue is. Matrix math on the other hand...

Ah, I see. I misunderstood the class of operations you were discussing.

-Chris Jordan-Squire

...

Ben Root

participants (12)

Benjamin Root
Chris Barker
Christopher Barker
Christopher Jordan-Squire
Dag Sverre Seljebotn
Eric Firing
Gary Strangman
Lluís
Mark Wiebe
Matthew Brett
Peter
Pierre GM
