Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote:
Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs wouldn't work anyway with integer arrays (and object arrays, etc.).
On 2011-06-24 13:59, Nathaniel Smith wrote:
Since R is designed for statistics, they made the interesting decision that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast an NA of one type into an NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel

Would the use of R's system for expressing "missing values" be possible in numpy through a special flag? Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values represent a missing cell. If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
L.
PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN.
On Fri, Jun 24, 2011 at 6:30 AM, Laurent Gautier
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
Interesting thought. Doing that could be handled by adding r-dtypes, as in r-float32, r-int, etc. However, adding so many dtypes with different behaviors could make for a messy implementation, whereas masks would be uniform across types. Chuck
On Fri, Jun 24, 2011 at 07:30, Laurent Gautier
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
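The reserved-NaN idea can be sketched in plain Python. This is a hypothetical illustration, not the proposed implementation: the payload 1954 is the value R happens to use for its NA_real_, but the exact bit pattern chosen here (a quiet NaN carrying that payload) is an assumption for the sketch.

```python
import math
import struct

# Sketch: reserve one specific quiet-NaN bit pattern to mean NA,
# distinct from the ordinary NaN produced by e.g. 0.0/0.0.
# The payload (1954, R's choice for NA_real_) is illustrative only.
NA_BITS = 0x7FF80000000007A2

def make_na():
    # Reinterpret the reserved 64-bit pattern as a double.
    return struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]

def is_na(x):
    # NaN payloads can't be distinguished by comparison operators,
    # so inspect the raw bits instead.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits == NA_BITS

na = make_na()
print(math.isnan(na))        # True: NA is stored as a NaN
print(is_na(na))             # True: its bit pattern marks it as NA
print(is_na(float("nan")))   # False: a plain NaN is not NA
```

An nafloat64 dtype's inner loops could use a bit test like `is_na` to keep NA distinct from computational NaNs, at the cost of sacrificing one NaN pattern.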
On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
I don't understand the numpy design and maintainability issues, but from a user perspective (mine) nafloat64, etc. sounds nice.
On Fri, Jun 24, 2011 at 09:24, Keith Goodman
I don't understand the numpy design and maintainability issues, but from a user perspective (mine) nafloat64, etc. sounds nice.
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason; e.g. where you're tempted to (ab)use NaNs for the purpose and the limitation on the range of values is acceptable. Not every dtype would have an NA-aware counterpart. I would suggest just nabool, nafloat64, naint32, nastring (a little tricky due to the flexible size, but doable), and naobject. Maybe a couple more, if we get requests, like naint64 and nacomplex128.
On Fri, Jun 24, 2011 at 09:35, Robert Kern
Oh, and nadatetime64 and natimedelta64.
On Fri, Jun 24, 2011 at 8:44 AM, Robert Kern
Oh, and nadatetime64 and natimedelta64.
Beat me to it ;) Chuck
On Jun 24, 2011, at 4:44 PM, Robert Kern wrote:
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason; e.g. where you're tempted to (ab)use NaNs for the purpose and the limitations on the range of values is acceptable. Not every dtype would have an NA-aware counterpart. I would suggest just nabool, nafloat64, naint32, nastring (a little tricky due to the flexible size, but doable), and naobject. Maybe a couple more, if we get requests, like naint64 and nacomplex128.
Oh, and nadatetime64 and natimedelta64.
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right? And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that. Now, how will masked values be represented? Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
On Fri, Jun 24, 2011 at 10:02, Pierre GM
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right?
Not quite; there are no separate mask arrays with this approach. It's more akin to using NaNs to represent missing values, except with more rigor: NA values won't be "accidentally" created by computations on non-NA values.
And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that. Now, how will masked values be represented?
For the float types, we use a particular NaN bit-pattern (we'll steal R's choice). For the int types, we use the most negative number. For strings, R uses 'NA', but I'd *like* to use something less likely to conflict with actual use. For the date/time types, we would reserve a value close to the NaT value. For objects, we would have a singleton created specifically for this purpose. bools, which are internally represented by a uint8, will use 2.
Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
I would suggest following R's lead and letting ((NA == NA) == True), unlike NaNs. Each NA-aware scalar type would have a class attribute giving its NA value:

  if a[0] == nafloat64.NA:
      ...

  good_values = (a != nafloat64.NA)

You could possibly make a general NA object with smart comparison methods that will inspect the dtype of the other object so you don't have to know the dtype in your code, but that's a little magic.
On Fri, Jun 24, 2011 at 8:30 AM, Robert Kern
I would suggest following R's lead and letting ((NA==NA) == True) unlike NaNs.
In R, NA and NaN do behave differently with respect to ==, but not the way you're saying:

  > NA == NA
  [1] NA
  > if (NA == NA) 1
  Error in if (NA == NA) 1 : missing value where TRUE/FALSE needed

This again is consistent with the semantics that NA represents some unknown concrete value -- depending on what the actual values are, NA == NA might or might not be true; we don't know. So it's NA as well.
-- Nathaniel
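These three-valued semantics can be mimicked in a few lines of Python. This is a toy sketch; the NA object here is a stand-in for illustration, not an actual numpy or R API.

```python
# Toy model of R's propagating NA: any comparison involving NA is
# itself NA, because NA stands for an unknown concrete value.
NA = object()  # stand-in sentinel

def r_equal(a, b):
    if a is NA or b is NA:
        return NA  # unknown == anything is unknown
    return a == b

print(r_equal(NA, NA) is NA)  # True: the result is NA, not a boolean
print(r_equal(1, 1))          # True: ordinary values compare normally
print(r_equal(1, 2))          # False
```

This is exactly why `if (NA == NA)` errors out in R: the condition never resolves to TRUE or FALSE.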
On Fri, Jun 24, 2011 at 10:02 AM, Pierre GM
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right? And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that.
My understanding is a little bit different: the na* discussion is about implementing a full or partial set of "shadow types" which are like their regular types, but have a signal value indicating they are "NA". There's another idea, to create a parameterized type mechanism with types like "NA[int32]", adding a missing-value flag to the int32 and growing its size by up to the dtype's alignment. Using the mask to implement the missing value semantics at the array level instead of the dtype level is my proposal; neither of the others involves separate masks.
Now, how will masked values be represented? Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
If there's a global np.NA singleton, `if a[0] is np.NA` would work equivalently. That's a strike against storing the dtype with the NA object. -Mark
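The singleton behaviour could look roughly like this. A sketch only: `np.NA` did not exist at the time of this discussion, and the class below is purely illustrative of why `is` checks work with a singleton.

```python
# Sketch of a module-level NA singleton: every construction returns
# the same object, so user code can test membership with `is`,
# independent of the array's dtype.
class _NAType:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

NA = _NAType()
print(_NAType() is NA)  # True: always the same object
print(NA)               # NA
```

A typed NA (one carrying a dtype) would break this property, which is the strike against it that Mark mentions.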
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi,

Just as a use case, if I do this:

a = np.zeros((big_number,), dtype=np.int32)
a[0] = np.NA

I think I'm right in saying that, with the array.mask implementation, my array memory usage will grow by big_number bytes, whereas with the np.naint32 implementation you'd get something like:

Error('This data type does not allow missing values')

Is that right?

See y'all,

Matthew
On Fri, Jun 24, 2011 at 1:04 PM, Matthew Brett
Hi,
Just as a use case, if I do this:
a = np.zeros((big_number,), dtype=np.int32)
a[0] = np.NA
I think I'm right in saying that, with the array.mask implementation, my array memory usage will grow by big_number bytes, whereas with the np.naint32 implementation you'd get something like:
Error('This data type does not allow missing values')
Is that right?
Not really, I much prefer having the operation of adding a mask always be very explicit. It should raise an exception along the lines of "Cannot assign the NA missing value to an array with no validity mask". -Mark
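For scale, the overhead being asked about is easy to estimate. The numbers below are illustrative, and the byte-per-element mask layout is an assumption about the proposal, not a settled design.

```python
# A separate validity mask adds one byte per element on top of the data.
big_number = 10**8                 # elements
data_bytes = big_number * 4        # int32 payload: ~400 MB
mask_bytes = big_number * 1        # hypothetical uint8 mask: ~100 MB
print(mask_bytes)                  # 100000000 extra bytes
print(mask_bytes / data_bytes)     # 0.25: 25% overhead for int32 data
```

The na* dtype approach pays no such per-element cost, but in exchange sacrifices one value from the type's range.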
Robert Kern wrote:
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason;
I think we have enough problems with ndarray vs numpy.ma -- this is introducing a third option? IMHO, and from the discussion, it seems this proposal should be a "uniter, not a divider". While masked arrays may still exist, it would be nice if they were an extension of the new built-in thingie, not yet another implementation.
Not every dtype would have an NA-aware counterpart.
One of the great things about numpy is the full range of data types. I think it would be surprising and frustrating not to have masked versions of them all. By the way, what might be the performance hit of a "new" dtype -- wouldn't we lose all sorts of opportunities for the compiler and hardware to optimize? I can only imagine an "if" statement with every single computation. But maybe that isn't any more of a hit than a separate mask.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Fri, Jun 24, 2011 at 11:25 AM, Christopher Barker
I think we have enough problems with ndarray vs numpy.ma -- this is introducing a third option? IMHO, and from the discussion, it seems this proposal should be a "uniter, not a divider".
While masked arrays may still exist, it would be nice if they were an extension of the new built-in thingie, not yet another implementation.
If someone wants to have the current numpy.ma semantics with regards to NaN's becoming masked automatically, or some other behavior not in my proposal, I think a subclass with tweaks would be good, yes.
Not every dtype would have an NA-aware counterpart.
One of the great things about numpy is the full range of data types. I think it would be surprising and frustrating not to have masked versions of them all.

By the way, what might be the performance hit of a "new" dtype -- wouldn't we lose all sorts of opportunities for the compiler and hardware to optimize? I can only imagine an "if" statement with every single computation. But maybe that isn't any more of a hit than a separate mask.
There is also a potential reliability issue with the na* dtype approach. Every single operation of every dtype has to get the NA logic correct. While there may be ways to unify the code in various ways, ensuring robustness of that is more difficult than for a system which has a single implementation of the NA logic applied to masks. -Mark
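The reliability argument can be illustrated with a toy element-wise wrapper: with a mask, the NA rule lives in one place, regardless of dtype. This is a pure-Python sketch of the idea, not the actual numpy machinery.

```python
import operator

# One generic wrapper applies the validity rule to any element-wise
# operation, instead of every dtype's inner loop reimplementing it.
def masked_binary_op(op, a, b, valid_a, valid_b):
    out, valid = [], []
    for x, y, va, vb in zip(a, b, valid_a, valid_b):
        if va and vb:
            out.append(op(x, y))
            valid.append(True)
        else:
            out.append(0)       # payload is irrelevant when invalid
            valid.append(False)
    return out, valid

out, valid = masked_binary_op(operator.add,
                              [1, 2, 3], [10, 20, 30],
                              [True, False, True], [True, True, True])
print(out, valid)   # [11, 0, 33] [True, False, True]
```

Under the na* dtype approach, by contrast, every inner loop of every NA-aware dtype would carry its own sentinel checks, and each is a separate opportunity for a bug.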
On 06/24/2011 09:06 AM, Robert Kern wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
There is a very important distinction here between a masked value and a missing value. In some sense, a missing value is a permanently masked value and may be indistinguishable from 'not a number'. But a masked value is not a missing value, because with masked arrays the original values still exist. Consequently, using a masked value for 'missing-ness' can be reversed by the user changing the mask at any time. That is really the power of masked arrays: you can have real missing values but also 'flag' unusual values as missing! Virtually all software packages handle missing values, not masked values. So it is really, really important that you clarify what you are proposing, because your proposal does mix these two different concepts.

As per the missing value discussion, I would think that adding missing value data type(s) would be feasible and may be something that numpy should have. But that would not address 'masked values', which should probably be viewed as an independent topic and thread.

Below are some sources for missing values in R and SAS. SAS has 28 ways that a user can define numerical values as 'missing values' -- not just the dot! While not apparently universal, SAS has missing value codes to handle positive and negative infinity. R does distinguish between missing values and 'not a number' which, to my knowledge, SAS does not do. This distinction is probably important for masked vs missing values. SAS uses a blank for missing character values, but see the two links at the end for more on that.

This is for R:
http://faculty.nps.edu/sebuttre/home/S/missings.html
http://www.ats.ucla.edu/stat/r/faq/missing.htm
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer....

This page is a comparison to R:
http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer....

Some other SAS sources:
Malachy J. Foley, "MISSING VALUES: Everything You Ever Wanted to Know" http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF
072-2011: Special Missing Values for Character Fields - SAS support.sas.com/resources/papers/proceedings11/072-2011.pdf

Bruce
On Fri, Jun 24, 2011 at 9:27 AM, Bruce Southey
There is a very important distinction here between a masked value and a missing value. In some sense, a missing value is a permanently masked value and may be indistinguishable from 'not a number'. But a masked value is not a missing value, because with masked arrays the original values still exist. Consequently, using a masked value for 'missing-ness' can be reversed by the user changing the mask at any time. That is really the power of masked arrays: you can have real missing values but also 'flag' unusual values as missing!
In the design I'm proposing, it's using a mask to implement missing values, hence the usage of the terms "masked" and "unmasked" elements. The semantics you're describing can be achieved with the missing value interpretation. First, you take a view of your array, then give it a mask. In the view, there will be strict missing data semantics, but the data is still accessible through the original array. When people come to NumPy asking about missing values, they generally get pointed at numpy.ma, so there is an impression out there that it's intended for that usage.

Virtually all software packages handle missing values, not masked values. So it is really, really important that you clarify what you are proposing, because your proposal does mix these two different concepts.
As per the missing value discussion, I would think that adding missing value data type(s) would be feasible and may be something that numpy should have. But that would not address 'masked values', which should probably be viewed as an independent topic and thread.
Can you describe what needed features are missing from taking a view + adding a mask, that 'masked values' as a separate concept would have? While different, I think a single implementation with strict semantics can provide both perspectives when used like this or in a similar fashion.
Below are some sources for missing values in R and SAS. SAS has 28 ways that a user can define numerical values as 'missing values' - not just the dot! While apparently not universal, SAS has missing value codes to handle positive and negative infinity. R does distinguish between missing values and 'not a number', which, to my knowledge, SAS does not.
The question this raises in my mind is whether an "NA"-like object in NumPy should have a type associated with it, or whether it should be a singleton. Like np.NA('i8') for a missing 64-bit int, or np.NA like None but with the specific missing value semantics. Keeping the type around would allow for checking against casting rules.
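A rough sketch of what the typed variant could look like. The `NAType` class and its behavior are entirely hypothetical here, invented to make the design question concrete, and not an existing NumPy API:

```python
import numpy as np

class NAType:
    """Hypothetical NA object: a singleton that can optionally carry a dtype."""

    def __init__(self, dtype=None):
        self.dtype = np.dtype(dtype) if dtype is not None else None

    def __call__(self, dtype):
        # np.NA('i8') style: produce a typed NA for casting-rule checks.
        return NAType(dtype)

    def __repr__(self):
        return "NA" if self.dtype is None else f"NA({self.dtype!r})"

NA = NAType()
print(NA)         # untyped singleton, analogous to None
print(NA('i8'))   # typed NA carrying int64
```

The singleton form is simpler to use; the typed form lets assignment into an array validate against the dtype's casting rules, which is the trade-off raised above.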
This distinction is probably important for masked vs missing values. SAS uses a blank for missing character values, but see the two links at the end for more than that.
This is for R: http://faculty.nps.edu/sebuttre/home/S/missings.html http://www.ats.ucla.edu/stat/r/faq/missing.htm
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.... This page is a comparison to R:
http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer....
Some other SAS sources: Malachy J. Foley, "MISSING VALUES: Everything You Ever Wanted to Know" http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF 072-2011: Special Missing Values for Character Fields - SAS support.sas.com/resources/papers/proceedings11/072-2011.pdf
Thanks for the links! -Mark
Bruce
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
On Fri, Jun 24, 2011 at 07:30, Laurent Gautier
wrote: On 2011-06-24 13:59, Nathaniel Smith
wrote: Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). Since R is designed for statistics, they made the interesting decision
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root
wrote: that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast a NA of one type into NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration. Chuck
On Fri, Jun 24, 2011 at 09:33, Charles R Harris
On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that.... -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Hi,
On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern
On Fri, Jun 24, 2011 at 09:33, Charles R Harris
wrote: On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote: The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that....
It would certainly help me at least if someone (Mark? sorry to ask...) could set out the implementation and API differences that would result from the two options: 1) array.mask option - an integer array of shape array.shape giving mask (True, False) values for each element 2) nafloat64 option - dtypes with specified dtype-specific missing values Best, Matthew
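For concreteness, a rough sketch (using today's NumPy, not the proposed API) of what each option implies at the data level:

```python
import numpy as np

data = np.array([1.5, 2.5, 3.5])

# Option 1: a separate boolean mask array alongside the data,
# one extra byte per element; the masked value 2.5 is preserved.
mask = np.array([False, True, False])
valid_mean = data[~mask].mean()

# Option 2: a sentinel inside the dtype, no extra storage; here a
# plain NaN stands in for the reserved "nafloat64" bit pattern.
data2 = np.array([1.5, np.nan, 3.5])
sentinel_mean = np.nanmean(data2)

print(valid_mean, sentinel_mean)   # both ignore the middle element
```

Option 1 costs memory and a second array to keep in sync but works for every dtype; option 2 is transparent in storage but needs a reserved value per dtype and destroys the original element.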
On Fri, Jun 24, 2011 at 10:07 AM, Matthew Brett
Hi,
On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern
wrote: On Fri, Jun 24, 2011 at 09:33, Charles R Harris
wrote: On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that....
It would certainly help me at least if someone (Mark? sorry to ask...) could set out the implementation and API differences that would result from the two options:
1) array.mask option - an integer array of shape array.shape giving mask (True, False) values for each element 2) nafloat64 option - dtypes with specified dtype-specific missing values
That's something that should go in the NEP, I'll email when I update it. -Mark
Best,
Matthew
On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
For floats, this is easy, because NaN's are already built in. For integers, I worry a bit, because we'd have to break the usual two's complement arithmetic. int32 is closed under addition/multiplication/bitops. But for naint32, what's INT_MAX + 1? (In R, the answer is that *all* integer overflows are tested for and become NA, whether they would happen to land on INT_MIN or not, and AFAICT there are no bitops for integers.) For strings in the numpy context, just adding another byte to hold the NA-ness flag seems more sensible than stealing some random string. In both cases, the more generic maybe() dtype I suggested might be cleaner. -- Nathaniel
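To make the integer problem concrete, here is a toy scalar sketch in Python of R-style NA arithmetic for int32. The `NA` marker and `naint32_add` are invented for illustration; this is not how a real dtype would be implemented:

```python
import numpy as np

INT32_MIN = -2**31       # R's reserved NA sentinel for integers
NA = object()            # hypothetical NA marker, for illustration only

def naint32_add(a, b):
    """Toy scalar addition with R-style NA semantics for int32."""
    if a is NA or b is NA:
        return NA        # NA propagates, like NaN
    result = a + b
    # As in R, *any* overflow out of the int32 range becomes NA, not
    # just results that happen to land on INT32_MIN; and a legitimate
    # result equal to the sentinel must also become NA.
    if not (-2**31 <= result <= 2**31 - 1) or result == INT32_MIN:
        return NA
    return result

print(naint32_add(3, 4))            # ordinary arithmetic
print(naint32_add(2**31 - 1, 1))    # overflow -> NA
```

Note the cost: every integer operation now carries a range check, which is exactly the break with plain two's complement arithmetic described above.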
On Fri, Jun 24, 2011 at 7:30 AM, Laurent Gautier
On 2011-06-24 13:59, Nathaniel Smith
wrote: Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). Since R is designed for statistics, they made the interesting decision
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root
wrote: that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast a NA of one type into NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
I think that's something R2Py would have to handle in its compatibility layer. I'd like to first make the system within NumPy work well for NumPy; interoperability at the low ABI level like this is a bit too restricting, I think. -Mark Any given numpy array could have a boolean flag (say "na_aware")
indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
L.
PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN.
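That PS matches IEEE 754 semantics, which NumPy follows as well; a quick check (suppressing the divide/invalid warnings):

```python
import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    pos = np.float64(1.0) / 0.0    # +Inf
    neg = np.float64(-1.0) / 0.0   # -Inf
    und = np.float64(0.0) / 0.0    # NaN
print(pos, neg, und)
```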
participants (10)
- Bruce Southey
- Charles R Harris
- Christopher Barker
- Keith Goodman
- Laurent Gautier
- Mark Wiebe
- Matthew Brett
- Nathaniel Smith
- Pierre GM
- Robert Kern