Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root wrote:
Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs wouldn't work anyway with integer arrays (and object arrays, etc.).
On 2011-06-24 13:59, Nathaniel Smith wrote:
Since R is designed for statistics, they made the interesting decision that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast an NA of one type into an NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel

Would the use of R's system for expressing "missing values" be possible in numpy through a special flag? Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values represent a missing cell. If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
L.
PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN.
On Fri, Jun 24, 2011 at 6:30 AM, Laurent Gautier
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
Interesting thought. Doing that could be handled by adding r-dtypes, as in r-float32, r-int, etc. However, adding so many dtypes with different behaviors could make for a messy implementation, whereas masks would be uniform across types. Chuck
On Fri, Jun 24, 2011 at 07:30, Laurent Gautier
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
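The reserved-NaN idea can be sketched in plain Python. This is a hypothetical illustration, not the proposed implementation: the payload 1954 is the value R happens to use for its NA_real_, but the exact bit pattern chosen here (a quiet NaN carrying that payload) is an assumption for the sketch.

```python
import math
import struct

# Sketch: reserve one specific quiet-NaN bit pattern to mean NA,
# distinct from the ordinary NaN produced by e.g. 0.0/0.0.
# The payload (1954, R's choice for NA_real_) is illustrative only.
NA_BITS = 0x7FF80000000007A2

def make_na():
    # Reinterpret the reserved 64-bit pattern as a double.
    return struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]

def is_na(x):
    # NaN payloads can't be distinguished by comparison operators,
    # so inspect the raw bits instead.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits == NA_BITS

na = make_na()
print(math.isnan(na))        # True: NA is stored as a NaN
print(is_na(na))             # True: its bit pattern marks it as NA
print(is_na(float("nan")))   # False: a plain NaN is not NA
```

An nafloat64 dtype's inner loops could use a bit test like `is_na` to keep NA distinct from computational NaNs, at the cost of sacrificing one NaN pattern.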
On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
I don't understand the numpy design and maintainability issues, but from a user perspective (mine) nafloat64, etc. sounds nice.
On Fri, Jun 24, 2011 at 09:24, Keith Goodman
I don't understand the numpy design and maintainability issues, but from a user perspective (mine) nafloat64, etc. sounds nice.
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason; e.g. where you're tempted to (ab)use NaNs for the purpose and the limitation on the range of values is acceptable. Not every dtype would have an NA-aware counterpart. I would suggest just nabool, nafloat64, naint32, nastring (a little tricky due to the flexible size, but doable), and naobject. Maybe a couple more, if we get requests, like naint64 and nacomplex128.
On Fri, Jun 24, 2011 at 09:35, Robert Kern
Oh, and nadatetime64 and natimedelta64.
On Fri, Jun 24, 2011 at 8:44 AM, Robert Kern
Oh, and nadatetime64 and natimedelta64.
Beat me to it ;) Chuck
On Jun 24, 2011, at 4:44 PM, Robert Kern wrote:
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason; e.g. where you're tempted to (ab)use NaNs for the purpose and the limitations on the range of values is acceptable. Not every dtype would have an NA-aware counterpart. I would suggest just nabool, nafloat64, naint32, nastring (a little tricky due to the flexible size, but doable), and naobject. Maybe a couple more, if we get requests, like naint64 and nacomplex128.
Oh, and nadatetime64 and natimedelta64.
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right? And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that. Now, how will masked values be represented? Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
On Fri, Jun 24, 2011 at 10:02, Pierre GM
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right?
Not quite; there are no separate mask arrays with this approach. It's more akin to using NaNs to represent missing values, except with more rigor: NA values won't be "accidentally" created by computations on non-NA values.
And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that. Now, how will masked values be represented?
For the float types, we use a particular NaN bit-pattern (we'll steal R's choice). For the int types, we use the most negative number. For strings, R uses 'NA', but I'd *like* to use something less likely to conflict with actual use. For the date/time types, we would reserve a value close to the NaT value. For objects, we would have a singleton created specifically for this purpose. bools, which are internally represented by a uint8, will use 2.
Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
I would suggest following R's lead and letting ((NA == NA) == True), unlike NaNs. Each NA-aware scalar type would have a class attribute giving its NA value:

  if a[0] == nafloat64.NA:
      ...

  good_values = (a != nafloat64.NA)

You could possibly make a general NA object with smart comparison methods that will inspect the dtype of the other object so you don't have to know the dtype in your code, but that's a little magic.
On Fri, Jun 24, 2011 at 8:30 AM, Robert Kern
I would suggest following R's lead and letting ((NA==NA) == True) unlike NaNs.
In R, NA and NaN do behave differently with respect to ==, but not the way you're saying:

  > NA == NA
  [1] NA
  > if (NA == NA) 1
  Error in if (NA == NA) 1 : missing value where TRUE/FALSE needed

This again is consistent with the semantics that NA represents some unknown concrete value -- depending on what the actual values are, NA == NA might or might not be true; we don't know. So it's NA as well.
-- Nathaniel
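These three-valued semantics can be mimicked in a few lines of Python. This is a toy sketch; the NA object here is a stand-in for illustration, not an actual numpy or R API.

```python
# Toy model of R's propagating NA: any comparison involving NA is
# itself NA, because NA stands for an unknown concrete value.
NA = object()  # stand-in sentinel

def r_equal(a, b):
    if a is NA or b is NA:
        return NA  # unknown == anything is unknown
    return a == b

print(r_equal(NA, NA) is NA)  # True: the result is NA, not a boolean
print(r_equal(1, 1))          # True: ordinary values compare normally
print(r_equal(1, 2))          # False
```

This is exactly why `if (NA == NA)` errors out in R: the condition never resolves to TRUE or FALSE.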
On Fri, Jun 24, 2011 at 10:02 AM, Pierre GM
So, if I understand correctly: if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right? And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that.
My understanding is a little bit different: the na* discussion is about implementing a full or partial set of "shadow types" which are like their regular types, but have a signal value indicating they are "NA". There's another idea, to create a parameterized type mechanism with types like "NA[int32]", adding a missing-value flag to the int32 and growing its size by up to the dtype's alignment. Using the mask to implement the missing value semantics at the array level instead of the dtype level is my proposal; neither of the others involves separate masks.
Now, how will masked values be represented? Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?
If there's a global np.NA singleton, `if a[0] is np.NA` would work equivalently. That's a strike against storing the dtype with the NA object. -Mark
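The singleton behaviour could look roughly like this. A sketch only: `np.NA` did not exist at the time of this discussion, and the class below is purely illustrative of why `is` checks work with a singleton.

```python
# Sketch of a module-level NA singleton: every construction returns
# the same object, so user code can test membership with `is`,
# independent of the array's dtype.
class _NAType:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

NA = _NAType()
print(_NAType() is NA)  # True: always the same object
print(NA)               # NA
```

A typed NA (one carrying a dtype) would break this property, which is the strike against it that Mark mentions.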
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi,

Just as a use case, if I do this:

a = np.zeros((big_number,), dtype=np.int32)
a[0] = np.NA

I think I'm right in saying that, with the array.mask implementation, my array memory usage will grow by big_number bytes, whereas with the np.naint32 implementation you'd get something like:

Error('This data type does not allow missing values')

Is that right?

See y'all,

Matthew
On Fri, Jun 24, 2011 at 1:04 PM, Matthew Brett
Hi,
Just as a use case, if I do this:
a = np.zeros((big_number,), dtype=np.int32)
a[0] = np.NA
I think I'm right in saying that, with the array.mask implementation, my array memory usage will grow by big_number bytes, whereas with the np.naint32 implementation you'd get something like:
Error('This data type does not allow missing values')
Is that right?
Not really, I much prefer having the operation of adding a mask always be very explicit. It should raise an exception along the lines of "Cannot assign the NA missing value to an array with no validity mask". -Mark
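For scale, the overhead being asked about is easy to estimate. The numbers below are illustrative, and the byte-per-element mask layout is an assumption about the proposal, not a settled design.

```python
# A separate validity mask adds one byte per element on top of the data.
big_number = 10**8                 # elements
data_bytes = big_number * 4        # int32 payload: ~400 MB
mask_bytes = big_number * 1        # hypothetical uint8 mask: ~100 MB
print(mask_bytes)                  # 100000000 extra bytes
print(mask_bytes / data_bytes)     # 0.25: 25% overhead for int32 data
```

The na* dtype approach pays no such per-element cost, but in exchange sacrifices one value from the type's range.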
Robert Kern wrote:
It's worth noting that this is not a replacement for masked arrays, nor is it intended to be the be-all, end-all solution to missing data problems. It's mostly just intended to be a focused tool to fill in the gaps where masked arrays are less convenient for whatever reason;
I think we have enough problems with ndarray vs numpy.ma -- this is introducing a third option? IMHO, and from the discussion, it seems this proposal should be a "uniter, not a divider". While masked arrays may still exist, it would be nice if they were an extension of the new built-in thingie, not yet another implementation.
Not every dtype would have an NA-aware counterpart.
One of the great things about numpy is the full range of data types. I think it would be surprising and frustrating not to have masked versions of them all. By the way, what might be the performance hit of a "new" dtype -- wouldn't we lose all sorts of opportunities for the compiler and hardware to optimize? I can only imagine an "if" statement with every single computation. But maybe that isn't any more of a hit than a separate mask.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Fri, Jun 24, 2011 at 11:25 AM, Christopher Barker
I think we have enough problems with ndarray vs numpy.ma -- this is introducing a third option? IMHO, and from the discussion, it seems this proposal should be a "uniter, not a divider".
While masked arrays may still exist, it would be nice if they were an extension of the new built-in thingie, not yet another implementation.
If someone wants to have the current numpy.ma semantics with regards to NaN's becoming masked automatically, or some other behavior not in my proposal, I think a subclass with tweaks would be good, yes.
Not every dtype would have an NA-aware counterpart.
One of the great things about numpy is the full range of data types. I think it would be surprising and frustrating not to have masked versions of them all.

By the way, what might be the performance hit of a "new" dtype -- wouldn't we lose all sorts of opportunities for the compiler and hardware to optimize? I can only imagine an "if" statement with every single computation. But maybe that isn't any more of a hit than a separate mask.
There is also a potential reliability issue with the na* dtype approach. Every single operation of every dtype has to get the NA logic correct. While there may be ways to unify the code in various ways, ensuring robustness of that is more difficult than for a system which has a single implementation of the NA logic applied to masks. -Mark
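The reliability argument can be illustrated with a toy element-wise wrapper: with a mask, the NA rule lives in one place, regardless of dtype. This is a pure-Python sketch of the idea, not the actual numpy machinery.

```python
import operator

# One generic wrapper applies the validity rule to any element-wise
# operation, instead of every dtype's inner loop reimplementing it.
def masked_binary_op(op, a, b, valid_a, valid_b):
    out, valid = [], []
    for x, y, va, vb in zip(a, b, valid_a, valid_b):
        if va and vb:
            out.append(op(x, y))
            valid.append(True)
        else:
            out.append(0)       # payload is irrelevant when invalid
            valid.append(False)
    return out, valid

out, valid = masked_binary_op(operator.add,
                              [1, 2, 3], [10, 20, 30],
                              [True, False, True], [True, True, True])
print(out, valid)   # [11, 0, 33] [True, False, True]
```

Under the na* dtype approach, by contrast, every inner loop of every NA-aware dtype would carry its own sentinel checks, and each is a separate opportunity for a bug.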
On 06/24/2011 09:06 AM, Robert Kern wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
There is a very important distinction here between a masked value and a missing value. In some sense, a missing value is a permanently masked value and may be indistinguishable from 'not a number'. But a masked value is not a missing value, because with masked arrays the original values still exist. Consequently, using a masked value for 'missing-ness' can be reversed by the user changing the mask at any time. That is really the power of masked arrays: you can have real missing values but also 'flag' unusual values as missing! Virtually all software packages handle missing values, not masked values. So it is really, really important that you clarify what you are proposing, because your proposal does mix these two different concepts.

As per the missing value discussion, I would think that adding missing value data type(s) would be feasible and may be something that numpy should have. But that would not address 'masked values', which should probably be viewed as an independent topic and thread.

Below are some sources for missing values in R and SAS. SAS has 28 ways that a user can define numerical values as 'missing values' -- not just the dot! While not apparently universal, SAS has missing value codes to handle positive and negative infinity. R does distinguish between missing values and 'not a number' which, to my knowledge, SAS does not do. This distinction is probably important for masked vs missing values. SAS uses a blank for missing character values, but see the two links at the end for more on that.

This is for R:
http://faculty.nps.edu/sebuttre/home/S/missings.html
http://www.ats.ucla.edu/stat/r/faq/missing.htm
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer....

This page is a comparison to R:
http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer....

Some other SAS sources:
Malachy J. Foley, "MISSING VALUES: Everything You Ever Wanted to Know" http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF
072-2011: Special Missing Values for Character Fields - SAS support.sas.com/resources/papers/proceedings11/072-2011.pdf

Bruce
On Fri, Jun 24, 2011 at 9:27 AM, Bruce Southey
There is a very important distinction here between a masked value and a missing value. In some sense, a missing value is a permanently masked value and may be indistinguishable from 'not a number'. But a masked value is not a missing value, because with masked arrays the original values still exist. Consequently, using a masked value for 'missing-ness' can be reversed by the user changing the mask at any time. That is really the power of masked arrays: you can have real missing values but also 'flag' unusual values as missing!
In the design I'm proposing, it's using a mask to implement missing values, hence the usage of the terms "masked" and "unmasked" elements. The semantics you're describing can be achieved with the missing value interpretation. First, you take a view of your array, then give it a mask. In the view, there will be strict missing data semantics, but the data is still accessible through the original array. When people come to NumPy asking about missing values, they generally get pointed at numpy.ma, so there is an impression out there that it's intended for that usage.

Virtually all software packages handle missing values, not masked values. So it is really, really important that you clarify what you are proposing, because your proposal does mix these two different concepts.
As per the missing value discussion, I would think that adding missing value data type(s) would be feasible and may be something that numpy should have. But that would not address 'masked values', which should probably be viewed as an independent topic and thread.
Can you describe what needed features are missing from taking a view + adding a mask, that 'masked values' as a separate concept would have? While different, I think a single implementation with strict semantics can provide both perspectives when used like this or in a similar fashion.
Below are some sources for missing values in R and SAS. SAS has 28 ways that a user can define numerical values as 'missing values' - not just the dot! While apparently not universal, SAS has missing value codes to handle positive and negative infinity. R does distinguish between missing values and 'not a number', which, to my knowledge, SAS does not.
The question this raises in my mind is whether an "NA"-like object in NumPy should have a type associated with it, or whether it should be a singleton. Like np.NA('i8') for a missing 64-bit int, or np.NA like None but with the specific missing value semantics. Keeping the type around would allow for checking against casting rules.
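A rough sketch of what the typed variant could look like. The `NAType` class and its behavior are entirely hypothetical here, invented to make the design question concrete, and not an existing NumPy API:

```python
import numpy as np

class NAType:
    """Hypothetical NA object: a singleton that can optionally carry a dtype."""

    def __init__(self, dtype=None):
        self.dtype = np.dtype(dtype) if dtype is not None else None

    def __call__(self, dtype):
        # np.NA('i8') style: produce a typed NA for casting-rule checks.
        return NAType(dtype)

    def __repr__(self):
        return "NA" if self.dtype is None else f"NA({self.dtype!r})"

NA = NAType()
print(NA)         # untyped singleton, analogous to None
print(NA('i8'))   # typed NA carrying int64
```

The singleton form is simpler to use; the typed form lets assignment into an array validate against the dtype's casting rules, which is the trade-off raised above.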
This distinction is probably important for masked vs missing values. SAS uses a blank for missing character values, but see the two links at the end for more than that.
This is for R: http://faculty.nps.edu/sebuttre/home/S/missings.html http://www.ats.ucla.edu/stat/r/faq/missing.htm
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.... This page is a comparison to R:
http://support.sas.com/documentation/cdl/en/imlug/63541/HTML/default/viewer....
Some other SAS sources: Malachy J. Foley, "MISSING VALUES: Everything You Ever Wanted to Know" http://analytics.ncsu.edu/sesug/2005/TU06_05.PDF 072-2011: Special Missing Values for Character Fields - SAS support.sas.com/resources/papers/proceedings11/072-2011.pdf
Thanks for the links! -Mark
Bruce
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
On Fri, Jun 24, 2011 at 07:30, Laurent Gautier
wrote: On 2011-06-24 13:59, Nathaniel Smith
wrote: Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). Since R is designed for statistics, they made the interesting decision
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root
wrote: that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast a NA of one type into NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
Any given numpy array could have a boolean flag (say "na_aware") indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration. Chuck
On Fri, Jun 24, 2011 at 09:33, Charles R Harris
On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that.... -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Hi,
On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern
On Fri, Jun 24, 2011 at 09:33, Charles R Harris
wrote: On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote: The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that....
It would certainly help me at least if someone (Mark? sorry to ask...) could set out the implementation and API differences that would result from the two options: 1) array.mask option - an integer array of shape array.shape giving mask (True, False) values for each element 2) nafloat64 option - dtypes with specified dtype-specific missing values Best, Matthew
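For concreteness, a rough sketch (using today's NumPy, not the proposed API) of what each option implies at the data level:

```python
import numpy as np

data = np.array([1.5, 2.5, 3.5])

# Option 1: a separate boolean mask array alongside the data,
# one extra byte per element; the masked value 2.5 is preserved.
mask = np.array([False, True, False])
valid_mean = data[~mask].mean()

# Option 2: a sentinel inside the dtype, no extra storage; here a
# plain NaN stands in for the reserved "nafloat64" bit pattern.
data2 = np.array([1.5, np.nan, 3.5])
sentinel_mean = np.nanmean(data2)

print(valid_mean, sentinel_mean)   # both ignore the middle element
```

Option 1 costs memory and a second array to keep in sync but works for every dtype; option 2 is transparent in storage but needs a reserved value per dtype and destroys the original element.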
On Fri, Jun 24, 2011 at 10:07 AM, Matthew Brett
Hi,
On Fri, Jun 24, 2011 at 3:43 PM, Robert Kern
wrote: On Fri, Jun 24, 2011 at 09:33, Charles R Harris
wrote: On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern
wrote:
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
Definitely better names than r-int32. Going this way has the advantage of reducing the friction between R and numpy, and since R has pretty much become the standard software for statistics that is an important consideration.
I would definitely steal their choices of NA value for naint32 and nafloat64. I have reservations about their string NA value (i.e. 'NA') as anyone doing business in North America and other continents may have issues with that....
It would certainly help me at least if someone (Mark? sorry to ask...) could set out the implementation and API differences that would result from the two options:
1) array.mask option - an integer array of shape array.shape giving mask (True, False) values for each element 2) nafloat64 option - dtypes with specified dtype-specific missing values
That's something that should go in the NEP, I'll email when I update it. -Mark
Best,
Matthew
On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern
The alternative proposal would be to add a few new dtypes that are NA-aware. E.g. an nafloat64 would reserve a particular NaN value (there are lots of different NaN bit patterns, we'd just reserve one) that would represent NA. An naint32 would probably reserve the most negative int32 value (like R does). Using the NA-aware dtypes signals that you are using NA values; there is no need for an additional flag.
For floats, this is easy, because NaN's are already built in. For integers, I worry a bit, because we'd have to break the usual two's complement arithmetic. int32 is closed under addition/multiplication/bitops. But for naint32, what's INT_MAX + 1? (In R, the answer is that *all* integer overflows are tested for and become NA, whether they would happen to land on INT_MIN or not, and AFAICT there are no bitops for integers.) For strings in the numpy context, just adding another byte to hold the NA-ness flag seems more sensible than stealing some random string. In both cases, the more generic maybe() dtype I suggested might be cleaner. -- Nathaniel
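To make the integer problem concrete, here is a toy scalar sketch in Python of R-style NA arithmetic for int32. The `NA` marker and `naint32_add` are invented for illustration; this is not how a real dtype would be implemented:

```python
import numpy as np

INT32_MIN = -2**31       # R's reserved NA sentinel for integers
NA = object()            # hypothetical NA marker, for illustration only

def naint32_add(a, b):
    """Toy scalar addition with R-style NA semantics for int32."""
    if a is NA or b is NA:
        return NA        # NA propagates, like NaN
    result = a + b
    # As in R, *any* overflow out of the int32 range becomes NA, not
    # just results that happen to land on INT32_MIN; and a legitimate
    # result equal to the sentinel must also become NA.
    if not (-2**31 <= result <= 2**31 - 1) or result == INT32_MIN:
        return NA
    return result

print(naint32_add(3, 4))            # ordinary arithmetic
print(naint32_add(2**31 - 1, 1))    # overflow -> NA
```

Note the cost: every integer operation now carries a range check, which is exactly the break with plain two's complement arithmetic described above.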
On Fri, Jun 24, 2011 at 7:30 AM, Laurent Gautier
On 2011-06-24 13:59, Nathaniel Smith
wrote: Lastly, I am not entirely familiar with R, so I am also very curious about what this magical "NA" value is, and how it compares to how NaNs work. Although, Pierre brought up the very good point that NaNs woulldn't work anyway with integer arrays (and object arrays, etc.). Since R is designed for statistics, they made the interesting decision
On Thu, Jun 23, 2011 at 5:56 PM, Benjamin Root
wrote: that *all* of their core types have a special designated "missing" value. At the R level this is just called "NA". Internally, there are a bunch of different NA values -- for floats it's a particular NaN, for integers it's INT_MIN, for booleans it's 2 (IIRC), etc. (You never notice this, because R will silently cast a NA of one type into NA of another type whenever needed, and they all print the same.) Because any array can contain NA's, all R functions then have to have some way of handling this -- all their integer arithmetic knows that INT_MIN is special, for instance. The rules are basically the same as for NaN's, but NA and NaN are different from each other (because one means "I don't know, could be anything" and the other means "you tried to divide by 0, I *know* that's meaningless").
That's basically it.
-- Nathaniel
Would the use of R's system for expressing "missing values" be possible in numpy through a special flag ?
I think that's something R2Py would have to handle in its compatibility layer. I'd like to first make the system within NumPy work well for NumPy; interoperability at the low ABI level like this is a bit too restricting, I think. -Mark Any given numpy array could have a boolean flag (say "na_aware")
indicating that some of the values are representing a missing cell.
If the exact same system is used, interaction with R (through something like rpy2) would be simplified and more robust.
L.
PS: In R, dividing one by zero returns +/-Inf, not NaN. 0/0 returns NaN.
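That PS matches IEEE 754 semantics, which NumPy follows as well; a quick check (suppressing the divide/invalid warnings):

```python
import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    pos = np.float64(1.0) / 0.0    # +Inf
    neg = np.float64(-1.0) / 0.0   # -Inf
    und = np.float64(0.0) / 0.0    # NaN
print(pos, neg, und)
```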
participants (10)
- Bruce Southey
- Charles R Harris
- Christopher Barker
- Keith Goodman
- Laurent Gautier
- Mark Wiebe
- Matthew Brett
- Nathaniel Smith
- Pierre GM
- Robert Kern