[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 11:30:32 EDT 2011

On Fri, Jun 24, 2011 at 10:02, Pierre GM <pgmdevlist at gmail.com> wrote:
>
> On Jun 24, 2011, at 4:44 PM, Robert Kern wrote:
>
>> On Fri, Jun 24, 2011 at 09:35, Robert Kern <robert.kern at gmail.com> wrote:
>>> On Fri, Jun 24, 2011 at 09:24, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> On Fri, Jun 24, 2011 at 7:06 AM, Robert Kern <robert.kern at gmail.com> wrote:
>>>>
>>>>> The alternative proposal would be to add a few new dtypes that are
>>>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value
>>>>> (there are lots of different NaN bit patterns, we'd just reserve one)
>>>>> that would represent NA. An naint32 would probably reserve the most
>>>>> negative int32 value (like R does). Using the NA-aware dtypes signals
>>>>> that you are using NA values; there is no need for an additional flag.
>>>>
>>>> I don't understand the numpy design and maintainability issues, but from
>>>> a user perspective (mine) nafloat64, etc. sounds nice.
>>>
>>> It's worth noting that this is not a replacement for masked arrays,
>>> nor is it intended to be the be-all, end-all solution to missing data
>>> problems. It's mostly just intended to be a focused tool to fill in
>>> the gaps where masked arrays are less convenient for whatever reason;
>>> e.g. where you're tempted to (ab)use NaNs for the purpose and the
>>> limitations on the range of values are acceptable. Not every dtype
>>> would have an NA-aware counterpart. I would suggest just nabool,
>>> nafloat64, naint32, nastring (a little tricky due to the flexible
>>> size, but doable), and naobject. Maybe a couple more, if we get
>>> requests, like naint64 and nacomplex128.
>>
>> Oh, and nadatetime64 and natimedelta64.
>
> So, if I understand correctly:
> if my array has a nafloat type, it's an array that supports missing values and it will always have a mask, right?

Not quite; there are no separate mask arrays with this approach. It's
more akin to using NaNs to represent missing values except with more
rigor. NA values won't be "accidentally" created by computations
on non-NA values.
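
To illustrate the ambiguity that plain NaNs have, here is a minimal
pure-Python sketch (nothing numpy-specific; the variable names are only
for illustration). A NaN stored to mean "missing" and a NaN that falls
out of arithmetic on ordinary values look the same to the usual test:

```python
import math

# A NaN (ab)used by the caller to mean "this observation is missing".
data = [1.0, float("nan"), 3.0]

# A NaN produced by arithmetic on non-NaN inputs (inf - inf is undefined).
accident = float("inf") - float("inf")
data.append(accident)

# The usual test flags both entries, so the deliberate "missing" marker
# and the computational accident can no longer be told apart.
flags = [math.isnan(x) for x in data]
print(flags)
```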

> And just viewing an array as a nafloat dtyped one would make it an 'array-with-missing-values'? That's pretty elegant. I like that.
> Now, how will masked values be represented?

For the float types, we use a particular NaN bit-pattern (we'll steal
R's choice). For the int types, we use the most negative number. For
strings, R uses 'NA', but I'd *like* to use something less likely to
conflict with actual use. For the date/time types, we would reserve a
value close to the NaT value. For objects, we would have a singleton
created specifically for this purpose. bools, which are internally
represented by a uint8, will use 2.
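
A rough sketch of what those reserved values could look like, using only
the standard library. The constant and function names are hypothetical.
The float pattern assumes R's convention of a NaN whose low 32 bits hold
1954; the test looks only at that low word, so it does not depend on
whether the platform preserves the rest of the NaN payload:

```python
import math
import struct

# Assumption: R's NA_real_ is a NaN whose low 32 bits hold 1954.
NA_FLOAT64_BITS = 0x7FF00000000007A2
NA_INT32 = -2**31   # most negative int32 value
NA_BOOL = 2         # a third state in the uint8 that backs a bool


def float_to_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(bits):
    """Build a Python float from a 64-bit IEEE 754 pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_na_float64(x):
    """True only for NaNs carrying the reserved low word, not for any NaN."""
    return math.isnan(x) and (float_to_bits(x) & 0xFFFFFFFF) == 1954


na = bits_to_float(NA_FLOAT64_BITS)
ordinary_nan = float("inf") - float("inf")  # NaN, but not the reserved one
```

Here is_na_float64(na) is true while is_na_float64(ordinary_nan) is
false, even though math.isnan() is true for both.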

> Different masked values from one dtype to another? What would be the equivalent of something like `if a[0] is masked` that we have now?

I would suggest following R's lead and letting ((NA==NA) == True)
unlike NaNs. Each NA-aware scalar type would have a class attribute
giving its NA value:

  if a[0] == nafloat64.NA:
      ...

  good_values = (a != nafloat64.NA)

You could possibly make a general NA object with smart comparison
methods that will inspect the dtype of the other object so you don't
have to know the dtype in your code, but that's a little magic.
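
A toy sketch of that idea in pure Python. The classes below are
hypothetical stand-ins for NA-aware scalar types, not numpy code; the
only parts taken from the proposal are the per-type NA attribute, the
sentinel values, and NA == NA being True:

```python
class _Scalar:
    """Hypothetical stand-in for an NA-aware scalar type."""

    def __init__(self, raw):
        self.raw = raw

    def __eq__(self, other):
        if isinstance(other, _Scalar):
            return type(self) is type(other) and self.raw == other.raw
        return NotImplemented  # let the other operand decide


class naint32(_Scalar):
    pass


class nabool(_Scalar):
    pass


# Each type carries its own NA value as a class attribute.
naint32.NA = naint32(-2**31)
nabool.NA = nabool(2)


class _GenericNA:
    """Compares equal to the NA value of whatever type it is compared with."""

    def __eq__(self, other):
        na = getattr(type(other), "NA", None)
        if na is None:
            return NotImplemented
        return other == na

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result


NA = _GenericNA()

a = [naint32(5), naint32(-2**31), naint32(7)]
good_values = [x != NA for x in a]   # no dtype named in the comparison
```

The magic is in _GenericNA.__eq__, which looks up the NA attribute on
the other operand's type instead of requiring the caller to name it.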

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco