[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Sat Jun 25 02:36:04 EDT 2011

On 2011-06-24 17:30, Robert Kern <robert.kern at gmail.com> wrote:
> On Fri, Jun 24, 2011 at 10:07, Laurent Gautier <lgautier at gmail.com> wrote:
>> On 2011-06-24 16:43, Robert Kern <robert.kern at gmail.com> wrote:
>>> On Fri, Jun 24, 2011 at 09:33, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>>> On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern <robert.kern at gmail.com>
>>>> wrote:
>>>>> The alternative proposal would be to add a few new dtypes that are
>>>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value
>>>>> (there are lots of different NaN bit patterns, we'd just reserve one)
>>>>> that would represent NA. An naint32 would probably reserve the most
>>>>> negative int32 value (like R does). Using the NA-aware dtypes signals
>>>>> that you are using NA values; there is no need for an additional flag.
>>>>
>>>> Definitely better names than r-int32. Going this way has the advantage of
>>>> reducing the friction between R and numpy, and since R has pretty much
>>>> become the standard software for statistics that is an important
>>>> consideration.
>>>
>>> I would definitely steal their choices of NA value for naint32 and
>>> nafloat64. I have reservations about their string NA value (i.e. 'NA')
>>> as anyone doing business in North America and other continents may
>>> have issues with that....
>>
>> Maybe there is not so much need for reservation over the string NA, when
>> making the distinction between:
>> a- the internal representation of a "missing string" (what is stored in
>> memory, and that C-level code would need to be aware of)
>> b- the 'external' representation of a missing string (in Python, what would
>> be returned by repr())
>> c- what is assumed to be a missing string value when reading from a file.
>>
>> a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can
>> be configured as a module-level, class-level, or instance-level variable.
> In R, a/ happens to be 'NA', unfortunately. :-/

In a sense yes, in a sense no.

There is NA_STRING (that happens to store 'NA', but it could equally be 
'foobar' or whatever) and there is "NA".
NA_STRING is set once and for all, and each time a string element in a
vector is set to NA, that element points to that single object.

A string "NA" is not the NA_STRING.
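
To make the distinction concrete, here is a rough Python sketch of the idea
(not R's actual C code, and not a proposed numpy API). The names _NAString
and is_na are made up for the illustration; the only point is that
missingness is decided by which object an element points to, not by the
text it would print as.

```python
class _NAString:
    """Hypothetical stand-in for R's NA_STRING: one shared marker object."""

    def __repr__(self):
        # External representation (b/); it could just as well be 'foobar'.
        return "NA"


# Created once; every missing string element refers to this same object.
NA_STRING = _NAString()


def is_na(x):
    # Identity test: the ordinary string "NA" is a different object.
    return x is NA_STRING


values = ["NA", NA_STRING, "foo"]
flags = [is_na(v) for v in values]
print(flags)
```

Only the second element is flagged as missing, even though the first one
prints the same way.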

> I'm not really sure how they handle datasets that use valid 'NA'
> values. Presumably, their input routines allow one to convert such
> values to something else such that it can use 'NA'==NA internally.

That's c/. See, for example, the na.strings argument in R's
read.table(..., na.strings = "NA", ...).
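
As a sketch of what c/ could look like on the Python side, assuming a
hypothetical reader function read_rows with an na_strings parameter
(modelled on read.table's na.strings, not an existing numpy function), and
using None as a stand-in for the in-memory missing marker:

```python
import csv
import io

NA = None  # stand-in for the internal missing marker (assumption of this sketch)


def read_rows(text, na_strings=("NA",)):
    """Parse CSV text, mapping any field listed in na_strings to NA."""
    reader = csv.reader(io.StringIO(text))
    return [[NA if field in na_strings else field for field in row]
            for row in reader]


# 'NA' is a valid value here: it is the ISO country code for Namibia.
data = "country,code\nNamibia,NA\nCanada,\n"

default_rows = read_rows(data)                    # 'NA' is read as missing
custom_rows = read_rows(data, na_strings=("",))   # only empty fields are missing
```

With the default, the Namibia code is lost; with na_strings=("",) the string
'NA' survives as data and the empty field becomes the missing value.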

A number of R design choices (here inherited from S) are empirical and
based on experience.
Datasets can come in a number of different flavours, and the conversion
can be made when reading the data into memory (i.e. into R's format).

L.
> -- Robert Kern