[Numpy-discussion] Missing data again

Wed Mar 7 12:17:02 EST 2012

On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig <pierre.haessig at crans.org> wrote:

> Hi,
>
> Thank you very much for your insights!
>
> On 06/03/2012 21:59, Nathaniel Smith wrote:
> > Right -- R has a very impoverished type system as compared to numpy.
> > There are basically four types: "numeric" (meaning double precision
> > float), "integer", "logical" (boolean), and "character" (string). And
> > in practice the integer type is essentially unused, because R parses
> > numbers like "1" as being floating point, not integer; the only way to
> > get an integer value is to explicitly cast to it. Each of these types
> > has a specific bit-pattern set aside for representing NA. And...
> > that's it. It's very simple when it works, but also very limited.
> I also suspected R to be less powerful in terms of types.
> However, I think the fact that "It's very simple when it works" is
> important to take into account. At the end of the day, with all the
> fanciness available, the question is not only "can I have some NAs in
> my array?" but also "how *easily* can I have some NAs in my array?".
> It's about balancing the "how easy" and the "how powerful".
>
> Ease of use is the reason for my concern about having separate
> types "nafloatNN" and "floatNN". Of course, I won't dispute that "not
> breaking everything" is even more important!
>
> Coming back to Travis's proposal that "bit-pattern approaches to
> missing data (*at least* for float64 and int32) need to be
> implemented", I wonder how much extra work it would take to go from
> nafloat64 to nafloat32/16. Is there hardware support for NaN payloads
> with these smaller floats? If not, or if it is too complicated, I feel
> it is acceptable to say "it's too complicated" and fall back to masks.
> One may have to choose between fancy types and fancy NAs...
>
>
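On the NaN-payload question quoted above, here is a minimal sketch of what
a bit-pattern NA could look like for the smaller floats. It assumes IEEE 754
layouts (22 free payload bits below the quiet bit in float32, 9 in float16).
The helper name `nan_with_payload` and the payload values are hypothetical.
The sketch only round-trips the bits through `view`; whether arithmetic
preserves a payload depends on the hardware and is not shown here.

```python
import numpy as np

def nan_with_payload(payload, uint_type, float_type, quiet_nan_bits):
    """Hypothetical helper: build a one-element quiet-NaN array with a payload."""
    bits = np.array([quiet_nan_bits | payload], dtype=uint_type)
    return bits.view(float_type)  # reinterpret the bits, no conversion

# Quiet-NaN base patterns: exponent all ones plus the top mantissa bit.
f32 = nan_with_payload(0x7A2, np.uint32, np.float32, 0x7FC00000)
f16 = nan_with_payload(0x0A2, np.uint16, np.float16, 0x7E00)

# Both values are NaN as far as NumPy is concerned ...
is_nan32 = bool(np.isnan(f32)[0])
is_nan16 = bool(np.isnan(f16)[0])

# ... and the payload can be read back by masking off the low mantissa bits.
payload32 = int(f32.view(np.uint32)[0]) & 0x003FFFFF
payload16 = int(f16.view(np.uint16)[0]) & 0x01FF
```
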
I'm in agreement here, and that was a major consideration in making a
'masked' implementation first. Also, different folks adopt different values
for 'missing' data, and distributing one or several masks along with the
data is another common practice.
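
As a rough illustration of those two practices, here is a sketch using the
existing numpy.ma module rather than the new NA-mask API under discussion.
The sentinel value -999.0 and the sample arrays are made up for the example.

```python
import numpy as np

# Convention 1: a sentinel value marks missing entries (value is an assumption).
data_a = np.array([1.5, -999.0, 3.0, -999.0])
a = np.ma.masked_values(data_a, -999.0)

# Convention 2: a boolean mask is distributed alongside the data.
data_b = np.array([4, 0, 6, 7], dtype=np.int32)
mask_b = np.array([False, True, False, False])
b = np.ma.array(data_b, mask=mask_b)

# Reductions skip the masked entries in both cases.
mean_a = a.mean()  # average of 1.5 and 3.0
mean_b = b.mean()  # average of 4, 6 and 7
```

Either way the result is the same kind of masked array, which is part of
why a mask is a convenient common denominator.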

One inconvenience I have run into with the current API is that it should be
easier to clear the mask from an "ignored" value without taking a new view
or assigning known data. So maybe two types of masks (different payloads),
or an additional flag, could be helpful. Assigning masks could also be made
a bit easier than it is with fancy indexing.
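
To make the wish concrete, here is a sketch of the behaviour I have in mind,
written against numpy.ma as a stand-in (it is not the NA-mask API discussed
in this thread). The helper name `unmask` is hypothetical.

```python
import numpy as np

values = np.ma.array([1.0, 2.0, 3.0, 4.0],
                     mask=[False, True, False, True])

def unmask(arr, index):
    """Hypothetical helper: clear mask entries, leave the stored data alone."""
    new_mask = np.ma.getmaskarray(arr).copy()
    new_mask[index] = False
    arr.mask = new_mask  # relies on the default soft mask

# Clear the mask on element 1 without assigning any data to it.
unmask(values, 1)
restored = float(values[1])  # the value that was hidden under the mask

# Masking, for comparison, is a plain assignment of the masked constant.
values[0] = np.ma.masked
```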

Chuck