Re: [Numpy-discussion] miniNEP 2: NA support via special dtypes

6 Jul 2011


      On Wed, Jul 6, 2011 at 7:34 PM, Nathaniel Smith <njs@pobox.com> wrote:
...
Well, everyone seems to like my first attempt at this so far, so I
guess I'll really stick my foot in it now... here's my second miniNEP,
which lays out a plan for handling dtype/bit-pattern-style NAs. I've
stolen bits of text from both the NEP and the alterNEP for this, but
since the focus is on nailing down the details, most of the content is
new.
There are many FIXMEs noted, where some decisions or more work is
needed... the idea here is to lay out some specifics, so we can figure
out if the idea will work and get the details right. So feedback is
*very* welcome!
Master version:
 https://gist.github.com/1068264
Current version for commenting:
#######################################
miniNEP 2: NA support via special dtypes
#######################################
To try and make more progress on the whole missing values/masked
arrays/... debate, it seems useful to have a more technical discussion
of the pieces which we *can* agree on. This is the second, which
attempts to nail down the details of how NAs can be implemented using
special dtypes.
*****************
Table of contents
*****************
.. contents::
*********
Rationale
*********
An ordinary value is something like an integer or a floating point
number. A missing value is a placeholder for an ordinary value that is
for some reason unavailable. For example, in working with statistical
data, we often build tables in which each row represents one item, and
each column represents properties of that item. For instance, we might
take a group of people and for each one record height, age, education
level, and income, and then stick these values into a table. But then
we discover that our research assistant screwed up and forgot to
record the age of one of our individuals. We could throw out the rest
of their data as well, but this would be wasteful; even such an
incomplete row is still perfectly usable for some analyses (e.g., we
can compute the correlation of height and income). The traditional way
to handle this would be to stick some particular meaningless value in
for the missing data, e.g., recording this person's age as 0. But this
is very error prone; we may later forget about these special values
while running other analyses, and discover to our surprise that babies
have higher incomes than teenagers. (In this case, the solution would
be to just leave out all the items where we have no age recorded, but
this isn't a general solution; many analyses require something more
clever to handle missing values.) So instead of using an ordinary
value like 0, we define a special "missing" value, written "NA" for
"not available".
There are several possible ways to represent such a value in memory.
For instance, we could reserve a specific value (like 0, or a
particular NaN, or the smallest negative integer) and then ensure that
this value is treated specially by all arithmetic and other operations
on our array. Another option would be to add an additional mask array
next to our main array, use this to indicate which values should be
treated as NA, and then extend our array operations to check this mask
array whenever performing computations. Each implementation approach
has various strengths and weaknesses, but here we focus on the former
(value-based) approach exclusively and leave the possible addition of
the latter to future discussion. The core advantages of this approach
are (1) it adds no additional memory overhead, (2) it is
straightforward to store and retrieve such arrays to disk using
existing file storage formats, (3) it allows binary compatibility with
R arrays including NA values, (4) it is compatible with the common
practice of using NaN to indicate missingness when working with
floating point numbers, (5) the dtype is already a place where "weird
things can happen" -- there are a wide variety of dtypes that don't
act like ordinary numbers (including structs, Python objects,
fixed-length strings, ...), so code that accepts arbitrary numpy
arrays already has to be prepared to handle these (even if only by
checking for them and raising an error). Therefore adding yet more new
dtypes has less impact on extension authors than if we change the
ndarray object itself.
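To make the two representations concrete, here is a minimal sketch using
today's plain numpy. The sentinel choice (the smallest int32) and the names
``NA_INT32``, ``ages``, ``values`` and ``mask`` are assumptions made for
illustration only; nothing here is the proposed NA dtype itself.

```python
import numpy as np

# Hypothetical sentinel: the smallest int32 stands in for NA.
# (An assumption for illustration; the proposal leaves the bit pattern open.)
NA_INT32 = np.iinfo(np.int32).min

# Value-based approach: the NA lives inside the data buffer itself.
ages = np.array([34, NA_INT32, 27, 51], dtype=np.int32)
is_na = ages == NA_INT32

# Mask-based approach: a separate boolean array marks the missing entry.
values = np.array([34, 0, 27, 51], dtype=np.int32)
mask = np.array([False, True, False, False])

# Memory cost: the sentinel adds nothing, the mask adds one byte per element.
sentinel_bytes = ages.nbytes
masked_bytes = values.nbytes + mask.nbytes

# Without special treatment the sentinel silently corrupts arithmetic,
# which is why every operation on such an array has to know about it.
naive_mean = ages.mean()
known_mean = ages[~is_na].mean()

print(sentinel_bytes, masked_bytes)
print(naive_mean, known_mean)
```

The naive mean shows the cost of the value-based approach: the storage is
free, but the dtype's operations must treat the reserved pattern specially.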
The basic semantics of NA values are as follows. Like any other value,
they must be supported by your array's dtype -- you can't store a
floating point number in an array with dtype=int32, and you can't
store an NA in it either. You need an array with dtype=NAint32 or
something (exact syntax to be determined). Otherwise, NA values act
exactly like any other values. In particular, you can apply arithmetic
functions and so forth to them. By default, any function which takes
an NA as an argument always returns an NA as well, regardless of the
values of the other arguments. This ensures that if we try to compute
the correlation of income with age, we will get "NA", meaning "given
that some of the entries could be anything, the answer could be
anything as well". This reminds us to spend a moment thinking about
how we should rephrase our question to be more meaningful. And as a
convenience for those times when you do decide that you just want the
correlation between the known ages and income, then you can enable
this behavior by adding a single argument to your function call.
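The propagation rule and the opt-in "skip" argument described above can be
sketched roughly as follows. This is only an emulation on top of int64 with a
reserved sentinel; the helper names ``na_add`` and ``na_sum`` and the
``skipna`` keyword are hypothetical and are not part of numpy or of the
proposal's (still undetermined) syntax.

```python
import numpy as np

# Hypothetical NA bit pattern for int64 (an assumption for illustration).
NA = np.iinfo(np.int64).min

def na_add(a, b):
    """Elementwise add where any NA input yields an NA output."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    na = (a == NA) | (b == NA)
    # Zero out NA slots before adding so the sentinel cannot overflow.
    out = np.where(na, 0, a) + np.where(na, 0, b)
    return np.where(na, NA, out)

def na_sum(a, skipna=False):
    """Sum that returns NA if any input is NA, unless skipna is set."""
    a = np.asarray(a, dtype=np.int64)
    na = a == NA
    if na.any() and not skipna:
        return NA
    return int(a[~na].sum())

added = na_add([1, NA, 3], [10, 20, NA])
total_default = na_sum([1, NA, 3])
total_skipping = na_sum([1, NA, 3], skipna=True)
```

Here ``total_default`` is NA because one input is unknown, while
``total_skipping`` is the sum of the known values only.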
For floating point computations, NAs and NaNs have (almost?) identical
behavior. But they represent different things -- NaN is the result of an
invalid computation like 0/0, while NA is a value that is not available -- and
distinguishing between these things is useful because in some
situations they should be treated differently. (For example, an
imputation procedure should replace NAs with imputed values, but
probably should leave NaNs alone.) And anyway, we can't use NaNs for
integers, or strings, or booleans, so we need NA anyway, and once we
have NA support for all these types, we might as well support it for
floating point too for consistency.
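One way a floating point NA can stay distinguishable from an ordinary NaN is
to reserve a particular NaN payload. The sketch below assumes the convention
used by R (a NaN whose low 32 bits hold 1954) and compares bit patterns
through an integer view; ``NA_BITS`` and ``is_na`` are hypothetical names, and
the imputation step is just the simplest possible stand-in (mean of the known
values).

```python
import numpy as np

# Assumption: R-style NA for doubles, a NaN whose low word is 1954 (0x7A2).
NA_BITS = np.uint64(0x7FF00000000007A2)

def is_na(a):
    """True where a contiguous float64 array holds exactly the NA bit pattern."""
    return a.view(np.uint64) == NA_BITS

data = np.array([1.5, np.nan, 0.0, 2.5])
# Mark the third entry as NA by writing the bit pattern directly,
# so that no floating point operation can alter the payload.
data.view(np.uint64)[2] = NA_BITS

na_mask = is_na(data)
nan_only = np.isnan(data) & ~na_mask   # ordinary NaNs, not NAs

# Imputation replaces NAs but leaves the ordinary NaN alone.
known = data[~np.isnan(data)]
imputed = data.copy()
imputed[na_mask] = known.mean()
```

Both entries look like NaN to ``np.isnan``, but only the one carrying the
reserved payload is treated as missing and imputed.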
****************
General strategy
****************
Numpy already has a general mechanism for defining new dtypes and
slotting them in so that they're supported by ndarrays, by the casting
machinery, by ufuncs, and so on. In principle, we could implement

Well, actually not in any useful sense; take a look at what Mark went
through for the half floats. There is a reason the NEP went with
parametrized dtypes and masks. But we would sure welcome a plan and code to
make it true; it is one of the areas that could really use improvement.

<snip>

Chuck


Charles R Harris