On Wed, Jul 6, 2011 at 7:34 PM, Nathaniel Smith <njs@pobox.com> wrote:
Well, everyone seems to like my first attempt at this so far, so I guess I'll really stick my foot in it now... here's my second miniNEP, which lays out a plan for handling dtype/bit-pattern-style NAs. I've stolen bits of text from both the NEP and the alterNEP for this, but since the focus is on nailing down the details, most of the content is new.
There are many FIXME's noted, where some decisions or more work is needed... the idea here is to lay out some specifics, so we can figure out if the idea will work and get the details right. So feedback is *very* welcome!
Master version: https://gist.github.com/1068264
Current version for commenting:
####################################### miniNEP 2: NA support via special dtypes #######################################
To try and make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces which we *can* agree on. This is the second, which attempts to nail down the details of how NAs can be implemented using special dtype's.
***************** Table of contents *****************
.. contents::
********* Rationale *********
An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".
There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where `weird things can happen' -- there are a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself.
The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, then you can enable this behavior by adding a single argument to your function call.
For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can't use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.
**************** General strategy ****************
Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement
Well, actually not in any useful sense, take a look at what Mark went through for the half floats. There is a reason the NEP went with parametrized dtypes and masks. But we would sure welcome a plan and code to make it true, it is one of the areas that could really use improvement. <snip> Chuck