[Numpy-discussion] missing data discussion round 2

Thu Jun 30 15:52:47 EDT 2011

On Thu, Jun 30, 2011 at 12:27 PM, Eric Firing <efiring at hawaii.edu> wrote:
> On 06/30/2011 08:53 AM, Nathaniel Smith wrote:
>> On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing <efiring at hawaii.edu> wrote:
>>> In addition, for new code, the full-blown masked array module may not be
>>> needed.  A convenience it adds, however, is the automatic masking of
>>> invalid values:
>>>
>>> In [1]: np.ma.log(-1)
>>> Out[1]: masked
>>>
>>> I'm sure this horrifies some, but there are times and places where it is
>>> a genuine convenience, and preferable to having to use a separate
>>> operation to replace nan or inf with NA or whatever it ends up being.
>>
>> Err, but what would this even get you? NA, NaN, and Inf basically all
>> behave the same WRT floating point operations anyway, i.e., they all
>> propagate?
>
> Not exactly. First, it depends on np.seterr;

IIUC, you're proposing to make this conversion depend on np.seterr
too, though, right?
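
For concreteness, here is a quick sketch of the np.seterr dependence as I
understand it, using the np.errstate context manager (the scoped form of
np.seterr). The variable names are just mine for illustration:

```python
import numpy as np

x = np.array([-1.0])

# With invalid='ignore', log of a negative number quietly yields NaN.
with np.errstate(invalid="ignore"):
    quiet = np.log(x)

# With invalid='raise', the very same operation raises instead.
try:
    with np.errstate(invalid="raise"):
        np.log(x)
    raised = False
except FloatingPointError:
    raised = True

# np.ma.log returns the masked constant for the invalid input.
masked_result = np.ma.log(-1)

print(quiet, raised, masked_result)
```

So the plain ufunc result already changes with the error state, while
np.ma.log hands back "masked" for the same input.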

> second, calculations on NaN
> can be very slow, so are better avoided entirely

They're slow because inside the processor they require a branch and a
separate code path (which doesn't get a lot of transistors allocated
to it). In any of the NA proposals we're talking about, handling an NA
would require a software branch and a separate code path (which is in
ordinary software, now, so it doesn't get any special transistors
allocated to it...). I don't think masking support is likely to give
you a speedup over the processor's NaN handling.
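
If anyone wants to measure this rather than argue about it, something
like the following rough sketch would do. I'm not claiming any particular
result; the array size and repeat counts are arbitrary assumptions, and
the numbers will depend heavily on the hardware and the numpy build:

```python
import timeit
import numpy as np

n = 200_000
clean = np.ones(n)
with_nan = clean.copy()
with_nan[::2] = np.nan  # half the entries are NaN

# Time the same multiplication on both arrays.
t_clean = min(timeit.repeat(lambda: clean * 2.0, number=20, repeat=3))
t_nan = min(timeit.repeat(lambda: with_nan * 2.0, number=20, repeat=3))

# A software pass over the data, roughly the kind of extra check an
# NA implementation would need on top of the arithmetic itself.
t_check = min(timeit.repeat(lambda: np.isnan(with_nan), number=20, repeat=3))

print(f"clean: {t_clean:.4f}s  nan: {t_nan:.4f}s  isnan pass: {t_check:.4f}s")
```

The point of the third timing is only that any software-side check has a
cost of its own that has to be paid in addition to the FP operation.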

And if it did, that would mean we could speed up FP operations in
general by checking for NaN in software, so we should do that
everywhere anyway instead of making it an NA-specific feature...

> third, if an array is
> passed to extension code, it is much nicer if that code only has one NA
> value to handle, instead of having to check for all possible "bad" values.

I'm pretty sure that Mark's proposal does not work this way -- he's
saying that the NA-checking code in numpy could optionally check for
all these different "bad" values and handle them the same in ufuncs,
not that we would check the outputs of all FP operations for "bad"
values and then replace them by NA. So your extension code would still
have the same problem. Sorry :-(
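
With what exists today, the "separate operation" you mentioned looks
something like this sketch. np.ma.masked_invalid is real; the wrapper
function name is hypothetical, and this says nothing about how the NA
proposals would eventually spell it:

```python
import numpy as np

def consolidate_bad_values(a):
    """Hypothetical helper: fold NaN and +/-Inf into a single mask."""
    return np.ma.masked_invalid(a)

raw = np.array([1.0, np.nan, np.inf, -np.inf, 2.0])
m = consolidate_bad_values(raw)

# Extension code would then only need to consult one boolean mask.
mask = np.ma.getmaskarray(m)

# Reductions on the masked array skip the masked entries.
total = m.sum()
```

That is an explicit step the caller has to take before handing the array
to extension code; nothing does it automatically on ufunc outputs.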

-- Nathaniel