[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 12:54:23 EDT 2011

On Fri, Jun 24, 2011 at 9:33 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> But on the other hand, we gain:
>>  -- simpler implementation: no need to be checking and tracking the
>> mask buffer everywhere. The needed infrastructure is already built in.
>
> I don't believe this is true. The dtype mechanism would need a lot of work
> to build that needed infrastructure first. The analysis I've done so far
> indicates the masked approach will give a simpler/cleaner implementation.

Really? One implementation option would be the one I described, where
it just uses the standard dtype extension machinery. AFAICT the only
core change that would be needed to allow that is for us to add
another argument to the PyUFuncGenericFunction signature, which
contains the dtype(s) of the array(s) being operated on. The other
option would be to add a special case to the ufunc looping code, so
that if we have a 'maybe' dtype it would check the NA flag itself
before calling down to the actual ufuncs. (This is more intrusive,
but not any more intrusive than the masking approach.)
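
To make the second option concrete, here is a rough Python sketch of a
loop that checks the NA flag itself before calling the inner function.
It is only an illustration: `maybe_int32` and `maybe_binary` are
hypothetical names, and a structured dtype with a per-element `na`
field is an assumed stand-in for a real 'maybe' dtype, not how the
ufunc machinery is actually written.

```python
import numpy as np

# Hypothetical stand-in for a parameterized 'maybe' dtype: each element
# stores its own NA flag next to the payload value.
maybe_int32 = np.dtype([("na", np.bool_), ("value", np.int32)])

def maybe_binary(inner, a, b):
    """Apply `inner` elementwise; NA inputs give NA outputs and are
    never passed down to `inner`."""
    out = np.zeros(a.shape, dtype=maybe_int32)
    out["na"] = a["na"] | b["na"]          # NA propagates from either input
    valid = ~out["na"]
    # Only the non-NA elements reach the actual inner loop.
    out["value"][valid] = inner(a["value"][valid], b["value"][valid])
    return out

a = np.array([(False, 1), (True, 0), (False, 3)], dtype=maybe_int32)
b = np.array([(False, 10), (False, 20), (False, 30)], dtype=maybe_int32)
result = maybe_binary(np.add, a, b)
```

The inner function (`np.add` here) stays unaware of NA; all of the
flag handling lives in the wrapper, which is the point of that option.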

As far as I can see, neither approach requires any infrastructure
changes as a prerequisite, so I'm probably missing something. What
problems are you thinking of?

>>  -- simpler conceptually: we already have the dtype concept, it's
>> very powerful and we use it for all sorts of things; using it here too
>> plays to our strengths. We already know what a numpy scalar is and how
>> it works. Everyone already understands how assigning a value to an
>> element of an array works, how it interacts with broadcasting, etc.,
>> etc., and in this model, that's all a missing value is -- just another
>> value.
>
> From Python, this aspect of things would be virtually identical between the
> two mechanisms. The dtype approach would require more coding and overhead
> where you have to create copies of your data to convert it into the
> parameterized "NA[int32]" dtype, versus with the masked approach where you
> say x.flags.hasmask = True or something like that without copying the data.

Yes, converting from a regular dtype to an NA-ful dtype would require
a copy, but that's true for any dtype conversion. The solution is the
same as always -- just use the correct dtype from the start. Is there
some reason why people would often start with the 'wrong' dtype and
then need to convert?
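
A small sketch of the copy-on-conversion point, using dtypes that exist
in NumPy today. `float64` is only an assumed stand-in for the proposed
NA-ful dtype, which does not exist; the names `raw`, `x`, `y` and `z`
are illustrative.

```python
import numpy as np

raw = [1, 2, 3]

# Converting after the fact: astype() allocates a second buffer.
x = np.array(raw, dtype=np.int32)
y = x.astype(np.float64)
converted_shares_memory = np.shares_memory(x, y)   # the two do not overlap

# Using the target dtype from the start: no intermediate int32 array.
z = np.array(raw, dtype=np.float64)
```

The same pattern would apply to any dtype conversion, which is why the
copy is not specific to an NA-ful dtype.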

-- Nathaniel