[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 15:26:05 EDT 2011

On Fri, Jun 24, 2011 at 11:54 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Fri, Jun 24, 2011 at 9:33 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> But on the other hand, we gain:
> >>  -- simpler implementation: no need to be checking and tracking the
> >> mask buffer everywhere. The needed infrastructure is already built in.
> >
> > I don't believe this is true. The dtype mechanism would need a lot of work
> > to build that needed infrastructure first. The analysis I've done so far
> > indicates the masked approach will give a simpler/cleaner implementation.
>
> Really? One implementation option would be the one I described, where
> it just uses the standard dtype extension machinery. AFAICT the only
> core change that would be needed to allow that is for us to add
> another argument to the PyUFuncGenericFunction signature, which
> contains the dtype(s) of the array(s) being operated on. The other
> option would be to add a special case to the ufunc looping code, so
> that if we have a 'maybe' dtype it would check the NA flag itself
> before calling down to the actual ufunc's. (This is more intrusive,
> but not any more intrusive than the masking approach.)
>

Having extended the ufuncs in a direction towards supporting parameterized
types (with just some baby steps) for datetime64, the idea that a dtype
deviating from a straightforward number will work with the "standard dtype
extension machinery" doesn't seem right to me.

For the maybe dtype, it would need to gain access to the ufunc loop of the
underlying dtype, and call it appropriately during the inner loop. This
appears to require some more invasive upheaval within the ufunc code than
the masking approach.
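
To make that concrete, here is a rough Python-level sketch of what such a
wrapper loop would have to do. Everything in it is hypothetical:
maybe_binary_loop and the separate NA-flag arrays exist only for
illustration, and the real work would happen in C inside the ufunc
machinery rather than through a call to np.add.

```python
import numpy as np

def maybe_binary_loop(underlying_loop, a_vals, a_na, b_vals, b_na):
    """Hypothetical inner loop for a 'maybe' dtype wrapping another dtype."""
    # A result element is NA if either input element is NA.
    out_na = a_na | b_na
    out_vals = np.zeros_like(a_vals)
    # Delegate the arithmetic to the underlying dtype's loop, but only
    # for the elements where both inputs are present.
    underlying_loop(a_vals, b_vals, out=out_vals, where=~out_na)
    return out_vals, out_na

a_vals = np.array([1, 2, 3], dtype=np.int32)
a_na = np.array([False, True, False])
b_vals = np.array([10, 20, 30], dtype=np.int32)
b_na = np.array([False, False, True])

vals, na = maybe_binary_loop(np.add, a_vals, a_na, b_vals, b_na)
```

The point of the sketch is the delegation step: the wrapper has to find
and call the loop of the dtype it wraps, which is the part that has no
counterpart in the current ufunc code.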

> Neither approach seems to require any infrastructure changes as a
> prerequisite to me, so probably I'm missing something. What problems
> are you thinking of?
>

Adding nditer, re-enabling ABI compatibility for 1.6, and making datetime64
work reasonably as a parameterized type all required a fair number of
changes to NumPy's infrastructure. Implementing either of these missing
value designs, which do something new that hasn't previously been done in
NumPy at the C level, will definitely require something similar.

> >>  -- simpler conceptually: we already have the dtype concept, it's
> >> very powerful and we use it for all sorts of things; using it here too
> >> plays to our strengths. We already know what a numpy scalar is and how
> >> it works. Everyone already understands how assigning a value to an
> >> element of an array works, how it interacts with broadcasting, etc.,
> >> etc., and in this model, that's all a missing value is -- just another
> >> value.
> >
> > From Python, this aspect of things would be virtually identical between the
> > two mechanisms. The dtype approach would require more coding and overhead
> > where you have to create copies of your data to convert it into the
> > parameterized "NA[int32]" dtype, versus with the masked approach where you
> > say x.flags.hasmask = True or something like that without copying the data.
>
> Yes, converting from a regular dtype to an NA-ful dtype would require
> a copy, but that's true for any dtype conversion. The solution is the
> same as always -- just use the correct dtype from the start. Is there
> some reason why people would often start with the 'wrong' dtype and
> then need to convert?
>

Here's a possible scenario:

Someone has a large binary data file, say 1GB or so, that they have memmapped
to a NumPy array. Now they want to apply some criteria to determine which
rows to keep and which to ignore. Being able to add a mask to this array and
treat it with the missing data mechanism seems like it would be very
attractive here.
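
As a rough illustration of that workflow, the sketch below uses the
existing numpy.ma module as a stand-in for the proposed mechanism (the
x.flags.hasmask spelling mentioned earlier is only an idea, not an
existing API). The file name, the tiny 6x2 array, and the row criterion
are all made up for the example.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "rows.bin")  # hypothetical file name
    # Small stand-in for the large binary file: 6 rows of 2 float64 columns.
    np.arange(12, dtype=np.float64).reshape(6, 2).tofile(path)

    # Map the file read-only; the data is not loaded into memory up front.
    data = np.memmap(path, dtype=np.float64, mode="r", shape=(6, 2))

    # Example criterion: ignore every row whose first column is >= 6.
    ignore_row = np.asarray(data[:, 0] >= 6)
    mask = np.repeat(ignore_row[:, None], data.shape[1], axis=1)

    # Attach the mask; numpy.ma wraps the data without copying by default.
    masked = np.ma.masked_array(data, mask=mask)

    shares = bool(np.shares_memory(masked.data, data))  # no copy was made
    kept = int(masked.count())    # number of unmasked elements
    total = float(masked.sum())   # sum over the kept rows only

    del masked, data  # drop the mapping before the directory is removed
```

Only the boolean mask is allocated here; the mapped file itself is never
copied or converted to another dtype.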

-Mark

>
> -- Nathaniel