[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 12:55:42 EDT 2011

On Fri, Jun 24, 2011 at 8:57 AM, Keith Goodman <kwgoodman at gmail.com> wrote:

> On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
> >>
> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> >> > Enthought has asked me to look into the "missing data" problem and how
> >> > NumPy could treat it better. I've considered the different ideas of
> >> > adding dtype variants with a special signal value and masked arrays,
> >> > and concluded that adding masks to the core ndarray appears to be the
> >> > best way to deal with the problem in general.
> >> > I've written a NEP that proposes a particular design, viewable here:
> >> >
> >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
> >> >
> >> > There are some questions at the bottom of the NEP which definitely need
> >> > discussion to find the best design choices. Please read, and let me
> >> > know of all the errors and gaps you find in the document.
> >>
> >> Wow, that is exciting.
> >>
> >> I wonder about the relative performance of the two possible
> >> implementations (mask and NA) in the NEP.
> >
> > I've given that some thought, and I don't think there's a clear way to
> > tell what the performance gap would be without implementations of both
> > to benchmark against each other. I favor the mask primarily because it
> > provides masking for all data types in one go with a single consistent
> > interface to program against. For adding NA signal values, each new data
> > type would need a lot of work to gain the same level of support.
> >>
> >> If you are, say, doing a calculation along the columns of a 2d array
> >> one element at a time, then you will need to grab an element from the
> >> array and grab the corresponding element from the mask. I assume the
> >> corresponding data and mask elements are not stored together. That
> >> would be slow since memory access is usually where time is spent. In
> >> this regard NA would be faster.
> >
> > Yes, the masks add more memory traffic and some extra calculation, while
> > the NA signal values just require some additional calculations.
>
> I guess a better example would have been summing along rows instead of
> columns of a large C order array. If one needs to look at both the
> data and the mask then wouldn't summing along rows in cython be about
> as slow as it is currently to sum along columns?
>

Not quite: both the mask and the array data are being traversed coherently,
so it isn't jumping around in memory like in the columns case you're
describing.
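
A rough NumPy sketch of that point, assuming the mask is a separate boolean
array with the same shape and C order as the data, and that True means
"valid". The names `data`, `mask`, and `masked_row_sums` are made up for
illustration; this is not the NEP's API.

```python
import numpy as np

def masked_row_sums(data, mask):
    """Sum each row of `data`, skipping elements where `mask` is False.

    Assumption for this sketch: `mask` is True for valid elements.
    """
    # Both arrays are traversed in the same order, so for C-order inputs
    # the reads of `data` and `mask` are each sequential in memory.
    return np.where(mask, data, 0).sum(axis=1)

data = np.arange(12, dtype=np.float64).reshape(3, 4)
mask = np.ones((3, 4), dtype=bool)
mask[0, 1] = False   # hide the value 1.0 in row 0
mask[2, 3] = False   # hide the value 11.0 in row 2

row_sums = masked_row_sums(data, mask)
print(row_sums)
```

The mask does add a second stream of memory reads, but both streams move
forward together rather than striding across rows.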

> >> I currently use NaN as a missing data marker. That adds things like
> >> this to my cython code:
> >>
> >>    if a[i] == a[i]:
> >>        asum += a[i]
> >>
> >> If NA also had the property NA == NA is False, then it would be easy
> >> to use.
> >
> > That's what I believe it should do, and I guess this is a strike against
> > the idea of returning None for a single missing value.
>
> If NA == NA is False then I wouldn't need to look at the mask in the
> example above. Or would ndarray have to look at the mask in order to
> return NA for a[i]? Which would mean __getitem__ would need to look at
> the mask?
>

What R does is return NA for NA == NA. Then, if you try to use it as a
boolean, it throws an exception. I like this approach.
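
To make that concrete, here is a small pure-Python sketch of R-like
semantics. The `NAType` class and the `NA` singleton are hypothetical names
used only for illustration; they are not part of NumPy or the NEP.

```python
class NAType:
    """Hypothetical missing-value singleton with R-like comparison rules."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so `x is NA` works.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __eq__(self, other):
        # Comparing anything with NA yields NA, not True or False.
        return self

    def __ne__(self, other):
        return self

    # Defining __eq__ disables hashing by default; restore identity hashing.
    __hash__ = object.__hash__

    def __bool__(self):
        # Using NA where a boolean is required is an error.
        raise TypeError("NA has no truth value")

    def __repr__(self):
        return "NA"

NA = NAType()

result = (NA == NA)      # evaluates to NA itself
try:
    if result:
        pass
    raised = False
except TypeError:
    raised = True
```

Under these rules, a test such as `if a[i] == a[i]:` would raise on a
missing element instead of silently skipping it, so the check would have to
be spelled out explicitly.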

> If the missing value is returned as a 0d array (so that NA == NA is
> False), would that break cython in a fundamental way since it could
> not always return a same-sized scalar when you index into an array?
>

I don't know enough about Cython internals to comment, sorry.

-Mark

> >> A mask, on the other hand, would be more difficult for third
> >> party packages to support. You have to check if the mask is present
> >> and if so do a mask-aware calculation; if it is not present then you
> >> have to do a non-mask based calculation.
> >
> > I actually see the mask as being easier for third party packages to
> > support, particularly from C. Having regular C-friendly values with a
> > boolean mask is a lot friendlier than values that require a lot of
> > special casing like the NA signal values would require.
> >
> >>
> >> So you have two code paths.
> >> You also need to check if any of the input arrays have masks and if so
> >> apply masks to the other inputs, etc.
> >
> > Most of the time, the masks will transparently propagate or not along
> > with the arrays, with no effort required. In Python, the code you write
> > would be virtually the same between the two approaches.
> > -Mark
> >
> >>
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion