Re: [Numpy-discussion] new MaskedArray class

June 23, 2019

      On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk <
m.h.vankerkwijk@gmail.com> wrote:
...
Hi Allan,
I'm not sure I would go too much by what the old MaskedArray class did. It
indeed made an effort not to overwrite masked values with a new result,
even to the extend of copying back masked input data elements to the output
data array after an operation. But the fact that this is non-sensical if
the dtype changes (or the units in an operation on quantities) suggests
that this mental model simply does not work.
I think a sensible alternative mental model for the MaskedArray class is
that all it does is forward any operations to the data it holds and
separately propagate a mask,
I'm generally on-board with that mental picture, and agree that the
use-case described by Ben (different layers of satellite imagery) is
important.  Same thing happens in astronomy data, e.g. you have a CCD image
of the sky and there are cosmic rays that contaminate the image.  Those are
not garbage data, just pixels that one wants to ignore in some, but not
all, contexts.

However, it's worth noting that one cannot blindly forward any operations
to the data it holds since the operation may be illegal on that data.  The
simplest example is dividing `a / b` where  `b` has data values of 0 but
they are masked.  That operation should succeed with no exception, and here
the resultant value under the mask is genuinely garbage.

The current MaskedArray seems a bit inconsistent in dealing with invalid
calcuations.  Dividing by 0 (if masked) is no problem and returns the
numerator.  Taking the log of a masked 0 gives the usual divide by zero
RuntimeWarning and puts a 1.0 under the mask of the output.

Perhaps the expression should not even be evaluated on elements where the
output mask is True, and all the masked output data values should be set to
a predictable value (e.g. zero for numerical, zero-length string for
string, or maybe a default fill value).  That at least provides consistent
and predictable behavior that is simple to explain.  Otherwise the story is
that the data under the mask *might* be OK, unless for a particular element
the computation was invalid in which case it is filled with some arbitrary
value.  I think that is actually an error-prone behavior that should be
avoided.

- Tom
...
ORing elements together for binary operations, etc., and explicitly
skipping masked elements in reductions (ideally using `where` to be as
agnostic as possible about the underlying data, for which, e.g., setting
masked values to `0` for `np.reduce.add` may or may not be the right thing
to do - what if they are string?).
With this mental picture, the underlying data are always have well-defined
meaning: they have been operated on as if the mask did not exist. There
then is also less reason to try to avoid getting it back to the user.
As a concrete example (maybe Ben has others): in astropy we have a
sigma-clipping average routine, which uses a `MaskedArray` to iteratively
mask items that are too far off from the mean; here, the mask varies each
iteration (an initially masked element can come back into play), but the
data do not.
All the best,
Marten
On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com>
wrote:
...
On 6/21/19 2:37 PM, Benjamin Root wrote:
...
Just to note, data that is masked isn't always garbage. There are plenty
of use-cases where one may want to temporarily apply a mask for a set of
computation, or possibly want to apply a series of different masks to
the data. I haven't read through this discussion deeply enough, but is
this new class going to destroy underlying masked data? and will it be
possible to swap out masks?
Cheers!
Ben Root
Indeed my implementation currently feels free to clobber the data at
masked positions and makes no guarantees not to.
I'd like to try to support reasonable use-cases like yours though. A few
thoughts:
First, the old np.ma.MaskedArray explicitly does not promise to preserve
masked values, with a big warning in the docs. I can't recall the
examples, but I remember coming across cases where clobbering happens.
So arguably your behavior was never supported, and perhaps this means
that no-clobber behavior is difficult to reasonably support.
Second, the old np.ma.MaskedArray avoids frequent clobbering by making
lots of copies. Therefore, in most cases you will not lose any
performance in my new MaskedArray relative to the old one by making an
explicit copy yourself. I.e, is it problematic to have to do
>>> result = MaskedArray(data.copy(), trial_mask).sum()
instead of
>>> marr.mask = trial_mask
     >>> result = marr.sum()
since they have similar performance?
Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c, the values
at masked position of the result of (a+b) do not correspond to the sum
of the masked values in a and b. Thus, by "unmasking" c you are exposing
nonsense values, which to me seems likely to cause heisenbugs.
In summary, by not making no-clobber guarantees and by strictly
preventing exposure of nonsense values, I suspect that: 1. my new code
is simpler and faster by avoiding lots of copies, and forces copies to
be explicit in user code. 2. disallowing direct modification of the mask
lowers the "API surface area" making people's MaskedArray code less
buggy and easier to read: Exposure of nonsense values by "unmasking" is
one less possibility to keep in mind.
Best,
Allan
...
On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com
<mailto:allanhaldane@gmail.com>> wrote:
On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
    > Hi Allan,
    >
    > This is very impressive! I could get the tests that I wrote for my
    class
    > pass with yours using Quantity with what I would consider very
minimal
    > changes. I only could not find a good way to unmask data (I like
the
    > idea of setting the mask on some elements via `ma[item] = X`); is
this
    > on purpose?
Yes, I want to make it difficult for the user to access the garbage
    values under the mask, which are often clobbered values. The only
way to
    "remove" a masked value is by replacing it with a new non-masked
value.
> Anyway, it would seem easily at the point where I should comment
    on your
    > repository rather than in the mailing list!
To make further progress on this encapsulation idea I need a more
    complete ducktype to pass into MaskedArray to test, so that's what
I'll
    work on next, when I have time. I'll either try to finish my
    ArrayCollection type, or try making a simple NDunit ducktype
    piggybacking on astropy's Unit.
Best,
    Allan
>
    > All the best,
    >
    > Marten
    >
    >
    > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
    <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>
    > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>
    wrote:
    >
    >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
    >     >
    >     >
    >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
    >     <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>
    <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>
    >     > <mailto:allanhaldane@gmail.com
    <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com
    <mailto:allanhaldane@gmail.com>>>>
    >     wrote:
    >     > <snip>
    >     >
    >     >     > This may be too much to ask from the initializer,
but, if
    >     so, it still
    >     >     > seems most useful if it is made as easy as possible
to do,
    >     say, `class
    >     >     > MaskedQuantity(Masked, Quantity): <very few
overrides>`.
    >     >
    >     >     Currently MaskedArray does not accept ducktypes as
    underlying
    >     arrays,
    >     >     but I think it shouldn't be too hard to modify it to do
so.
    >     Good idea!
    >     >
    >     >
    >     > Looking back at my trial, I see that I also never got to
    duck arrays -
    >     > only ndarray subclasses - though I tried to make the code as
    >     agnostic as
    >     > possible.
    >     >
    >     > (Trial at
    >     >
    >
https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...
)
...
>     >
    >     >     I already partly navigated this mixin-issue in the
    >     >     "MaskedArrayCollection" class, which essentially does
    >     >     ArrayCollection(MaskedArray(array)), and only takes
about 30
...
>     lines of
    >     >     boilerplate. That's the backwards encapsulation order
from
...
>     what you want
    >     >     though.
    >     >
    >     >
    >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
    u.m,
    >     > mask=[True, False, False])` does indeed not have a `.unit`
    attribute
    >     > (and cannot represent itself...); I'm not at all sure that
my
...
>     method of
    >     > just creating a mixed class is anything but a recipe for
    disaster,
    >     though!
    >
    >     Based on your suggestion I worked on this a little today, and
    now my
    >     MaskedArray more easily encapsulates both ducktypes and
ndarray
...
>     subclasses (pushed to repo). Here's an example I got working
    with masked
    >     units using unyt:
    >
    >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
    >
    >     [2]: from unyt import m, km
    >
    >     [3]: import numpy as np
    >
    >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
    >
    >     [5]: uarr
    >
    >     MaskedArray([1., X , 3.])
    >     [6]: uarr + 1*m
    >
    >     MaskedArray([1.001, X    , 3.001])
    >     [7]: uarr.filled()
    >
    >     unyt_array([1., 0., 3.], 'km')
    >     [8]: np.concatenate([uarr, 2*uarr]).filled()
    >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
    >
    >     The catch is the ducktype/subclass has to rigorously follow
    numpy's
    >     indexing rules, including distinguishing 0d arrays from
    scalars. For now
    >     only I used unyt in the example above since it happens to be
    less strict
    >      about dimensionless operations than astropy.units which trips
    up my
    >     repr code. (see below for example with astropy.units). Note in
    the last
    >     line I lost the dimensions, but that is because unyt does not
    handle
    >     np.concatenate. To get that to work we need a true ducktype
    for units.
    >
    >     The example above doesn't expose the ".units" attribute
    outside the
    >     MaskedArray, and it doesn't print the units in the repr. But
    you can
    >     access them using "filled".
    >
    >     While I could make MaskedArray forward unknown attribute
    accesses to the
    >     encapsulated array, that seems a bit dangerous/bug-prone at
first
...
>     glance, so probably I want to require the user to make a
    MaskedArray
    >     subclass to do so. I've just started playing with that
    (probably buggy),
    >     and Ive attached subclass examples for astropy.unit and unyt,
    with some
    >     example output below.
    >
    >     Cheers,
    >     Allan
    >
    >
    >
    >     Example using the attached astropy unit subclass:
    >
    >         >>> from astropy.units import m, km, s
    >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
    >         >>> uarr
    >         MaskedQ([1., X , 1.], units=km)
    >         >>> uarr.units
    >         km
    >         >>> uarr + (1*m)
    >         MaskedQ([1.001, X    , 1.001], units=km)
    >         >>> uarr/(1*s)
    >         MaskedQ([1., X , 1.], units=km / s)
    >         >>> (uarr*(1*m))[1:]
    >         MaskedQ([X , 1.], units=km m)
    >         >>> np.add.outer(uarr, uarr)
    >         MaskedQ([[2., X , 2.],
    >                  [X , X , X ],
    >                  [2., X , 2.]], units=km)
    >         >>> print(uarr)
    >         [1. X  1.] km m
    >
    >     Cheers,
    >     Allan
    >
    >
    >     >     > Even if this impossible, I think it is conceptually
useful
...
>     to think
    >     >     > about what the masking class should do. My sense is
that,
...
>     e.g., it
    >     >     > should not attempt to decide when an operation
    succeeds or not,
    >     >     but just
    >     >     > "or together" input masks for regular, multiple-input
    functions,
    >     >     and let
    >     >     > the underlying arrays skip elements for reductions by
    using
    >     `where`
    >     >     > (hey, I did implement that for a reason... ;-). In
    >     particular, it
    >     >     > suggests one should not have things like domains and
all
...
>     that (I never
    >     >     > understood why `MaskedArray` did that). If one wants
more,
...
>     the class
    >     >     > should provide a method that updates the mask (a
sensible
...
>     default
    >     >     might
    >     >     > be `mask |= ~np.isfinite(result)` - here, the class
    being masked
    >     >     should
    >     >     > logically support ufuncs and functions, so it can
    decide what
    >     >     "isfinite"
    >     >     > means).
    >     >
    >     >     I agree it would be nice to remove domains. It would
    make life
    >     easier,
    >     >     and I could remove a lot of twiddly code! I kept it in
    for now to
    >     >     minimize the behavior changes from the old MaskedArray.
    >     >
    >     >
    >     > That makes sense. Could be separated out to a
    backwards-compatibility
    >     > class later.
    >     >
    >     >
    >     >     > In any case, I would think that a basic truth should
    be that
    >     >     everything
    >     >     > has a mask with a shape consistent with the data, so
    >     >     > 1. Each complex numbers has just one mask, and setting
    >     `a.imag` with a
    >     >     > masked array should definitely propagate the mask.
    >     >     > 2. For a masked array with structured dtype, I'd
    similarly say
    >     >     that the
    >     >     > default is for a mask to have the same shape as the
array.
...
>     But that
    >     >     > something like your collection makes sense for the
case
...
>     where one
    >     >     wants
    >     >     > to mask items in a structure.
    >     >
    >     >     Agreed that we should have a single bool per complex or
    structured
    >     >     element, and the mask shape is the same as the array
shape.
...
>     That's how I
    >     >     implemented it. But there is still a problem with
    complex.imag
    >     >     assignment:
    >     >
    >     >         >>> a = MaskedArray([1j, 2, X])
    >     >         >>> i = a.imag
    >     >         >>> i[:] = MaskedArray([1, X, 1])
    >     >
    >     >     If we make the last line copy the mask to the original
    array, what
    >     >     should the real part of a[2] be? Conversely, if we don't
    copy
    >     the mask,
    >     >     what should the imag part of a[1] be? It seems like we
might
...
>     "want" the
    >     >     masks to be OR'd instead, but then should i[2] be masked
    after
    >     we just
    >     >     set it to 1?
    >     >
    >     > Ah, I see the issue now... Easiest to implement and closest
    in analogy
    >     > to a regular view would be to just let it unmask a[2] (with
    >     whatever is
    >     > in real; user beware!).
    >     >
    >     > Perhaps better would be to special-case such that `imag`
    returns a
    >     > read-only view of the mask. Making `imag` itself read-only
would
...
>     prevent
    >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -
but
...
>     there is
    >     > no reason this should update the mask.
    >     >
    >     > Still, neither is really satisfactory...
    >     >
    >     >
    >     >
    >     >     > p.s. I started trying to implement the above "Mixin"
    class; will
    >     >     try to
    >     >     > clean that up a bit so that at least it uses `where`
and
...
>     push it up.
    >     >
    >     >     I played with "where", but didn't include it since 1.17
    is not
    >     released.
    >     >     To avoid duplication of effort, I've attached a diff of
    what I
    >     tried. I
    >     >     actually get a slight slowdown of about 10% by using
    where...
    >     >
    >     >
    >     > Your implementation is indeed quite similar to what I got in
    >     > __array_ufunc__ (though one should "&" the where with
~mask).
...
>     >
    >     > I think the main benefit is not to presume that whatever is
    underneath
    >     > understands 0 or 1, i.e., avoid filling.
    >     >
    >     >
    >     >     If you make progress with the mixin, a push is welcome.
I
...
>     imagine a
    >     >     problem is going to be that np.isscalar doesn't work to
    detect
    >     duck
    >     >     scalars.
    >     >
    >     > I fear that in my attempts I've simply decided that only
    array scalars
    >     > exist...
    >     >
    >     > -- Marten
    >     >
    >     > _______________________________________________
    >     > NumPy-Discussion mailing list
    >     > NumPy-Discussion@python.org
    <mailto:NumPy-Discussion@python.org>
    <mailto:NumPy-Discussion@python.org
    <mailto:NumPy-Discussion@python.org>>
    >     > https://mail.python.org/mailman/listinfo/numpy-discussion
    >     >
    >
    >     _______________________________________________
    >     NumPy-Discussion mailing list
    >     NumPy-Discussion@python.org
    <mailto:NumPy-Discussion@python.org>
    <mailto:NumPy-Discussion@python.org
    <mailto:NumPy-Discussion@python.org>>
    >     https://mail.python.org/mailman/listinfo/numpy-discussion
    >
    >
    > _______________________________________________
    > NumPy-Discussion mailing list
    > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>
    > https://mail.python.org/mailman/listinfo/numpy-discussion
    >
_______________________________________________
    NumPy-Discussion mailing list
    NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>
    https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] new MaskedArray class

Aldcroft, Thomas