Re: [Numpy-discussion] new MaskedArray class

June 19, 2019

      On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
...
On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com
<mailto:allanhaldane@gmail.com>> wrote:
<snip>
> This may be too much to ask from the initializer, but, if so, it still
    > seems most useful if it is made as easy as possible to do, say, `class
    > MaskedQuantity(Masked, Quantity): <very few overrides>`.
Currently MaskedArray does not accept ducktypes as underlying arrays,
    but I think it shouldn't be too hard to modify it to do so. Good idea!
Looking back at my trial, I see that I also never got to duck arrays -
only ndarray subclasses - though I tried to make the code as agnostic as
possible.
(Trial at
https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...)
I already partly navigated this mixin-issue in the
    "MaskedArrayCollection" class, which essentially does
    ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
    boilerplate. That's the backwards encapsulation order from what you want
    though.
Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
mask=[True, False, False])` does indeed not have a `.unit` attribute
(and cannot represent itself...); I'm not at all sure that my method of
just creating a mixed class is anything but a recipe for disaster, though!
Based on your suggestion I worked on this a little today, and now my
MaskedArray more easily encapsulates both ducktypes and ndarray
subclasses (pushed to repo). Here's an example I got working with masked
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.])
[6]: uarr + 1*m

MaskedArray([1.001, X    , 3.001])
[7]: uarr.filled()

unyt_array([1., 0., 3.], 'km')
[8]: np.concatenate([uarr, 2*uarr]).filled()
unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's
indexing rules, including distinguishing 0d arrays from scalars. For now
only I used unyt in the example above since it happens to be less strict
 about dimensionless operations than astropy.units which trips up my
repr code. (see below for example with astropy.units). Note in the last
line I lost the dimensions, but that is because unyt does not handle
np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the
MaskedArray, and it doesn't print the units in the repr. But you can
access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the
encapsulated array, that seems a bit dangerous/bug-prone at first
glance, so probably I want to require the user to make a MaskedArray
subclass to do so. I've just started playing with that (probably buggy),
and Ive attached subclass examples for astropy.unit and unyt, with some
example output below.

Cheers,
Allan

Example using the attached astropy unit subclass:

    >>> from astropy.units import m, km, s
    >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
    >>> uarr
    MaskedQ([1., X , 1.], units=km)
    >>> uarr.units
    km
    >>> uarr + (1*m)
    MaskedQ([1.001, X    , 1.001], units=km)
    >>> uarr/(1*s)
    MaskedQ([1., X , 1.], units=km / s)
    >>> (uarr*(1*m))[1:]
    MaskedQ([X , 1.], units=km m)
    >>> np.add.outer(uarr, uarr)
    MaskedQ([[2., X , 2.],
             [X , X , X ],
             [2., X , 2.]], units=km)
    >>> print(uarr)
    [1. X  1.] km m

Cheers,
Allan
...
> Even if this impossible, I think it is conceptually useful to think
    > about what the masking class should do. My sense is that, e.g., it
    > should not attempt to decide when an operation succeeds or not,
    but just
    > "or together" input masks for regular, multiple-input functions,
    and let
    > the underlying arrays skip elements for reductions by using `where`
    > (hey, I did implement that for a reason... ;-). In particular, it
    > suggests one should not have things like domains and all that (I never
    > understood why `MaskedArray` did that). If one wants more, the class
    > should provide a method that updates the mask (a sensible default
    might
    > be `mask |= ~np.isfinite(result)` - here, the class being masked
    should
    > logically support ufuncs and functions, so it can decide what
    "isfinite"
    > means).
I agree it would be nice to remove domains. It would make life easier,
    and I could remove a lot of twiddly code! I kept it in for now to
    minimize the behavior changes from the old MaskedArray.
That makes sense. Could be separated out to a backwards-compatibility
class later.
> In any case, I would think that a basic truth should be that
    everything
    > has a mask with a shape consistent with the data, so
    > 1. Each complex numbers has just one mask, and setting `a.imag` with a
    > masked array should definitely propagate the mask.
    > 2. For a masked array with structured dtype, I'd similarly say
    that the
    > default is for a mask to have the same shape as the array. But that
    > something like your collection makes sense for the case where one
    wants
    > to mask items in a structure.
Agreed that we should have a single bool per complex or structured
    element, and the mask shape is the same as the array shape. That's how I
    implemented it. But there is still a problem with complex.imag
    assignment:
    >>> a = MaskedArray([1j, 2, X])
        >>> i = a.imag
        >>> i[:] = MaskedArray([1, X, 1])
If we make the last line copy the mask to the original array, what
    should the real part of a[2] be? Conversely, if we don't copy the mask,
    what should the imag part of a[1] be? It seems like we might "want" the
    masks to be OR'd instead, but then should i[2] be masked after we just
    set it to 1?
Ah, I see the issue now... Easiest to implement and closest in analogy
to a regular view would be to just let it unmask a[2] (with whatever is
in real; user beware!).
Perhaps better would be to special-case such that `imag` returns a
read-only view of the mask. Making `imag` itself read-only would prevent
possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is
no reason this should update the mask.
Still, neither is really satisfactory...

> p.s. I started trying to implement the above "Mixin" class; will
    try to
    > clean that up a bit so that at least it uses `where` and push it up.
I played with "where", but didn't include it since 1.17 is not released.
    To avoid duplication of effort, I've attached a diff of what I tried. I
    actually get a slight slowdown of about 10% by using where...
Your implementation is indeed quite similar to what I got in
__array_ufunc__ (though one should "&" the where with ~mask).
I think the main benefit is not to presume that whatever is underneath
understands 0 or 1, i.e., avoid filling.
If you make progress with the mixin, a push is welcome. I imagine a
    problem is going to be that np.isscalar doesn't work to detect duck
    scalars.
I fear that in my attempts I've simply decided that only array scalars
exist...
-- Marten
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion