[Numpy-discussion] new MaskedArray class

Mon Jun 17 18:43:05 EDT 2019

Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using
`__array_function__`, which I got mostly working. It seems like a good
time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a
pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
What is missing? It needs even more tests for new functionality, and a
couple numpy-API functions are missing, in particular `np.median`,
`np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying
the logic by always storing the mask in full, it also has new features.
For instance it allows the use of a "X" variable to mark masked
locations during array construction, and I solve the issue of how to
mask individual fields of a structured array differently.

At this point I would by happy to get some feedback on the design and
what seems good or bad. If it seems like a good start, I'd be happy to
move it into a numpy repo of some sort for further collaboration &
discussion, and maybe into 1.18. At the least I hope it can serve as a
design study of what we could do.

Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays,
and similarly about field-assignment of structured arrays. The problem
is that we have a single mask bool per element of a complex array, but
what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using
`__array_function__` which I think many ducktype implementors will have
to face. For MaskedArray, I created an associated "MaskedScalar" type.
However, MaskedScalar has to behave differently from normal numpy
scalars in a number of ways: It is not part of the numpy scalar
hierarchy, it fails checks `isinstance(var, np.floating)`, and
np.isscalar returns false. Numpy scalar types cannot be subclassed. We
have discussed before the need to have distinction between 0d-arrays and
scalars, so we shouldn't just use a 0d (in fact, this makes printing
very difficult). This leads me to think that in future dtype-overhaul
plans, we should consider creating a subclassable `np.scalar` base type
to wrap all numpy scalar variables, and code like `isinstance(var,
np.floating)` should be replaced by `isinstance(var.dtype.type,
np.floating)` or similar. That is, the numeric dtype of the scalar is no
longer encoded in `type(var)` but in `var.dtype`: The fact that the
variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars
in common with arrays which have nothing to do with the dtype, which
ducktype implementors want to touch. I imagine this will come up a lot:
In that repo I also have an "ArrayCollection" ducktype which required a
"CollectionScalar" scalar, and similarly I imagine people implementing
units want the units attached to the scalar, independently of the dtype.

Cheers,
Allan