[Numpy-discussion] masked and missing values, a historical perspective and some comments
Pierre GM
pgmdevlist at gmail.com
Wed May 16 06:22:51 EDT 2012
All,
I’ve been fairly quiet on the various missing/masked values proposals,
sorry about that. I figured I would take the opportunity of Travis O.’s
deadline of Wed. 16th to put some aspects of numpy.ma in a semi-historical,
semi-personal perspective. Apologies in advance if I misrepresent some
aspects, I count on our archivists to set the record straight when needed.
Let it be also known that I haven't had any time to play with the new NA
implementation...
Once upon a time...
-------------------
numpy.ma was developed on top of Paul Dubois’ implementation for Numeric
and its adaptation to Numarray. Originally a masked array was its own
object, the combination of a regular ndarray (the `.data`) and its `.mask`
counterpart, a boolean ndarray with the same shape as the `.data`. Two
special values were also defined, `nomask`, a flag in practice equal to
False indicating that no item was masked in the `.data` array and `masked`,
representing an item being missing or ignored in the `.data`. The `.mask`
had either `True` items (if the corresponding item in the `.data` was
missing or ignored, ie was `masked`) or `False` items otherwise. Nothing
surprising here.
I started to use numpy to analyze hydroclimatic datasets that are often
incomplete: for some reasons or another, data wasn’t recorded for some
period of time, and I wanted to take this aspect into account.
Maskedarrays-as-objects met my needs until I tried to create a subclass…
After unsuccessful attempts, doubtlessly due to my inexperience, I came to
the realization that most of the issues I was experiencing could perhaps be
solved by making the MaskedArray class a subclass of ndarray.
The new implementation followed the original one as much as possible, for
compatibility of course but also because I felt that there must have been
some pretty good reasons for why things were coded the way they were, if I
couldn’t exactly see why at the time. For example, the `nomask` flag, the
`fill value` concept, the shared mask approach were kept, as well as the
MaskedUnary/MaskedBinary/DomainBinaryOperations layers. On that regard,
actually, there have been some modifications. Initially, the `.mask` of the
output was computed first (from the `.mask` of the input(s) and a validity
domain, if any), and the non-masked part filled accordingly. This approach
worked well in most cases but turned out to be problematic in some domained
operations: finding the domain of np.pow in particular required too many
tests. Therefore, the most efficient (well, least inefficient) approach was
to compute first the output `.data` and then its mask from the input masks
and the invalid results. An ugly hack, yes, but which was deemed necessary
with a pure Python implementation.
Paul Dubois had confided during an email exchange that if things had been
different, he would have coded MaskedArrays in C. I would have done the
same thing if I could speak C (I'm still slowly learning it). I even
considered Cython for a while, but never managed to subclass a ndarray
(things may have changed now, it's been a while since I last tried).
<tl;dr>
I think that most of the current issues with numpy.ma come from the fact
that it's pure Python. The API itself hasn't really changed since the very
beginning of Numeric, even if it has been improved (e.g. to support
structured type) and the original masked-array-as-an-object concept has
been replaced by a subclass of ndarray. IMHO, most of the overhead would be
compensated by a lower-level implementation.
In any case, numpy.ma is currently in pure Python and therefore suboptimal.
For this reason, I always considered it mostly as a convenience module for
simple problems; for more elaborate ones, I advised to manipulate the
`.data` and `.mask` (actually, the `._data` and `._mask`...) independently,
at least till we come with a more efficient solution.
I'm not saying that the current concepts behind numpy.ma are perfect, by
far. There's no distinction between np.NA and np.IGNORED, for example, and
numpy.ma automatically mask invalid data (like INF or NAN), which may be
numpy.ma unmaintained ?
-----------------------
Chuck claimed [
http://article.gmane.org/gmane.comp.python.numeric.general/49815] that
numpy.ma is unmaintained. While I must recognized I haven't been able to
work on the bugs for the past 18 months, I'm hoping that my soon-to-be
future employer will leave me a bit of time to work on open-source projects
(anyway, if this summer's weather is like this spring's, I won't be
spending a lot of time in a park).
Nevertheless, the tickets I've seen so far don't look that scary and could
be fixed rather easily by anybody. The ticket about the printing of masked
arrays with a given precision is trickier, though (as the current
implementation relies on a trick to avoid having to redo everything).
MaskedArray by default? Yesss!
------------------------------
So far, I'm 100% with Chuck when he recently [
http://article.gmane.org/gmane.comp.python.numeric.general/49489] suggested
to transform all ndarrays in masked arrays. That was something I had
half-jokingly suggested at the time, because it would have made things so
much easier for me. Of course, most users never have to deal with missing
or ignored values, and forcing a ndarray to carry an extra mask is
counter-productive. Nevertheless, it shouldn't be as a problem, as Chuck
pointed out.
numpy.ma has already the `nomask` flag that is used to check whether there
are no missing/ignored data in the array (a nifty trick Paul Dubois
introduced to save some time)
>>> if marray.mask is nomask:
>>> (no missing values: do as if you have a basic ndarray)
>>> else:
>>> (some missing values: proceed as needed)
We could use the same scheme in C, with a flag indicating whether a mask is
defined. If the mask is not set, we could use the standard ufuncs. If it
is, we would use ufuncs modified to take a mask into account, or raise some
exception if they are not yet defined or could not be defined (e.g., FFT).
Introducing the flag would no doubt break some things downstream, but it
still looks like the easiest.
np.NA vs np.IGNORED / bitpattern vs mask
----------------------------------------
I think everybody recognizes that np.NA and np.IGNORED are different yet
complementary concepts. A recent post[
http://article.gmane.org/gmane.comp.python.numeric.general/49347]
illustrated a use of missing data for concentrations below a given limit.
Now imagine a series of concentrations recorded over a period of time where
(i) the captor failed to work (no point) (ii) the captor worked properly
but only traces were recorded (a point below limit). The first case (i)
would be np.NA, the second np.IGNORED. Should you perform some time series
statistics, you may want to be able to distinguish between the two.
I hope we all agree that np.NA is destructive, np.IGNORED is not.
I would expect to be able to assign values as such:
a[...] = np.NA <=> a[...].data=np.NA
a[...] = np.IGNORED <=> a[...].mask=True (numpy.ma convention)
That means that I could still 'ignore' a 'missing' value, by setting the
corresponding item in the `.mask` to the appropriate value (True w/ the
numpy.ma convention, False in Mark's...).
Finding where the np.NA are in an array could be done through a np.isna
function (analogous to np.isnan). What about the ignored values? Should we
just take the .mask? Do we need a np.isignored ?
Computationally, both np.NA and np.IGNORED could be lumped in a single
mask. We would then have to decide how to use this combined mask, which
brings the subject of propagation.
I disagree w/ Lluis' proposal [
http://article.gmane.org/gmane.comp.python.numeric.general/46728] to
consider destructiveness and propagation as orthogonal properties. I agree
with the following, though:
np.NA : propagating
np.IGNORED : non-propagating.
If a user wants to have a non-destructive but propagating flag, it'd be up
to her (give me an example where it could be useful).
On a value basis, I agree with
operation(np.NA, x) -> np.NA [eg, np.NA + x = np.NA]
operation(np.IGNORED, x) -> np.IGNORED [eg, np.IGNORED + x = np.IGNORED]
And now ?
---------
I think a first step would be to agree on what the behavior should be,
which would define a set of unit tests. Could we consider a wiki page
somewhere that would include polls, so that each of us could vote and/or
discuss specifics ?
Then, of course, there's the matter of implementation. I won't have a lot
of time to propose actual code for the next few weeks (couple of months),
so I'm not expecting my voice to weigh more than others…
I'd be glad to participate in civil discussions or to answer emails
off-list on numpy.ma related aspects and hopefully I'll be more active by
the end of the summer, beginning of the fall.
Cordially,
P.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120516/60305f2b/attachment.html>
More information about the NumPy-Discussion
mailing list