[Numpy-discussion] Separating out the maskna code

Nathaniel Smith njs at pobox.com
Sat May 19 16:54:38 EDT 2012

Hi all,

Since Mark's original missingdata branch made so many changes, I
figured it would be a useful exercise to figure out what code in
master is actually related to masked arrays, and which isn't. The
easiest way seemed to be to delete the new fields, then keep removing
any code that depended on them until everything built again, which
gives this:

Possible uses:
  - Use the diff as a checklist for going through to change the API
  - Use the diff as a checklist for adding an experimental-API-only flag
  - Merge into master, then use as a reference to cherrypick the
pieces that we want to save (there is some questionable stuff in here
-- e.g. copy/paste code, hand-wavy half-implementation of "multi-NA"
support, and PyArray_ReduceWrapper, see below)
  - Merge into master and then immediately 'git revert' the changes on
a branch, which would effectively 'move it aside' so we can release
1.7 while Mark and Travis to continue hacking on it at their own pace

This is a huge patch, but I was pretty careful not to cause any
accidental non-maskna-related regressions. The whole PyArray_Diagonal
thread actually happened because I noticed that test_maskna had the
only tests for np.diagonal, so I wanted to write proper ones that
would be independent of the maskna code. Also I ran the following
  - numpy.test("full")
  - scipy v0.10.1, test("full")
  - matplotlib current master
  - pandas v0.7.3
and everything looks good.

The other complicated thing to handle was the new
PyArray_ReduceWrapper function that was added to the public multiarray
API. Conceptually, this function has only a glancing relationship to
masked arrays per se. But, it has its own problems, and I don't think
it should be exposed in 1.7. Partly because its signature necessarily
changes depending on whether maskna support exists. Partly because
it's just kind of ugly (it has 15 arguments, after I removed some[1]).
But mostly because it gives us two independent generic implementations
of functions that do array->scalar operations, which seems like
something we absolutely don't want to commit to supporting
indefinitely. And the "generalized ufunc" alternative seems a lot more
promising. So, that branch also has a followup patch that does the
necessary hacking to get it out of the public API.

Unfortunately, this patch is dependent on the previous one -- I'm not
sure how to untangle PyArray_ReduceWrapper while keeping the maskna
support in, which makes the "global experimental flag" idea for 1.7
hard to implement (assuming that others agree about
PyArray_ReduceWrapper being unready for public exposure).

At this point, it might easiest to just merge this branch to master,
immediately revert it on a branch for Mark and Travis to work on, and
then release 1.7.

Ralf, IIUC merging this and my other outstanding PRs would leave the
datetime issues on python3/win32 as the only outstanding blocker?

- N

[1] http://www-pu.informatik.uni-tuebingen.de/users/klaeren/epigrams.html
, number 11

------------ Commit messages follow for reference/discussion

The main change is commit 4c16813c23b20:
    Remove maskna API from ndarray, and all (and only) the code supporting it

    The original masked-NA-NEP branch contained a large number of changes
    in addition to the core NA support. For example:
     - ufunc.__call__ support for where= argument
     - nditer support for arbitrary masks (in support of where=)
     - ufunc.reduce support for simultaneous reduction over multiple axes
     - a new "array assignment API"
     - ndarray.diagonal() returning a view in all cases
     - bug-fixes in __array_priority__ handling
     - datetime test changes
    etc. There's no consensus yet on what should be done with the
    maskna-related part of this branch, but the rest is generally useful
    and uncontroversial, so the goal of this branch is to identify exactly
    which code changes are involved in maskna support.

    The basic strategy used to create this patch was:
     - Remove the new masking-related fields from ndarray, so no arrays
       are masked
     - Go through and remove all the code that this makes
       dead/inaccessible/irrelevant, in a largely mechanical fashion. So
       for example, if I saw 'if (PyArray_HASMASK(a)) { ... }' then that
       whole block was obviously just dead code if no arrays have masks,
       and I removed it. Likewise for function arguments like skipna that
       are useless if there aren't any NAs to skip.

    This changed the signature of a number of functions that were newly
    exposed in the numpy public API. I've removed all such functions from
    the public API, since releasing them with the NA-less signature in 1.7
    would create pointless compatibility hassles later if and when we add
    back the NA-related functionality. Most such functions are removed by
    this commit; the exception is PyArray_ReduceWrapper, which requires
    more extensive surgery, and will be handled in followup commits.

    I also removed the new ndarray.setasflat method. Reason: a comment
    noted that the only reason this was added was to allow easier testing
    of one branch of PyArray_CopyAsFlat. That branch is now the main
    branch, so that isn't an issue. Nonetheless this function is arguably
    useful, so perhaps it should have remained, but I judged that since
    numpy's API is already hairier than we would like, it's not a good
    idea to add extra hair "just in case". (Also AFAICT the test for this
    method in test_maskna was actually incorrect, as noted here:
    so I'm not confident that it ever worked in master, though I haven't
    had a chance to follow-up on this.)

    I also removed numpy.count_reduce_items, since without skipna it
    became trivial.

    I believe that these are the only exceptions to the "remove dead code"

The ReduceWrapper untangling is a001fb29c9:
    Remove PyArray_ReduceWrapper from public API

    There are two reasons to want to keep PyArray_ReduceWrapper out of the
    public multiarray API:
     - Its signature is likely to change if/when masked arrays are added
     - It is essentially a wrapper for array->scalar transformations
       (*not* just reductions as its name implies -- the whole reason it
       is in multiarray.so in the first place is to support count_nonzero,
       which is not actually a reduction!). It provides some nice
       conveniences (like making it easy to apply such functions to
       multiple axes simultaneously), but, we already have a general
       mechanism for writing array->scalar transformations -- generalized
       ufuncs. We do not want to have two independent, redundant
       implementations of this functionality, one in multiarray and one in
       umath! So in the long run we should add these nice features to the
       generalized ufunc machinery. And in the short run, we shouldn't add
       it to the public API and commit ourselves to supporting it.

    However, simply removing it from numpy_api.py is not easy, because
    this code was used in both multiarray and umath. This commit:
     - Moves ReduceWrapper and supporting code to umath/, and makes
       appropriate changes (e.g. renaming it to PyUFunc_ReduceWrapper and
       cleaning up the header files).
     - Reverts numpy.count_nonzero to its previous implementation, so that
       it loses the new axis= and keepdims= arguments. This is
       unfortunate, but this change isn't so urgent that it's worth tying
       our APIs in knots forever. (Perhaps in the future it can become a
       generalized ufunc.)

More information about the NumPy-Discussion mailing list