[Numpy-discussion] Re: ndarray.fill and ma.array.filled

Fri Apr 7 15:38:04 EDT 2006

On 4/7/06, Tim Hochberg <tim.hochberg at cox.net> wrote:
> [...]
>
> However, I do think the situation needs more thought. Slapping filled
> and mask onto ndarray is the path of least resistance, but it's not
> clear that it's the best one.

Completely agree.  I have many gripes about the current ma
implementation of both "filled" and "mask".

filled:

1. I don't like the default fill value.  Supplying a fill value
should be mandatory.
2. It should return a masked array (with a trivial mask), not an ndarray.
3. The name conflicts with the "fill" method.
4. View/copy inconsistency.  It does not provide a way to fill values in place.
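
To make points 1, 2 and 4 concrete, here is a rough sketch against
numpy.ma as it exists today (the 2006 MA code may behave differently).
The helpers filled_keep_masked and fill_in_place are hypothetical names
made up for illustration, not part of any library.

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[False, True, False])

# Existing behaviour: filled() hands back a plain ndarray.
plain = a.filled(0)


def filled_keep_masked(arr, fill_value):
    """Hypothetical: fill_value is required, and the result is still a
    masked array whose mask is all False."""
    data = ma.getdata(arr).copy()            # always work on a copy
    data[ma.getmaskarray(arr)] = fill_value  # overwrite the masked slots
    return ma.array(data, mask=np.zeros(data.shape, dtype=bool))


def fill_in_place(arr, fill_value):
    """Hypothetical: overwrite the masked slots of arr itself."""
    hidden = ma.getmaskarray(arr).copy()
    arr.data[hidden] = fill_value  # arr.data is a view of the stored values
    arr.mask = False               # nothing is missing any more


b = filled_keep_masked(a, 0)  # new masked array, a is untouched
fill_in_place(a, -1)          # a itself is modified
```

Whether the in-place variant should clear the mask or leave it alone is
one of the interface questions that would need deciding.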

mask:

1. I've got rid of mask returning None in favor of False_ (boolean
array scalar), but it is still not perfect.  I would prefer a
data.shape == mask.shape invariant and, if space saving/performance is
deemed necessary, the use of zero-stride arrays.

2. I don't like the name. "Missing" or "na" would be better.
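
A minimal sketch of the zero-stride idea from point 1, assuming a NumPy
recent enough to have np.broadcast_to: a single False value is
presented as a boolean array with the same shape as the data, so the
shape invariant holds without allocating a full mask.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# One False value, viewed as a full-shape boolean array.  Every stride
# is 0, so all positions alias the same byte.
no_mask = np.broadcast_to(np.False_, data.shape)

same_shape = no_mask.shape == data.shape
strides = no_mask.strides
read_only = not no_mask.flags.writeable
```

The view is read-only, so code that wants to mask an element would
first have to replace it with a real boolean array.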

> If we do decide we are going to add both of these methods to ndarray
> (with filled returning a copy!), then it may be worth considering making
> ndarray a subclass of MaskedArray. Conceptually this makes sense, since
> at this point an ndarray will just be a MaskedArray where mask is always
> False. I think that they could share  much of the implementation except
> that ndarray would be set up to use methods that ignored the mask
> attribute since they would know that it's always false. Even that might
> not be worth it, since the check for whether mask is True/False is just
> a pointer compare.
>

The tail becoming the dog! Yet I agree, this makes sense from the
implementation point of view.  From an OOP perspective this would make
sense if arrays were immutable, but since mask is settable in
MaskedArray, making it constant in the subclass would violate the
substitution principle.  I would not object to making mask read-only,
however.
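
The substitution problem can be shown with a toy sketch in plain
Python.  MaskedSketch and PlainSketch are hypothetical stand-ins for
MaskedArray and an ndarray derived from it; they are not the real
classes.

```python
class MaskedSketch:
    """Toy stand-in for MaskedArray: the mask can be reassigned."""

    def __init__(self, data, mask=None):
        self.data = list(data)
        self._mask = [False] * len(self.data) if mask is None else list(mask)

    @property
    def mask(self):
        return self._mask

    @mask.setter
    def mask(self, value):
        self._mask = list(value)


class PlainSketch(MaskedSketch):
    """Toy stand-in for an ndarray subclass: the mask is always False."""

    @property
    def mask(self):
        return [False] * len(self.data)

    @mask.setter
    def mask(self, value):
        raise AttributeError("mask is constant in this subclass")


def hide_first(arr):
    """Code written against the base class, which assumes mask is settable."""
    arr.mask = [True] + [False] * (len(arr.data) - 1)


base = MaskedSketch([1, 2, 3])
hide_first(base)                 # fine for the base class

try:
    hide_first(PlainSketch([1, 2, 3]))
    subclass_ok = True
except AttributeError:
    subclass_ok = False          # the subclass breaks the caller
```

If mask were read-only in the base class as well, hide_first could not
be written in the first place and the conflict would disappear.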

> It may in fact be best just to do away with MaskedArray entirely, moving
> the functionality into ndarray. That may have performance implications,
> although I don't see them at the moment, and I don't know if there are
> other methods/attributes that this would imply need to be moved over,
> although it looks like just mask, filled and possibly filled_value,
> although the latter looks a little dubious to me.
>
I think MA can coexist with ndarray and share the interface.  Ndarray
can use special bit patterns like IEEE NaN to indicate missing
floating-point values.  Add-on modules can redefine arithmetic to make
INT_MIN behave as a missing marker for signed integers (the R, K and,
I think, J languages use this approach).  Applications that need
missing-value support across the board will use MA.
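
A small sketch of both sentinel approaches.  NaN handling uses existing
NumPy calls; INT_NA and na_add are hypothetical names for what such an
add-on module might provide, and the choice of int64 is an assumption.

```python
import numpy as np

# Floats: NaN already propagates through ordinary arithmetic.
x = np.array([1.0, np.nan, 3.0])
shifted = x + 1.0                 # the NaN slot stays NaN
missing_float = np.isnan(shifted)
total = np.nansum(x)              # sum that skips the NaN slot

# Signed integers: reserve the smallest value as the missing marker.
INT_NA = np.iinfo(np.int64).min   # hypothetical marker


def na_add(left, right):
    """Hypothetical NA-aware addition for int64 arrays."""
    left = np.asarray(left, dtype=np.int64)
    right = np.asarray(right, dtype=np.int64)
    missing = (left == INT_NA) | (right == INT_NA)
    # Zero out the markers first so that INT_NA + value cannot wrap around.
    summed = np.where(missing, 0, left) + np.where(missing, 0, right)
    return np.where(missing, INT_NA, summed)


result = na_add([1, INT_NA, 3], [10, 20, INT_NA])
```

Unlike NaN, the integer marker gets no help from the hardware, which is
why every arithmetic operation would have to be redefined this way.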

> Either of the above two options would certainly improve the quality of
> MaskedArray. Copy for instance seems not to have been implemented, and
> who knows what other dark corners remain unexplored here.
>
More (corners) than you want to know about! Reimplementing MA in C
would be a worthwhile goal (and what you suggest seems to require just
that), but it is too big a project.  I suggest that we focus on the
interface first.  If the existing MA interface is rejected for ndarray
(which is likely), we can easily experiment with the alternatives
within MA, which is pure Python.

> There's a whole spectrum of possibilities here from ones that don't
> intrude on ndarray at all to ones that profoundly change it. Sasha's
> suggestion looks like it's probably the simplest thing in the short
> term, but I don't know that it's the best long term solution. I think it
> needs more thought and discussion, which is after all what Sasha asked
> for ;)

Exactly!