[Numpy-discussion] Re: ndarray.fill and ma.array.filled
Tim Hochberg
tim.hochberg at cox.net
Fri Apr 7 17:09:11 EDT 2006
Sasha wrote:
>On 4/7/06, Tim Hochberg <tim.hochberg at cox.net> wrote:
>
>
>>[...]
>>
>>However, I do think the situation needs more thought. Slapping filled
>>and mask onto ndarray is the path of least resistance, but it's not
>>clear that it's the best one.
>>
>>
>
>Completely agree. I have many gripes about current ma implementation
>of both "filled" and "mask".
>
>filled:
>
>1. I don't like default fill value. It should be mandatory to
>supply fill value.
>
>
That makes perfect sense. If anything should have a default fill value,
it's the functions calling filled, not the arrays themselves.
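A minimal sketch of that idea, assuming today's numpy.ma rather than the 2006 ma module: the calling function owns the default and always passes an explicit fill value to filled. The function name total is hypothetical.

```python
import numpy as np

def total(values, fill_value=0.0):
    # The caller, not the array, decides what a masked entry should count as.
    return np.ma.asarray(values).filled(fill_value).sum()

x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
default_total = total(x)                  # masked entry treated as 0.0
other_total = total(x, fill_value=10.0)   # masked entry treated as 10.0
```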
>2. It should return masked array (with trivial mask), not ndarray.
>
>
So, just with mask = False? In a follow-on message Pierre disagrees and
claims that what you really want is the ndarray, since not everything
will accept a masked array. Then I guess you'd need to call
b.filled(fill).data. I agree with Sasha in principle but with Pierre,
perhaps, in practice. I almost suggested it get renamed
a.asndarray(fill), except that asXXX has the wrong connotations. I think
this one needs to bounce around some more.
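The two return types under discussion can be compared side by side. This is only a sketch against current numpy.ma, where filled returns a plain ndarray; the "masked array with trivial mask" variant is emulated by re-wrapping the result.

```python
import numpy as np

b = np.ma.array([1, 2, 3], mask=[False, True, False])

# Pierre's preference: a plain ndarray that any function will accept.
plain = b.filled(0)

# Sasha's preference, emulated: a masked array whose mask is trivially False.
wrapped = np.ma.array(b.filled(0))

# Getting back to the ndarray from the wrapped form takes one more step.
unwrapped = wrapped.data
```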
>3. The name conflicts with the "fill" method.
>
>
I thought you wanted to kill that. I'd certainly support that. Can't we
just special case __setitem__ for that one case so that the performance
is just as good if performance is really the issue?
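For reference, the two spellings give the same result in current numpy; whether __setitem__ can be made exactly as fast as fill is a separate question this sketch does not measure.

```python
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)
b = a.copy()

a.fill(7.0)    # the method whose name clashes with "filled"
b[...] = 7.0   # the same operation spelled through __setitem__
```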
>4. View/Copy inconsistency. Does not provide a method to fill values in-place.
>
>
b[b.mask] = fill_value; b.unmask()
seems to work for this purpose. Can we just have filled return a copy?
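A rough equivalent in current numpy.ma, which has no unmask method: here clearing the mask by assigning nomask stands in for b.unmask(). The snapshot of the mask is taken first so the index does not change while it is being used.

```python
import numpy as np

b = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
fill_value = -1.0

masked = np.ma.getmaskarray(b).copy()  # snapshot of which entries are masked
b[masked] = fill_value                 # write the fill value in place
b.mask = np.ma.nomask                  # stand-in for b.unmask()
```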
>mask:
>
>1. I've got rid of mask returning None in favor of False_ (boolean
>array scalar), but it is still not perfect. I would prefer data.shape
>== mask.shape invariant and if space saving/performance is deemed
>necessary use zero-stride arrays.
>
>
Interesting idea. Is that feasible yet?
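In current numpy a zero-stride mask of this kind can be built with np.broadcast_to. This is a sketch of the idea only, not of how ma stores its mask: the mask has the same shape as the data, yet every element aliases one stored boolean.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# A read-only view with the full shape, backed by a single False value.
mask = np.broadcast_to(np.False_, data.shape)

same_shape = mask.shape == data.shape
strides = mask.strides  # every stride is zero
```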
>2. I don't like the name. "Missing" or "na" would be better.
>
>
I'm not on board here, although really I'd like to hear from other
people who use the package. 'na' seems too cryptic to me and 'missing' too
specific -- there might be other reasons to mask a value other than it being
missing. The problem with mask is that it's not clear whether
True means the data is useful or not. Keep throwing out names,
maybe one will stick.
>
>
>>If we do decide we are going to add both of these methods to ndarray
>>(with filled returning a copy!), then it may be worth considering making
>>ndarray a subclass of MaskedArray. Conceptually this makes sense, since
>>at this point an ndarray will just be a MaskedArray where mask is always
>>False. I think that they could share much of the implementation except
>>that ndarray would be set up to use methods that ignored the mask
>>attribute since they would know that it's always false. Even that might
>>not be worth it, since the check for whether mask is True/False is just
>>a pointer compare.
>>
>>
>>
>
>The tail becoming the dog! Yet I agree, this makes sense from the
>implementation point of view. From OOP perspective this would make
>sense if arrays were immutable, but since mask is settable in
>MaskedArray, making it constant in the subclass will violate the
>substitution principle. I would not object making mask read only,
>however.
>
>
How do you set the mask? I keep getting attribute errors when I try it.
And unmask would be a noop on an ndarray.
>
>
>>It may in fact be best just to do away with MaskedArray entirely, moving
>>the functionality into ndarray. That may have performance implications,
>>although I don't see them at the moment, and I don't know if there are
>>other methods/attributes that this would imply need to be moved over,
>>although it looks like just mask, filled and possibly filled_value,
>>although the latter looks a little dubious to me.
>>
>>
>>
>I think MA can coexist with ndarray and share the interface. Ndarray
>can use special bit-patterns like IEEE NaN to indicate missing
>floating point values. Add-on modules can redefine arithmetic to make
>INT_MIN behave as a missing marker for signed integers (R, K and J (I
>think) languages use this approach). Applications that need missing
>values support across the board will use MA.
>
>
>
>
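The bit-pattern approach Sasha describes can be sketched with plain ndarrays. NaN handling uses numpy's existing nan-aware functions; the INT_MIN convention is not something numpy implements, so the sentinel INT_MISSING and the helper int_mean below are hypothetical.

```python
import numpy as np

# Floating point: IEEE NaN marks the missing entry.
x = np.array([1.0, np.nan, 3.0])
float_mean = np.nanmean(x)

# Signed integers: reserve the minimum value as a "missing" marker.
INT_MISSING = np.iinfo(np.int64).min

def int_mean(values):
    # Hypothetical helper: skip entries equal to the sentinel.
    values = np.asarray(values, dtype=np.int64)
    valid = values != INT_MISSING
    return values[valid].mean()

y = np.array([1, INT_MISSING, 3], dtype=np.int64)
integer_mean = int_mean(y)
```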
>>Either of the above two options would certainly improve the quality of
>>MaskedArray. Copy for instance seems not to have been implemented, and
>>who knows what other dark corners remain unexplored here.
>>
>>
>>
>More (corners) than you want to know about! Reimplementing MA in C
>would be a worthwhile goal (and what you suggest seems to require just
>that), but it is too big of a project. I suggest that we focus on the
>interface first. If existing MA interface is rejected (which is
>likely) for ndarray, we can easily experiment with the alternatives
>within MA, which is pure python.
>
>
Perhaps MaskedArray should inherit from ndarray for the time being. Many
of the methods would need to be reimplemented anyway, but it would make
asanyarray work. Someone was just complaining about asarray munging his
arrays. That's correct behaviour, but it would be nice if asanyarray did
the right thing. I suppose we could just special case asanyarray to
ignore MaskedArrays; that might be better since it's less constraining
from an implementation side too.
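In current numpy, where MaskedArray is an ndarray subclass, the difference between the two conversion functions looks like this:

```python
import numpy as np

m = np.ma.array([1, 2, 3], mask=[False, True, False])

kept = np.asanyarray(m)   # subclasses pass through, mask intact
munged = np.asarray(m)    # converted to a base ndarray, mask dropped
```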
>>There's a whole spectrum of possibilities here from ones that don't
>>intrude on ndarray at all to ones that profoundly change it. Sasha's
>>suggestion looks like it's probably the simplest thing in the short
>>term, but I don't know that it's the best long term solution. I think it
>>needs more thought and discussion, which is after all what Sasha asked
>>for ;)
>>
>>
>
>Exactly!
>
>
This may be an opportune time to propose something that's been cooking in
the back of my head for a week or so now: A stripped down array
superclass. The details of this are not at all locked down, but here's a
strawman proposal.
We add an array superclass, call it basearray, that has the same
C-structure as the existing ndarray. However, it has *no* methods or
attributes. It's simply a big blob of data. Functions that work on
the C structure of arrays (ufuncs, etc) would still work on these
arrays, as would asarray, so it could be converted to an ndarray as
necessary. In addition, we would supply a minimal set of functions
that would operate on this object. These functions would be chosen
so that the current array interface could be implemented on top of
them and the basearray object in pure python. These functions would
be things like set_shape(a, shape), etc. They would be segregated
off in their own namespace, not in the numpy core. [Note that I'm
not proposing we actually implement ndarray this way, just that we
make it possible]. This leads to several useful outcomes.
1. If we're careful, this could be the basic array object that
we propose, at least for the first round, for inclusion in the
Python core. It's not useful for anything but passing data between
various applications that understand the data structure, but that in
itself could be a huge win. And the fact that it's dirt simple would
probably be an advantage to getting it into the core.
2. It provides a useful marker class. MA could inherit from it
(and use it for its data attribute) and then asanyarray would
behave properly. MA could also use this, or a subclass, as the mask
object, preventing anyone from accidentally using it as data (they
could always use it on purpose with asarray).
3. It provides a platform for people to build other,
ndarray-like classes in pure Python. This is my main interest. I've
put together a thin shell over numpy that strips it down to its
absolute essentials including a stripped down version of ndarray that
removes most of the methods. All of the __array_wrap__[1] stuff
works quite well most of the time, but there are still some issues
with being a subclass when this particular class is conceptually a
superclass. If we had an array superclass of some sort, I believe
that these would be resolved.
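To make the strawman concrete, here is a toy pure-Python sketch. It does not share ndarray's C structure and is not a proposed implementation; BaseArray, get_shape, get_item and TinyArray are hypothetical names, and only set_shape(a, shape) comes from the proposal itself. The base class carries data and nothing else, the free functions live outside it, and a friendlier class is layered on top of both.

```python
from array import array
from math import prod

class BaseArray:
    """Data only: no public methods or attributes."""
    __slots__ = ("_buffer", "_shape")

    def __init__(self, values, shape):
        self._buffer = array("d", values)
        if prod(shape) != len(self._buffer):
            raise ValueError("shape does not match number of elements")
        self._shape = tuple(shape)

# Free functions, kept apart from the class (the separate namespace).
def get_shape(a):
    return a._shape

def set_shape(a, shape):
    shape = tuple(shape)
    if prod(shape) != len(a._buffer):
        raise ValueError("cannot reshape: element count differs")
    a._shape = shape

def get_item(a, index):
    if len(index) != len(a._shape):
        raise IndexError("wrong number of indices")
    offset = 0
    for i, n in zip(index, a._shape):  # row-major flat offset
        if not 0 <= i < n:
            raise IndexError("index out of range")
        offset = offset * n + i
    return a._buffer[offset]

class TinyArray(BaseArray):
    """An ndarray-like interface written purely on top of the functions."""
    __slots__ = ()

    @property
    def shape(self):
        return get_shape(self)

    def __getitem__(self, index):
        return get_item(self, index)

a = TinyArray(range(6), (2, 3))
before = a[1, 2]
set_shape(a, (3, 2))
after = a[2, 1]
```

Because TinyArray inherits from BaseArray, an isinstance check against BaseArray also serves as the marker test described in point 2.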
In principle at least, this shouldn't be that hard. I think it should
mostly be rearranging some code and adding some wrappers to existing
functions. That's in principle. In practice, I'm not certain yet as I
haven't investigated the code in question in much depth yet. I've been
meaning to write this up into a more fleshed out proposal, but I got
distracted by the whole Protocol discussion on python-dev3000. This
writeup is pretty weak, but hopefully you get the idea.
Anyway, this is something that I would be willing to put some time into
that would benefit both me and probably the MA folks as well.
Regards,
-tim