[Numpy-discussion] Re: ndarray.fill and ma.array.filled

Fri Apr 7 17:09:11 EDT 2006

Sasha wrote:

>On 4/7/06, Tim Hochberg <tim.hochberg at cox.net> wrote:
>  
>
>>[...]
>>
>>However, I do think the situation needs more thought. Slapping filled
>>and mask onto ndarray is the path of least resistance, but it's not
>>clear that it's the best one.
>>    
>>
>
>Completely agree.  I have many gripes about  current ma implementation
>of both "filled" and "mask".
>
>filled:
>
>1. I don't like default fill value.   It should  be mandatory to
>supply fill value.
>  
>
That makes perfect sense. If anything should have a default fill value, 
it's the functions calling filled, not the arrays themselves.

>2. It should return masked array (with trivial mask), not ndarray.
>  
>
So, just with mask = False? In a follow-on message Pierre disagrees and 
claims that what you really want is the ndarray, since not everything 
will accept a masked array. Then I guess you'd need to call 
b.filled(fill).data. I agree with Sasha in principle but with Pierre, 
perhaps, in practice. I almost suggested it get renamed 
a.asndarray(fill), except that asXXX has the wrong connotations. I think 
this one needs to bounce around some more.

>3. The name conflicts with the "fill" method.
>  
>
I thought you wanted to kill that. I'd certainly support that. Can't we 
just special case __setitem__ for that one case so that the performance 
is just as good if performance is really the issue?

>4. View/Copy inconsistency.  Does not provide a method to fill values in-place.
>  
>
b[b.mask] = fill_value; b.unmask()

seems to work for this purpose. Can we just have filled return a copy?
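
As a rough illustration of the copy versus in-place distinction, here is a 
minimal sketch. It assumes the numpy.ma package as shipped with current 
NumPy, which postdates this thread, so the unmask() call from the line 
above is not used; the array contents and the name fill_value are made up 
for the example.

```python
import numpy as np
import numpy.ma as ma

fill_value = -1.0
b = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Copy semantics: filled() hands back a plain ndarray and leaves b alone.
filled_copy = b.filled(fill_value)
was_masked = bool(b.mask[1])  # b itself is still masked at index 1

# In-place semantics: write the fill value into the masked slots of b.
# With the default (soft) mask, the assignment also clears those mask bits.
b[b.mask] = fill_value
```

Under those assumptions the first form never touches b, while the second 
changes b's data and mask, which is the inconsistency being discussed.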

>mask:
>
>1. I've got rid of mask returning None in favor of False_ (boolean
>array scalar), but it is still not perfect.  I would prefer data.shape
>== mask.shape invariant and if space saving/performance  is deemed
>necessary use zero-stride arrays.
>  
>
Interesting idea. Is that feasible yet?
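
For what it's worth, the zero-stride idea can be sketched with current 
NumPy. This assumes np.broadcast_to, which did not exist when this was 
written, and the data array is just a placeholder.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# One stored False, presented with the same shape as data via zero strides.
mask = np.broadcast_to(np.False_, data.shape)

# The invariant Sasha asks for holds without allocating a full mask.
same_shape = mask.shape == data.shape
```

Here mask.strides is (0, 0) and the view is read-only, so a real 
implementation would still need to allocate a full mask before setting 
individual entries.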

>2. I don't like the name. "Missing" or "na" would be better.
>  
>
I'm not on board here, although really I'd like to hear from other 
people who use the package. 'na' seems too cryptic to me and 'missing' too 
specific -- there might be other reasons to mask a value other than it being 
missing. The problem with mask is that it's not clear whether
True means the data is useful or unuseful. Keep throwing out names, 
maybe one will stick.

>  
>
>>If we do decide we are going to add both of these methods to ndarray
>>(with filled returning a copy!), then it may be worth considering making
>>ndarray a subclass of MaskedArray. Conceptually this makes sense, since
>>at this point an ndarray will just be a MaskedArray where mask is always
>>False. I think that they could share  much of the implementation except
>>that ndarray would be set up to use methods that ignored the mask
>>attribute since they would know that it's always false. Even that might
>>not be worth it, since the check for whether mask is True/False is just
>>a pointer compare.
>>
>>    
>>
>
>The tail becoming the dog! Yet I agree, this makes sense from the
>implementation point of view.  From OOP perspective this would make
>sense if arrays were immutable, but since mask is settable in
>MaskedArray, making it constant in the subclass will violate the
>substitution principle.  I would not object making mask read only,
>however.
>  
>
How do you set the mask? I keep getting attribute errors when I try it. 
And unmask would be a noop on an ndarray.

>  
>
>>It may in fact be best just to do away with MaskedArray entirely, moving
>>the functionality into ndarray. That may have performance implications,
>>although I don't see them at the moment, and I don't know if there are
>>other methods/attributes that this would imply need to be moved over,
>>although it looks like just mask, filled and possibly filled_value,
>>although the latter looks a little dubious to me.
>>
>>    
>>
>I think MA can coexist with ndarray and share the interface.  Ndarray
>can use special bit-patterns like IEEE NaN to indicate missing
>floating point values. Add-on modules can redefine arithmetic to make
>INT_MIN behave as a missing marker for signed integers (R, K and J (I
>think) languages use this approach).  Applications that need missing
>values support across the board will use MA.
>
>
>  
>
>>Either of the above two options would certainly improve the quality of
>>MaskedArray. Copy for instance seems not to have been implemented, and
>>who knows what other dark corners remain unexplored here.
>>
>>    
>>
>More (corners) than you want to know about! Reimplementing MA in C
>would be a worthwhile goal (and what you suggest seems to require just
>that), but it is too big of a project.  I suggest that we focus on the
>interface first.  If existing MA interface is rejected (which is
>likely) for ndarray, we can easily experiment with the alternatives
>within MA, which is pure python.
>  
>
Perhaps MaskedArray should inherit from ndarray for the time being. Many 
of the methods would need to be reimplemented anyway, but it would make 
asanyarray work. Someone was just complaining about asarray munging his 
arrays. That's correct behaviour, but it would be nice if asanyarray did 
the right thing. I suppose we could just special case asanyarray to 
ignore MaskedArrays; that might be better since it's less constraining 
from an implementation side too.
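
To make the asarray/asanyarray difference concrete, here is a small 
sketch. It assumes current NumPy, where MaskedArray is an ndarray 
subclass (the arrangement floated above); the sample values are 
arbitrary.

```python
import numpy as np
import numpy.ma as ma

m = ma.array([1, 2, 3], mask=[False, True, False])

# asarray always returns a base-class ndarray, so the mask is dropped.
plain = np.asarray(m)

# asanyarray lets ndarray subclasses pass through untouched.
kept = np.asanyarray(m)
```

With inheritance in place, no special casing of asanyarray is needed for 
the masked array to survive the call.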

>>There's a whole spectrum of possibilities here from ones that don't
>>intrude on ndarray at all to ones that profoundly change it. Sasha's
>>suggestion looks like it's probably the simplest thing in the short
>>term, but I don't know that it's the best long term solution. I think it
>>needs more thought and discussion, which is after all what Sasha asked
>>for ;)
>>    
>>
>
>Exactly!
>  
>
This may be an opportune time to propose something that's been cooking in 
the back of my head for a week or so now: a stripped-down array 
superclass. The details of this are not at all locked down, but here's a 
strawman proposal.

    We add an array superclass, call it basearray, that has the same
    C-structure as the existing ndarray. However, it has *no* methods or
    attributes. It's simply a big blob of data. Functions that work on
    the C structure of arrays (ufuncs, etc.) would still work on these
    arrays, as would asarray, so it could be converted to an ndarray as
    necessary. In addition, we would supply a minimal set of functions
    that would operate on this object. These functions would be chosen
    so that the current array interface could be implemented on top of
    them and the basearray object in pure python. These functions would
    be things like set_shape(a, shape), etc. They would be segregated
    off in their own namespace, not in the numpy core. [Note that I'm
    not proposing we actually implement ndarray this way, just that we
    make it possible]. This leads to several useful outcomes.
        1. If we're careful, this could be the basic array object that
    we propose, at least for the first round, for inclusion in the
    Python core. It's not useful for anything but passing data between
    various applications that understand the data structure, but that in
    itself could be a huge win. And the fact that it's dirt simple would
    probably be an advantage to getting it into the core.
        2. It provides a useful marker class. MA could inherit from it
    (and use itself for its data attribute) and then asanyarray would
    behave properly. MA could also use this, or a subclass, as the mask
    object, preventing anyone from accidentally using it as data (they
    could always use it on purpose with asarray).
        3. It provides a platform for people to build other,
    ndarray-like classes in pure Python. This is my main interest. I've
    put together a thin shell over numpy that strips it down to its
    absolute essentials, including a stripped-down version of ndarray that
    removes most of the methods. All of the __array_wrap__[1] stuff
    works quite well most of the time, but there are still some issues
    with being a subclass when this particular class is conceptually a
    superclass. If we had an array superclass of some sort, I believe
    that these would be resolved.
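
A very rough pure-Python mock-up of the shape of this strawman follows. 
Everything in it is hypothetical: the real basearray would be a C type 
sharing ndarray's structure, whereas this stand-in wraps an ndarray and 
exposes it through __array_interface__, and get_shape, set_shape and 
SimpleArray are names invented only for illustration.

```python
import numpy as np


class basearray:
    """Hypothetical data-only array: no public methods or attributes."""

    __slots__ = ("_buf",)

    def __init__(self, data):
        self._buf = np.array(data)

    @property
    def __array_interface__(self):
        # Expose the raw buffer so asarray and ufuncs can consume it.
        return self._buf.__array_interface__


# Free functions that would live in their own namespace, not numpy core.
def get_shape(a):
    return a._buf.shape


def set_shape(a, shape):
    a._buf = a._buf.reshape(shape)


class SimpleArray(basearray):
    """A pure-Python array-like class layered on the functions above."""

    @property
    def shape(self):
        return get_shape(self)

    @shape.setter
    def shape(self, value):
        set_shape(self, value)


a = SimpleArray([[1, 2, 3], [4, 5, 6]])
a.shape = (3, 2)          # implemented via the free function set_shape
as_nd = np.asarray(a)     # converts to a real ndarray on demand
plus_one = np.add(a, 1)   # ufuncs still work on the bare data
```

The point of the sketch is only the layering: a method-free base object, 
a small set of functions that act on it, and an array-like class built in 
Python on top of both.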

In principle at least, this shouldn't be that hard. I think it should 
mostly be rearranging some code and adding some wrappers to existing 
functions. That's in principle. In practice, I'm not certain yet, as I 
haven't investigated the code in question in much depth. I've been 
meaning to write this up into a more fleshed-out proposal, but I got 
distracted by the whole Protocol discussion on python-dev3000. This 
writeup is pretty weak, but hopefully you get the idea.

Anyway, this is something that I would be willing to put some time into, 
and it would benefit both me and probably the MA folks as well.

Regards,

-tim