Re: [Numpy-discussion] Re: ndarray.fill and ma.array.filled

7 Apr 2006

      Tim Hochberg wrote:
...
Eric Firing wrote:
...
Sasha wrote:
...
On 4/7/06, *Tim Hochberg* <tim.hochberg@cox.net> wrote:
...
    In general, I'm skeptical of adding more methods to the ndarray object
    -- there are plenty already.
I've also proposed to drop "fill" in favor of optimizing x[...] = 
<scalar>.  Having both "fill" and "filled" in the interface is plain 
awkward.  You may like the combined proposal better because it does 
not change the total number of methods :-)
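For readers comparing the two spellings, here is a minimal sketch using present-day NumPy (which postdates this thread). It only shows that both forms set every element in place; it makes no claim about their relative speed, which is the point under discussion. The array names are illustrative.

```python
import numpy as np

x = np.empty((2, 3), dtype=float)
y = np.empty((2, 3), dtype=float)

# Method form: set every element of x in place.
x.fill(7.0)

# Indexing form: assigning a scalar through Ellipsis broadcasts it
# over the whole array, also in place.
y[...] = 7.0

# Both arrays now hold the same values.
print(np.array_equal(x, y))
```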
    In addition, it appears that both the method and function versions of
    filled are "dangerous" in the sense that they sometimes return the
    array itself and sometimes a copy.
This is true in ma, but may certainly be changed.
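The copy-versus-self concern can be probed with a short sketch, assuming present-day numpy.ma rather than the 2006 ma module. Whether the unmasked case aliases the original buffer may depend on the NumPy version, so the script reports it rather than assuming an answer; the variable names are illustrative.

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0])

unmasked = ma.array(data)                           # no mask supplied
masked = ma.array(data, mask=[False, True, False])  # one element masked

out_unmasked = unmasked.filled(0.0)
out_masked = masked.filled(0.0)

# Does each result share memory with the masked array's underlying data?
print("no mask  :", np.shares_memory(out_unmasked, unmasked.data))
print("with mask:", np.shares_memory(out_masked, masked.data))

# Defensive pattern when the caller intends to modify the result:
# force a copy so the original data can never be touched.
safe = np.array(unmasked.filled(0.0), copy=True)
safe[0] = -1.0
```

Forcing the copy costs one extra allocation in the cases where filled already copied, which is the price of not having to know which case applies.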
    Finally, changing ndarray to support masked array feels a bit like the
    tail wagging the dog.
I disagree. Numpy is pretty much alone among the array languages 
because it does not have "native" support for missing values. For 
the floating point types some rudimentary support for nans exists, 
but it is not really usable.  There is no missing-value mechanism for 
integer types.  I believe adding "filled" and maybe "mask" to ndarray 
(not necessarily under these names) could be a meaningful step 
towards "native" support for missing values.
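The float/integer asymmetry described above can be illustrated with a small sketch, again assuming present-day NumPy and numpy.ma. The sentinel value -999 is an arbitrary choice for the example, not something proposed in the thread.

```python
import numpy as np
import numpy.ma as ma

# Floats: NaN can stand in for "missing", but it propagates through
# ordinary reductions, so the sum is itself NaN.
f = np.array([1.0, np.nan, 3.0])
print(f.sum())

# Integers: there is no NaN, so a separate boolean mask marks the
# missing entry. Reductions on the masked array skip it.
i = ma.array([1, 2, 3], mask=[False, True, False])
print(i.sum())

# filled() turns the masked array back into a plain ndarray, with the
# masked slot replaced by the chosen sentinel.
print(i.filled(-999))
```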
I agree strongly with you, Sasha.  I get the impression that the world 
of numerical computation is divided into those who work with idealized 
"data", where nothing is missing, and those who work with real 
observations, where there is always something missing.
I think your experience is clouding your judgement here. Or at least 
this comes off as unnecessarily pejorative. There's a large class of 
people who work with data that doesn't have missing values either 
because of the nature of data acquisition or because they're doing 
simulations. I take zillions of measurements with digital oscilloscopes 
and they *never* have missing values. Clipped values, yes, but even if I 
somehow could query the scope about which values were actually clipped 
or simply make an educated guess based on their value, the facilities of 
ma would be useless to me. The clipped values are what I would want in 
any case.  I also do a lot of work with simulations derived from this 
and other data. I don't come across missing values here but again, if I 
did, the way ma works would not help me. I'd have to treat them either 
by rejecting the data outright or by some sort of interpolation.
Tim,

The point is well-taken, and I apologize.  I stated my case badly.  (I 
would be delighted if I did not have to be concerned with missing 
values -- they are a pain regardless of how well a numerical package 
handles them.)
...
...
As an oceanographer, I am solidly in the latter category.  If good 
support for missing values is not built in, it has to be bolted on, 
and it becomes clunky and awkward.
This may be a false dichotomy. It's certainly not obvious to me that 
this is so. At least if "bolted on" means "not adding a filled method to 
ndarray".
I probably overstated it, but I think we actually agree.  I intended to 
lend support to the priority of making missing-value support as seamless 
and painless as possible.  It will help some people, and not others.
...
...
I was reluctant to speak up about this earlier because I thought it 
was too much to ask of Travis when he was in the midst of putting 
numpy on solid ground.  But I am delighted that missing value support 
has a champion among numpy developers, and I agree that now is the 
time to change it from "bolted on" to "integrated".
I have no objection to ma support improving. In fact I think it would be 
great, although I don't foresee it helping me anytime soon. I also support 
Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably 
seamlessly.
However, I do think the situation needs more thought. Slapping filled 
and mask onto ndarray is the path of least resistance, but it's not 
clear that it's the best one.
If we do decide we are going to add both of these methods to ndarray 
(with filled returning a copy!), then it may be worth considering making 
ndarray a subclass of MaskedArray. Conceptually this makes sense, since 
at this point an ndarray will just be a MaskedArray where mask is always 
False. I think that they could share much of the implementation except 
that ndarray would be set up to use methods that ignored the mask 
attribute since they would know that it's always false. Even that might 
not be worth it, since the check for whether mask is True/False is just 
a pointer compare.
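The "pointer compare" idea can be sketched with the nomask sentinel that present-day numpy.ma exposes. This is only an illustration under that assumption, not the design proposed here: the helper total is hypothetical, and it treats a plain ndarray as an array whose mask is always the sentinel.

```python
import numpy as np
import numpy.ma as ma

plain = ma.array([1, 2, 3])                    # no mask supplied
partly = ma.array([1, 2, 3], mask=[0, 1, 0])   # middle element masked

def total(a):
    """Hypothetical helper: sum with a fast path when nothing is masked."""
    # getmask returns the nomask sentinel for unmasked masked arrays and
    # for plain ndarrays, so an identity test is enough to pick the path.
    if ma.getmask(a) is ma.nomask:
        return np.asarray(a).sum()
    # Otherwise fall back to the mask-aware reduction.
    return a.sum()

print(total(np.array([1, 2, 3])), total(plain), total(partly))
```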
It may in fact be best just to do away with MaskedArray entirely, moving 
the functionality into ndarray. That may have performance implications, 
although I don't see them at the moment. I also don't know whether other 
methods/attributes would need to be moved over; it looks like just mask, 
filled and possibly filled_value, though the latter looks a little 
dubious to me.
This is exactly the option that I was afraid to bring up because I 
thought it might be too disruptive, and because I am not contributing to 
numpy, and probably don't have the competence (or time) to do so.
...
Either of the above two options would certainly improve the quality of 
MaskedArray. Copy for instance seems not to have been implemented, and 
who knows what other dark corners remain unexplored here.
There's a whole spectrum of possibilities here from ones that don't 
intrude on ndarray at all to ones that profoundly change it. Sasha's 
suggestion looks like it's probably the simplest thing in the short 
term, but I don't know that it's the best long term solution. I think it 
needs more thought and discussion, which is after all what Sasha asked 
for ;)
Exactly!  Thank you for broadening the discussion.

Eric