Tim Hochberg wrote:
Eric Firing wrote:
Sasha wrote:
On 4/7/06, *Tim Hochberg*
<mailto:tim.hochberg@cox.net> wrote: ... In general, I'm skeptical of adding more methods to the ndarray object -- there are plenty already.
I've also proposed to drop "fill" in favor of optimizing x[...] = <scalar>. Having both "fill" and "filled" in the interface is plain awkward. You may like the combined proposal better because it does not change the total number of methods :-)
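For reference, a minimal sketch of the two spellings being compared, assuming a current numpy install (the array names are illustrative only). Both assign a scalar to every element in place; the sketch says nothing about their relative speed.

```python
import numpy as np

# Two identical arrays, so each spelling can be applied to its own copy.
x = np.arange(6, dtype=float).reshape(2, 3)
y = x.copy()

x.fill(0.0)    # the "fill" method: in-place scalar fill
y[...] = 0.0   # Ellipsis assignment: also an in-place scalar fill

# Both spellings leave the arrays with the same contents.
same_result = np.array_equal(x, y)
```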
In addition, it appears that both the method and function versions of filled are "dangerous" in the sense that they sometimes return the array itself and sometimes a copy.
This is true in ma, but may certainly be changed.
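The copy-or-not behaviour described above can be checked with a short sketch. It assumes the present-day numpy.ma module rather than the 2006 ma under discussion, so the details may differ from what the posters saw; the variable names are illustrative.

```python
import numpy as np
import numpy.ma as ma

# One masked array with nothing masked, one with a masked element.
no_mask = ma.array([1.0, 2.0, 3.0])
some_masked = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

r1 = no_mask.filled(0.0)
r2 = some_masked.filled(0.0)

# Does the result share memory with the original data buffer?
shares_unmasked = np.shares_memory(r1, no_mask.data)
shares_masked = np.shares_memory(r2, some_masked.data)
```

When the result shares memory with the input, writing to it also changes the original array, which is the "danger" being pointed out.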
Finally, changing ndarray to support masked array feels a bit like the tail wagging the dog.
I disagree. Numpy is pretty much alone among the array languages because it does not have "native" support for missing values. For the floating point types some rudimentary support for nans exists, but is not really usable. There is no missing-value mechanism for integer types. I believe adding "filled" and maybe "mask" to ndarray (not necessarily under these names) could be a meaningful step towards "native" support for missing values.
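A small sketch of the gap being described, assuming a current numpy (np.nanmean postdates this thread, and the names below are illustrative). Floats can carry nan as a stand-in for "missing", integers cannot, and a mask works for either type.

```python
import numpy as np
import numpy.ma as ma

# Floats: nan can mark a missing value, but it poisons ordinary reductions.
f = np.array([1.0, np.nan, 3.0])
plain_mean_is_nan = bool(np.isnan(f.mean()))
nan_mean = float(np.nanmean(f))   # a nan-aware reduction skips the nan

# Integers: there is no nan to store, so the assignment is rejected.
i = np.array([1, 2, 3])
try:
    i[1] = np.nan
    int_rejects_nan = False
except ValueError:
    int_rejects_nan = True

# A mask marks the missing element without needing a special value.
mi = ma.array([1, 2, 3], mask=[False, True, False])
masked_mean = float(mi.mean())
```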
I agree strongly with you, Sasha. I get the impression that the world of numerical computation is divided into those who work with idealized "data", where nothing is missing, and those who work with real observations, where there is always something missing.
I think your experience is clouding your judgement here. Or at least this comes off as unnecessarily pejorative. There's a large class of people who work with data that doesn't have missing values either because of the nature of data acquisition or because they're doing simulations. I take zillions of measurements with digital oscilloscopes and they *never* have missing values. Clipped values, yes, but even if I somehow could query the scope about which values were actually clipped or simply make an educated guess based on their value, the facilities of ma would be useless to me. The clipped values are what I would want in any case. I also do a lot of work with simulations derived from this and other data. I don't come across missing values here but again, if I did, the way ma works would not help me. I'd have to treat them either by rejecting the data outright or by some sort of interpolation.
Tim, The point is well-taken, and I apologize. I stated my case badly. (I would be delighted if I did not have to be concerned with missing values -- they are a pain regardless of how well a numerical package handles them.)
As an oceanographer, I am solidly in the latter category. If good support for missing values is not built in, it has to be bolted on, and it becomes clunky and awkward.
This may be a false dichotomy. It's certainly not obvious to me that this is so. At least if "bolted on" means "not adding a filled method to ndarray".
I probably overstated it, but I think we actually agree. I intended to lend support to the priority of making missing-value support as seamless and painless as possible. It will help some people, and not others.
I was reluctant to speak up about this earlier because I thought it was too much to ask of Travis when he was in the midst of putting numpy on solid ground. But I am delighted that missing value support has a champion among numpy developers, and I agree that now is the time to change it from "bolted on" to "integrated".
I have no objection to ma support improving. In fact I think it would be great, although I don't foresee it helping me anytime soon. I also support Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably seamlessly.
However, I do think the situation needs more thought. Slapping filled and mask onto ndarray is the path of least resistance, but it's not clear that it's the best one.
If we do decide we are going to add both of these methods to ndarray (with filled returning a copy!), then it may be worth considering making ndarray a subclass of MaskedArray. Conceptually this makes sense, since at this point an ndarray will just be a MaskedArray where mask is always False. I think that they could share much of the implementation except that ndarray would be set up to use methods that ignored the mask attribute since they would know that it's always False. Even that might not be worth it, since the check for whether mask is True/False is just a pointer compare.
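The "pointer compare" can be sketched with the current numpy.ma module, on the assumption that its nomask sentinel plays the role described here; the 2006 implementation may have differed. The helper has_mask is hypothetical and exists only for this illustration.

```python
import numpy as np
import numpy.ma as ma

plain = np.array([1.0, 2.0, 3.0])
unmasked = ma.array([1.0, 2.0, 3.0])
masked = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

def has_mask(a):
    """Hypothetical helper: True if `a` carries a real mask array.

    ma.getmask returns the nomask sentinel for plain ndarrays and for
    masked arrays created without a mask, so an identity test is enough
    and no element of the array has to be inspected.
    """
    return ma.getmask(a) is not ma.nomask

flags = [has_mask(plain), has_mask(unmasked), has_mask(masked)]
```

In this sketch a plain ndarray and an unmasked MaskedArray are indistinguishable to the check, which is the sense in which an ndarray behaves like a MaskedArray whose mask is always False.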
It may in fact be best just to do away with MaskedArray entirely, moving the functionality into ndarray. That may have performance implications, although I don't see them at the moment. I also don't know if there are other methods/attributes that this would imply need to be moved over; it looks like just mask, filled and possibly filled_value, although the latter looks a little dubious to me.
This is exactly the option that I was afraid to bring up because I thought it might be too disruptive, and because I am not contributing to numpy, and probably don't have the competence (or time) to do so.
Either of the above two options would certainly improve the quality of MaskedArray. Copy for instance seems not to have been implemented, and who knows what other dark corners remain unexplored here.
There's a whole spectrum of possibilities here from ones that don't intrude on ndarray at all to ones that profoundly change it. Sasha's suggestion looks like it's probably the simplest thing in the short term, but I don't know that it's the best long term solution. I think it needs more thought and discussion, which is after all what Sasha asked for ;)
Exactly! Thank you for broadening the discussion. Eric