In an ideal world, any function that accepts ndarray would accept ma.array and vice versa. Moreover, if the ma.array has no masked elements and the same data as an ndarray, the result should be the same. Obviously the current implementation falls short of this goal, but there is one feature that seems to make this goal unachievable. This feature is the "filled" method of ma.array. Pydoc for this method reports the following:

```
filled(self, fill_value=None)
    A numeric array with masked values filled. If fill_value is None,
    use self.fill_value().

    If mask is nomask, copy data only if not contiguous.
    Result is always a contiguous, numeric array.
    # Is contiguous really necessary now?
```

That is not the best possible description ("filled" is "filled"), but the essence is that the result of a.filled(value) is a contiguous ndarray obtained from the masked array by copying non-masked elements and using value for masked values.

I would like to propose to add a "filled" method to ndarray. I see several possibilities and would like to hear your opinion:

1. Make filled simply return self.
2. Make filled return a contiguous copy.
3. Make filled replace nans with the fill_value if the array is of floating-point type.

Unfortunately, adding "filled" will result in a rather confusing situation where "fill" and "filled" both exist and have very different meanings. I would like to note that "fill" is a somewhat odd ndarray method. AFAICT, it is the only non-special method that mutates the array. It appears to be just a performance trick: the same result can be achieved with "a[...] = <scalar>".
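For concreteness, here is a small sketch (using today's numpy.ma names; the fill value 0.0 is arbitrary) of the existing ma behavior, and of what option 3 would mean for a plain ndarray:

```python
import numpy as np

# Existing ma behavior: filled() returns a plain ndarray with the
# masked slots replaced by the fill value.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
a = m.filled(0.0)
print(a)  # [1. 0. 3.]

# Option 3, sketched for a plain ndarray: treat nan as the "masked"
# value and substitute the fill value on request.
x = np.array([1.0, np.nan, 3.0])
filled = np.where(np.isnan(x), 0.0, x)
print(filled)  # [1. 0. 3.]
```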
Sasha wrote:
This would be *very* nice.
It seems to me that any function or method that returns an array from an array should be perfectly consistent and explicit about whether it makes a copy or not. Sometimes the filled method *needs* to return a copy; therefore it should *always* return a copy, regardless of the presence or state of masking. Hence I think the filled method of ma needs to be changed in this way also. The question for your suggestion 3 is, should a nan always be the equivalent of a masked value? One loses a little flexibility, but it has an appealing simplicity to it. I could be persuaded otherwise, but right now I would vote for it. Eric
Tim Hochberg wrote:
Tim makes a good point here. Should the reshape method be fixed to always return a copy? The semantics a.shape = (...) could still be used to re-shape contiguous arrays where possible. However, whether or not reshape returns a copy is consistent (but perhaps not explicitly explained). We will still have .ravel() which sometimes copies and sometimes doesn't. -Travis
On 3/22/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Reshape never copies the data:
The only inconsistency is that
I agree that this is unnecessary, but don't see much of a problem. +0 here
However, whether or not reshape returns a copy is consistent (but perhaps not explicitly explained).
To me, consistency means "is independent of the input." Whether or not reshape creates a new python object depends on the value of the argument. I would call that an inconsistency.
We will still have .ravel() which sometimes copies and sometimes doesn't.
Ravel should be a shortcut for x.reshape((x.size,)), so it is really the same question. +0 (to make ravel always return a new python object)
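The value-dependent behavior under discussion is easy to demonstrate (a sketch; np.shares_memory is used only to check whether a copy was made):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Reshaping a contiguous array returns a view of the same data...
v = a.reshape(6)
print(np.shares_memory(a, v))  # True

# ...but reshaping a non-contiguous array (here, a transpose)
# silently returns a copy.
c = a.T.reshape(6)
print(np.shares_memory(a, c))  # False
```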
Travis Oliphant wrote:
And so articulately stated too ;)
My opinion is that all methods and functions should either:

1. Always return a copy.
2. Always return a view.
3. Return a view if possible, otherwise raise an exception.

So, like Sasha, I'd like to see ravel changed as well. I don't really care if it's to 1 or 3, though. -tim
Tim Hochberg wrote:
Well, but is copy/view the /only/ invariant worth guaranteeing? I think there is a valid need for functions which ensure other invariants, such as contiguity. There are applications (such as passing pointers to C/Fortran libraries which don't have striding mechanisms but will not modify their inputs) which require contiguous inputs, but where one would rather make a copy only if necessary. My take on this is that we should /document/ clearly what invariants any given function satisfies, but I think the 'always view/always copy' view excludes an important usage case. There may be others beyond contiguity, but that's the one that pops immediately to mind. Cheers, f
Fernando Perez wrote:
This is a different case, I think. The result of this copy is not user visible except in terms of performance. I'm only concerned with functions that *return* copies or views depending on the input. I don't care if a function sometimes makes a copy under the covers but doesn't return it.
I don't think we're in disagreement here, although I'm not sure. I will add, on the subject of contiguity, that I think there should be a function 'ascontiguous' that parallels asarray, but ensures that the result is contiguous. Although this sometimes returns a copy, I think that's OK since that's its job. I would like to see all of the implicit copying pushed into functions like asarray and ascontiguous. This also helps efficiency. Imagine I have some calls to functions that require contiguous arrays and do copies under the covers if their args are not contiguous. In that case:

```python
a = ascontiguous(a)
param1 = computeSomethingOnContiguousData(a)
param2 = computeSomethingElseOnContiguousData(a)
# etc.
```

will be much more efficient than the equivalent code without the ascontiguous when the initial value of a is discontiguous. Regards, -tim
Tim Hochberg wrote:
I think we agree: something like ascontiguous() is precisely what I had in mind (I think that's what ravel() does today, but I'm fine if it gets a new name, as long as the functionality exists). Obviously a function like this should explicitly (docstring) say that it does NOT make any guarantees about whether its return value is a view or a copy, just that it's contiguous. Cheers, f
Fernando Perez wrote:
a.ravel() seems to be equivalent to reshape(a, [-1]). That is, it returns a flattened, contiguous copy. ascontiguous(a) would be slightly different in that it would preserve the shape of a. In fact, I think it would look a lot like:

```python
def ascontiguous(a):
    """ascontiguous(a) -> contiguous representation of a.

    If 'a' is already contiguous, it is returned unchanged.
    Otherwise, a contiguous copy is returned.
    """
    a = asarray(a)
    if not a.flags['CONTIGUOUS']:
        a = array(a)
    return a
```
I agree. Regards, -tim
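Modern numpy ships essentially this function as np.ascontiguousarray; a quick check that it behaves as the ascontiguous sketch describes:

```python
import numpy as np

a = np.arange(6).reshape(2, 3).T   # a transpose: not C-contiguous
print(a.flags['C_CONTIGUOUS'])     # False

b = np.ascontiguousarray(a)        # not contiguous, so this copies
print(b.flags['C_CONTIGUOUS'])     # True

c = np.ascontiguousarray(b)        # already contiguous: returned unchanged
print(c is b)                      # True
```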
I was just looking at the interface for array and asarray to see what other stuff should go in the interface of the hypothetical ascontiguous. There's 'dtype', which I knew about, and 'fortran', which I didn't, but which makes sense. However, there's also 'ndmin'. First off, it's not described in the docstring for asarray, but I was able to find it in the docstring for array without a problem. Second, is it really necessary? It seems to be useful in an awfully narrow set of circumstances, particularly since when you are padding axes not everyone wants to pad to the left. It would seem more useful to ditch ndmin and have some sort of paddims function that was more full-featured (padding to either the left or the right at a minimum). I'm not entirely sure what the best interface to such a beast would look like, but a simple tactic would be to just provide leftpaddims and rightpaddims. If it's not already clear by now (;), I prefer several narrow interfaces to one broad one. -tim
Tim Hochberg wrote:
Padding to the left is the "default" behavior for broadcasting, and so it seems appropriate. This is how all lower-dimensional arrays are interpreted as "higher"-dimensional arrays throughout the code. The ndmin is very handy, as attested to by the uses of atleast_1d or atleast_2d in numpy library code. It was added later as an optimization step because of the number of library routines that were using it. I've since used it several times to simplify code. I think an ascontiguous on the Python level is appropriate, since such a beast exists on the C level. On the other hand, while Tim prefers narrow interfaces, the array_from_object interface is traditionally pretty broad. Thus, in my mind, the array call should get another flag keyword that forces a contiguous result. This translates easily to the C domain, in much the same way as the fortran keyword does. -Travis
Travis Oliphant wrote:
That makes some sense.
OK, I'll take your word for it.
This doesn't bother me since I long ago gave up any hope that the array constructor would have a narrow interface.
This translates easily to the C-domain, in much the same way as the fortran keyword does.
I'll buy that. While I accept that array() needs a wide interface, I still prefer to keep as many other interfaces as possible narrow. In particular, is ndmin widely used in asarray? Or do the library routines generally use array instead? Given the choice, I'd sweep as much of the dust, AKA wideness, into array() as possible, since that's irredeemably wide anyway, and keep the other interfaces as narrowly focused as possible. Put another way, asarray and ascontiguous are about clarity of intent. With too much extra baggage, the intent becomes obscured. The coupling seems tight enough for dtype and fortran, but once you get to ndmin, it seems that you might as well go with the big guns and break out "array(x, copy=False, ndmin=n)". That's my $0.02 on this subject and I'll shut up about it now. -tim
Tim Hochberg wrote:
Not necessarily sane advice :-) --- I might be overstating things. I know that atleast_1d and atleast_2d are used all over the place in scipy. This makes sense and I can certainly understand it. I'm willing to modify things to give narrow interfaces. Right now, since requesting both fortran and contiguous does not make sense, setting the fortran flag to False enforces C-style contiguity while setting it to True enforces Fortran-style. Setting it to None (the default) specifies that you don't care, and the behavior will be to create C-style contiguous data for new arrays and to use the striding specified by the array if it's already an array. I admit that it is arguable whether or not the fortran flag should be overloaded like this. There are now ascontiguous and asfortran functions with fairly minimal interfaces to make it simpler. There is also a check to make sure that array is called with no more than 2 non-keyword arguments. Thus, you won't be able to confuse which flag is which. -Travis
On 3/23/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Thus, in my mind, the array call should get another flag keyword that forces a contiguous result.
Please don't! The fortran flag is bad enough, but has too much history behind it. Let's not breed boolean parameters. Sooner or later someone will use keyword arguments positionally and you will end up guessing what array([1,2], int8_, 1, 1, 0, 0) means.
Sasha wrote:
There are several boolean flags in the interface already. Adding another one won't change the current situation that you describe. There are several ways to handle this. For one, we could force the use of keyword arguments, so that the position problem does not arise. Sasha has mentioned in the past a strides array argument, but I think the default fortran and contiguous striding cases need better support than just being one of many possible stridings, so I wouldn't go that direction here. I'm debating whether or not the fortran flag should be used to specify both the contiguous and fortran cases. Right now, the fortran argument is a three-case flag with don't-care, True, and False values. It seems natural to have True mean force-Fortran and False mean force-contiguous, with don't-care (the default) meaning take an array as already given (or create a C-contiguous array if we are generating a new array from another object). At any rate, if the fortran flag is there, we need to specify the contiguous case as well. So, either propose a better interface (we could change it still --- the fortran flag doesn't have that much history) to handle the situation or accept what I do ;-) -Travis
On 3/23/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Let me try. I propose to eliminate the fortran flag in favor of a more general "strides" argument. This argument can be either a sequence of integers that becomes the strides, or a callable object that takes shape and dtype arguments and returns a sequence that becomes the strides. For Fortran and C order, functions that generate the appropriate stride sequences should be predefined to enable array(..., strides=fortran, ...) and array(..., strides=contiguous, ...).
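A sketch of the callable-strides idea (the names c_strides and fortran_strides are made up for illustration; they reproduce the stridings that the fortran flag currently selects between):

```python
import numpy as np

def c_strides(shape, itemsize):
    """C order: the last axis varies fastest."""
    strides = [0] * len(shape)
    acc = itemsize
    for i in reversed(range(len(shape))):
        strides[i] = acc
        acc *= shape[i]
    return tuple(strides)

def fortran_strides(shape, itemsize):
    """Fortran order: the first axis varies fastest."""
    strides = [0] * len(shape)
    acc = itemsize
    for i in range(len(shape)):
        strides[i] = acc
        acc *= shape[i]
    return tuple(strides)

# Both agree with what numpy actually produces for a 2x3 int64 array.
print(c_strides((2, 3), 8))        # (24, 8)
print(fortran_strides((2, 3), 8))  # (8, 16)
```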
Sasha wrote:
I like the idea of being able to create an array with custom strides. The applications aren't entirely clear yet, but it does seem like it could have some interesting and useful consequences. That said, I don't think this belongs in 'array'. Historically, array has been used for all sorts of array creation activities, which is why it always seems to have a wide, somewhat incoherent interface. However, most uses of array() boil down to one thing: creating a *new* array from a python object. My preference would be to focus on that functionality for array() and spin off its other historical uses, and new uses like this custom-strided array stuff, into separate factory functions. For example (and just for example, I make no great claims for either this name or interface):

```python
a = array_from_data(a_buffer_object, dtype, dims, strides)
```

One thing that you do make clear is that contiguous and fortran should really be two values of the same flag. If you combine this with one other simplification (array() always copies), we end up with a nice thin interface:

```python
# Create a new array in 'order' order. Defaults to "C" order.
array(object, dtype=None, order="C"|"FORTRAN")

# Returns an array. If object is an array and the order is satisfied,
# return object; otherwise return a new array. If order is set, the
# returned array will be contiguous and have that ordering.
asarray(object, dtype=None, order=None|"C"|"FORTRAN")

# Just the same, but allow subtypes.
asanyarray(object, dtype=None, order=None|"C"|"FORTRAN")
```

You could build asarray, asanyarray, etc. on top of the proposed array without problems by using type(object) == ndarray and isinstance(object, ndarray) respectively. Stuff like convenience functions for ndmin would also be easy to build on top of these. This looks great to me (pre-coffee). Embrace simplicity: you have nothing to lose but your clutter ;) Regards, -tim
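This is close to the order= semantics that numpy eventually adopted; a sketch of the "return object if the order is satisfied, otherwise a new array" behavior using today's np.asarray:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)           # C-ordered ndarray

# Order already satisfied: the object itself comes back.
print(np.asarray(a, order='C') is a)     # True

# Order not satisfied: a new Fortran-ordered array is created.
f = np.asarray(a, order='F')
print(f is a)                            # False
print(f.flags['F_CONTIGUOUS'])           # True
```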
Tim Hochberg wrote:
Please see the transpose example above.
I feel that [***] above is much cleaner than this. I suggest that string constants be deprecated.
If [***] above were adopted, it would still be helpful to adopt numarray's iscontiguous method, or better, use a property. colin W.
Colin J. Williams wrote:
This is true, but irrelevant. To the best of my knowledge, the only reason to force an array to be in a specific order is to pass it to a C function that expects either FORTRAN- or C-ordered arrays. And, in that case, the array also needs to be contiguous. So, for the purposes of creating arrays (and of ascontiguous), the only cases that matter are arrays that are both contiguous and in the specified order. Thus, specifying contiguity and order separately to the constructor needlessly complicates the interface. Or, since I'm feeling jargon-happy today, YAGNI.
I'm no huge fan of string constants myself, but I think you need to think this through more. First off, the interface I tossed off above doesn't cover the same ground as array, since it works off an already-created buffer object. That means you'd have to go through all sorts of contortions and do at least one copy to get data into Fortran order. You could allow arbitrary, 1D, python sequences instead, but that doesn't help the common case of converting a 2D python object into a 2D array. You could allow N-D python objects, but then you have two ways of specifying the dims of the object and things become a big crufty mess. Compared to that, string constants are great.
-0. In my experience, 99% of my use cases would be covered by ascontiguous, and for the remaining 1% I'm happy to use a.flags.contiguous. Regards, -tim
Tim Hochberg wrote:
Removing the copy flag will break a lot of code because it's been around for a long time. This is also not an "easy thing" to add to convertcode.py though I suppose 90% of the cases could be found. We would also have to re-write asarray to be an additional C-function to make it not copy but make array copy. So, for now I'm not as enthused about that idea. -Travis
Travis Oliphant wrote:
Great!
I kinda figured on that. But I figured I'd propose my favorite and see what came of it.
We would also have to re-write asarray to be an additional C-function to make it not copy but make array copy.
I thought so too at first, but I don't think this is so. Untested, and could probably be cleaned up some:

```python
def asarray(obj, order=None):
    if type(obj) == ndarray:
        if order:
            if not obj.flags.contiguous:
                return array(obj, order=order)
            if order == "C" and obj.flags.fortran:
                return array(obj, order=order)
            if order == "FORTRAN" and not obj.flags.fortran:
                return array(obj, order=order)
        return obj
    else:
        if order:
            return array(obj, order=order)
        else:
            return array(obj)
```

For asanyarray, simply replace the type test with an isinstance test.
So, for now I'm not as enthused about that idea.
Yeah. Without backward compatibility constraints I'm convinced that it's the right thing to do, but I realize there is a need to balance making the transition manageable with making things "perfect". -tim
Tim Hochberg wrote:
And this is now done... So, thankfully, the fortran= keyword is gone and replaced with the more sensible order= keyword. Tests for numpy pass, but any other code that used fortran= will need changing. Sorry about that... Thanks, -Travis
On 3/24/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
This looks very similar to the current ndarray "new" constructor:

```
ndarray.__new__(subtype, shape=, dtype=int_, buffer=None,
                offset=0, strides=None, fortran=False)

There are two modes of creating an array using __new__:
 1) If buffer is None, then only shape, dtype, and fortran
    are used.
 2) If buffer is an object exporting the buffer interface, then
    all keywords are interpreted.
The dtype parameter can be any object that can be interpreted
as a numpy.dtype object.
```

(see pydoc numpy.ndarray) I would not mind leaving array() unchanged and moving the discussion to streamlining ndarray.__new__. For example, some time ago I suggested that strides should be interpreted even if buffer=None.
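The strides keyword of this low-level constructor already works when a buffer is supplied; a small sketch reinterpreting the same bytes with Fortran-style strides:

```python
import numpy as np

buf = np.arange(6, dtype=np.int64).tobytes()   # 48 bytes holding 0..5

# Fortran-style strides for a 2x3 int64 array: (itemsize, itemsize * nrows).
a = np.ndarray(shape=(2, 3), dtype=np.int64, buffer=buf, strides=(8, 16))
print(a)
# [[0 2 4]
#  [1 3 5]]
```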
Alexander Belopolsky wrote:
It sure does.
That does look like a good place to hang any arbitrary strides creating stuff. I'm of the opinion that arguments should *never* be ignored, so I'm all for interpreting strides even when buffer is None. I'd also contend that offset should either be respected (by overallocating) or since that's probably useless, raising a ValueError when it's nonzero. Regards, Tim
Sasha wrote:
We could always force keyword parameters by using **kwd. That would mess up help() and other introspection tools. Then again, if we want to discourage overuse of this sort of stuff, perhaps that's not a bad thing ;) But if we really need access to the C guts of an array, just allow a dictionary of flags to get plugged in. These would be the same as what somearray.flags puts out. The new interface for array would be:

```python
array(object, **flags)
```

Where flags could contain:

- CONTIGUOUS: force a copy if object isn't contiguous if True. [default None]
- FORTRAN: force the array to be Fortran order if True, C order if False. [default None]
- OWNDATA: force a copy if True. [default True]
- WRITEABLE: force a copy if object isn't writeable if True. [default None]
- ALIGNED: force a copy if object isn't aligned if True. [default None]
- UPDATEIFCOPY: set the UPDATEIFCOPY flag? [default ???]

With the exception of FORTRAN, and possibly UPDATEIFCOPY, it would be an error to set any of these flags to False (forcing an array to be discontiguous, for instance, makes no sense). That's a thin interface and it ties together with the flags parameter nicely. On the downside, it's a little weird, particularly using OWNDATA for copy, although it is logical once you think about it. It also drops 'ndmin' and 'subok'. I wouldn't miss them, but I expect someone would squawk. You could shoehorn them into flags, but then you lose one of the chief virtues of this scheme, which is that it makes a strong connection between the constructor and the flags parameter. 'subok' should be pretty well taken care of by 'asanyarray', and it would be easy enough to create an auxiliary function to replicate the 'ndmin' functionality. With this kind of narrow interface, it might make sense to allow the flags parameter on all of the asX functions. That makes for a nice, uniform, easy-to-remember interface. I'm not positive I like this idea yet, but I thought it was interesting enough to throw into the ring anyway.
Tangentially, I noticed that I can set ALIGNED to False. Isn't that going to break something somewhere? Should the ALIGNED flag be readonly? -tim
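Modern numpy grew something in the spirit of the flags-dictionary idea as np.require, which copies only when the requested flags are not already satisfied (a sketch of that function, not of the interface proposed above):

```python
import numpy as np

a = np.arange(6).reshape(2, 3).T   # a transpose: not C-contiguous

# Copy only as needed to satisfy the requested flag requirements.
b = np.require(a, requirements=['C_CONTIGUOUS', 'ALIGNED'])
print(b.flags['C_CONTIGUOUS'])     # True
print(np.array_equal(a, b))        # True
```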
Tim Hochberg wrote:
Tangentially, I noticed that I can set ALIGNED to False. Isn't that going to break something somewhere? Should the ALIGNED flag be readonly?
Responding to the tangential topic :-) The ALIGNED flag can be set False because it allows one to test those sections of code that deal with misaligned data. I don't think it would break anything, because thinking that data is misaligned when it really is aligned only costs you in copy time. You can't set it TRUE if it's really mis-aligned, though, because thinking that data is aligned when it's really mis-aligned can cause segfaults on some platforms and slow code on others. -Travis
Fernando Perez wrote:
quite true.
there's asarray() of course. My feeling is that functions that may or may not return a copy should be functions, like asarray(), that ONLY exist to ensure a particular invariant:

- ascontiguous()
- asarray()

I imagine there are others. What concerns me is functions like reshape() and ravel() that you might have all sorts of other reasons to use, but then can't ever know for sure if your method is going to be working with a copy or not. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
There is the flatten() method which exists for precisely this reason (it *always* returns a copy). -Travis
On 3/23/06, Sasha <ndarray@mac.com> wrote:
I am just starting to use ma.array and would like to get some idea from those in the know of how close this is to reality. What percentage of functions designed for nd_arrays would work on a ma.array with no masked elements? That is, if you have data with missing values, but then remove the missing values, is it necessary to convert back to a standard nd_array? The statistical language R deals with missing data fairly well. There are a number of functions for dealing with missing values (fail, omit, exclude, pass). Furthermore, there is a relatively standard way for a function to handle data with missing values, via an na.action parameter which indicates which function to call. http://spider.stat.umn.edu/R/library/stats/html/na.action.html http://spider.stat.umn.edu/R/library/stats/html/na.fail.html It would be nice to have a similar set of functions (including the fill function) for numpy. These functions could return the object without change if it is not a masked array, and, if it is a masked array, make the appropriate changes to return an nd_array or raise an exception. A simple standard for indicating a function's ability to handle masked data would be to include a mask_action parameter which holds or indicates a function for processing missing data. Also, are there any current plans to allow record-type arrays to be masked? Thanks, Mike
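numpy.ma already offers rough analogues of some of these R functions; for instance, compressed() behaves like na.omit, dropping the missing entries and returning a plain ndarray (a sketch):

```python
import numpy as np

x = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, True])

# Roughly R's na.omit: drop masked entries, get an ndarray back.
dropped = np.ma.compressed(x)
print(dropped)        # [1. 3.]
print(type(dropped))  # <class 'numpy.ndarray'>

# Reductions skip the masked values, akin to na.rm=TRUE in R.
print(x.mean())       # 2.0
```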
I am posting a reply to my own post in the hope of generating some discussion of the original proposal. I am proposing to add a "filled" method to ndarray. This can be a pass-through, an alias for "copy", or a method to replace nans or some other type-specific values. This will allow code that uses "filled" to work on ndarrays without changes. On 3/22/06, Sasha <ndarray@mac.com> wrote:
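The function form np.ma.filled already has exactly this pass-through behavior for plain ndarrays, which is essentially option 1 of the proposal (sketch):

```python
import numpy as np

# For a plain ndarray, the filled() *function* is a pass-through...
a = np.array([1, 2, 3])
print(np.ma.filled(a, 0) is a)   # True

# ...while for a masked array it substitutes the fill value.
m = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(np.ma.filled(m, 0))        # [1 0 3]
```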
Sasha wrote:
In general, I'm skeptical of adding more methods to the ndarray object -- there are plenty already. In addition, it appears that both the method and function versions of filled are "dangerous" in the sense that they sometimes return the array itself and sometimes a copy. Finally, changing ndarray to support masked arrays feels a bit like the tail wagging the dog. Let me throw out an alternative proposal. I will admit up front that this proposal is based on exactly zero experience with masked arrays, so there may be some stupidities in it, but perhaps it will lead to an alternative solution:

```python
def asUnmaskedArray(obj, fill_value=None):
    mask = getattr(obj, 'mask', False)
    if mask is False:
        return obj
    if fill_value is None:
        fill_value = obj.get_fill_value()
    newobj = obj.data().copy()
    newobj[mask] = fill_value
    return newobj
```

Or something like that, anyway. This particular version should work on any array as long as, if it exports a mask attribute, it also exports get_fill_value and data. At least once any bugs are ironed out; I haven't tested it. ma would have to be modified to use this instead of using filled everywhere, but that seems more appropriate than tacking another method onto ndarray, IMO. One advantage of this approach is that most array-like objects that don't subclass ndarray will work with this automagically. If we keep expanding the methods of ndarray, it's harder and harder to implement other array-like objects, since they have to implement more and more methods, most of which are irrelevant to their particular case. The more we can implement stuff like this in terms of some relatively small set of core primitives, the happier we'll all be in the long run. This also builds on the idea of trying to push as much of the array/view ambiguity into the asXXXArray corner. Regards, -tim
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
I've also proposed to drop "fill" in favor of optimizing x[...] = <scalar>. Having both "fill" and "filled" in the interface is plain awkward. You may like the combined proposal better because it does not change the total number of methods :-)

In addition, it appears that both the method and function versions of filled are "dangerous" in the sense that they sometimes return the array itself and sometimes a copy.
This is true in ma, but may certainly be changed.
Finally, changing ndarray to support masked array feels a bit like the tail wagging the dog.
I disagree. Numpy is pretty much alone among the array languages in that it does not have "native" support for missing values. For the floating-point types some rudimentary support for nans exists, but it is not really usable. There is no missing-value mechanism for integer types. I believe adding "filled" and maybe "mask" to ndarray (not necessarily under these names) could be a meaningful step towards "native" support for missing values.
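The integer gap Sasha mentions is easy to see: floats can abuse nan as a missing-value marker, but integer dtypes have no such sentinel, so a mask is the only general option (sketch):

```python
import numpy as np

# Floating point: nan can stand in for "missing"...
f = np.array([1.0, np.nan, 3.0])
print(np.nansum(f))   # 4.0

# ...but integers have no nan, so missing values need an explicit mask.
i = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(i.sum())        # 4
```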
Sasha wrote:
I agree strongly with you, Sasha. I get the impression that the world of numerical computation is divided into those who work with idealized "data", where nothing is missing, and those who work with real observations, where there is always something missing. As an oceanographer, I am solidly in the latter category. If good support for missing values is not built in, it has to be bolted on, and it becomes clunky and awkward. I was reluctant to speak up about this earlier because I thought it was too much to ask of Travis when he was in the midst of putting numpy on solid ground. But I am delighted that missing value support has a champion among numpy developers, and I agree that now is the time to change it from "bolted on" to "integrated". Eric
Eric Firing wrote:
I think your experience is clouding your judgement here. Or at least this comes off as unnecessarily pejorative. There's a large class of people who work with data that doesn't have missing values, either because of the nature of data acquisition or because they're doing simulations. I take zillions of measurements with digital oscilloscopes and they *never* have missing values. Clipped values, yes, but even if I somehow could query the scope about which values were actually clipped, or simply make an educated guess based on their value, the facilities of ma would be useless to me. The clipped values are what I would want in any case. I also do a lot of work with simulations derived from this and other data. I don't come across missing values here, but again, if I did, the way ma works would not help me. I'd have to treat them either by rejecting the data outright or by some sort of interpolation.
This may be a false dichotomy. It's certainly not obvious to me that this is so. At least if "bolted on" means "not adding a filled method to ndarray".
I have no objection to ma support improving. In fact I think it would be great, although I don't foresee it helping me anytime soon. I also support Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably seamlessly. However, I do think the situation needs more thought. Slapping filled and mask onto ndarray is the path of least resistance, but it's not clear that it's the best one.

If we do decide we are going to add both of these methods to ndarray (with filled returning a copy!), then it may be worth considering making ndarray a subclass of MaskedArray. Conceptually this makes sense, since at that point an ndarray will just be a MaskedArray where mask is always False. I think that they could share much of the implementation, except that ndarray would be set up to use methods that ignored the mask attribute, since they would know that it's always False. Even that might not be worth it, since the check for whether mask is True/False is just a pointer compare.

It may in fact be best just to do away with MaskedArray entirely, moving the functionality into ndarray. That may have performance implications, although I don't see them at the moment, and I don't know if there are other methods/attributes that this would imply need to be moved over, although it looks like just mask, filled and possibly fill_value, although the latter looks a little dubious to me.

Either of the above two options would certainly improve the quality of MaskedArray. Copy, for instance, seems not to have been implemented, and who knows what other dark corners remain unexplored here.

There's a whole spectrum of possibilities here, from ones that don't intrude on ndarray at all to ones that profoundly change it. Sasha's suggestion looks like it's probably the simplest thing in the short term, but I don't know that it's the best long-term solution. I think it needs more thought and discussion, which is after all what Sasha asked for ;) Regards, -tim
![](https://secure.gravatar.com/avatar/7e9e53dbe9781722d56e308c32387078.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Tim, The point is well-taken, and I apologize. I stated my case badly. (I would be delighted if I did not have to be concerned with missing values-they are a pain regardless of how well a numerical package handles them.)
I probably overstated it, but I think we actually agree. I intended to lend support to the priority of making missing-value support as seamless and painless as possible. It will help some people, and not others.
This is exactly the option that I was afraid to bring up because I thought it might be too disruptive, and because I am not contributing to numpy, and probably don't have the competence (or time) to do so.
Exactly! Thank you for broadening the discussion. Eric
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
Completely agree. I have many gripes about the current ma implementation of both "filled" and "mask".

filled:
1. I don't like the default fill value. It should be mandatory to supply a fill value.
2. It should return a masked array (with a trivial mask), not an ndarray.
3. The name conflicts with the "fill" method.
4. View/copy inconsistency. It does not provide a method to fill values in-place.

mask:
1. I've got rid of mask returning None in favor of False_ (boolean array scalar), but it is still not perfect. I would prefer a data.shape == mask.shape invariant and, if space saving/performance is deemed necessary, the use of zero-stride arrays.
2. I don't like the name. "Missing" or "na" would be better.
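For reference, a sketch of the behaviour being griped about, as it still looks in today's numpy.ma (the silent default fill value and the plain-ndarray return type are gripes 1 and 2):

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[0, 1, 0])

print(a.fill_value)    # dtype-dependent default (999999 for integers)
print(a.filled())      # no argument -> the default is used silently
print(a.filled(-1))    # [ 1 -1  3] with an explicit value

# the result is a plain ndarray, not a masked array with a trivial mask
print(isinstance(a.filled(-1), ma.MaskedArray))  # False
```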
The tail becoming the dog! Yet I agree, this makes sense from the implementation point of view. From OOP perspective this would make sense if arrays were immutable, but since mask is settable in MaskedArray, making it constant in the subclass will violate the substitution principle. I would not object making mask read only, however.
I think MA can coexist with ndarray and share the interface. Ndarray can use special bit-patterns like IEEE NaN to indicate missing floating point values. Add-on modules can redefine arithmetic to make INT_MIN behave as a missing marker for signed integers (R, K and J (I think) languages use this approach). Applications that need missing values support across the board will use MA.
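A rough illustration of the bit-pattern idea with plain ndarrays, under the assumption that NaN marks a missing float and INT_MIN (an R-style sentinel, not anything numpy itself defines) marks a missing signed integer:

```python
import numpy as np

# floats: NaN as the missing marker
x = np.array([1.0, np.nan, 3.0])
print(x.sum())       # nan -- ordinary arithmetic propagates the marker
print(np.nansum(x))  # 4.0 -- the "ignore missing" variant

# signed ints: an INT_MIN sentinel has to be handled by hand in numpy
INT_MIN = np.iinfo(np.int64).min
y = np.array([1, INT_MIN, 3], dtype=np.int64)
print(y[y != INT_MIN].sum())  # 4
```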
More (corners) than you want to know about! Reimplementing MA in C would be a worthwhile goal (and what you suggest seems to require just that), but it is too big of a project. I suggest that we focus on the interface first. If the existing MA interface is rejected for ndarray (which is likely), we can easily experiment with alternatives within MA, which is pure python.
Exactly!
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
3. The name conflicts with the "fill" method. fillmask ? clog ?
Er... How many of us are using MA on a regular basis ? Aren't we a minority ? It'd seem wiser to adapt MA to numpy, in Python (but maybe that's the XIXe French integration model I grew up with that makes me talk here...)
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
That makes perfect sense. If anything should have a default fill value, it's the function calling filled, not the arrays themselves.
2. It should return masked array (with trivial mask), not ndarray.
So, just with mask = False? In a follow-on message Pierre disagrees and claims that what you really want is the ndarray, since not everything will accept a masked array. Then I guess you'd need to call b.filled(fill).data. I agree with Sasha in principle but with Pierre, perhaps, in practice. I almost suggested it get renamed a.asndarray(fill), except that asXXX has the wrong connotations. I think this one needs to bounce around some more.
3. The name conflicts with the "fill" method.
I thought you wanted to kill that. I'd certainly support that. Can't we just special case __setitem__ for that one case so that the performance is just as good if performance is really the issue?
4. View/Copy inconsistency. Does not provide a method to fill values in-place.
b[b.mask] = fill_value; b.unmask() seems to work for this purpose. Can we just have filled return a copy?
Interesting idea. Is that feasible yet?
2. I don't like the name. "Missing" or "na" would be better.
I'm not on board here, although really I'd like to hear from other people who use the package. 'na' seems too cryptic to me and 'missing' too specific -- there might be reasons to mask a value other than it being missing. The problem with mask is that it's not clear whether True means the data is useful or unuseful. Keep throwing out names, maybe one will stick.
How do you set the mask? I keep getting attribute errors when I try it. And unmask would be a noop on an ndarray.
Perhaps MaskedArray should inherit from ndarray for the time being. Many of the methods would need to be reimplemented anyway, but it would make asanyarray work. Someone was just complaining about asarray munging his arrays. That's correct behaviour, but it would be nice if asanyarray did the right thing. I suppose we could just special-case asanyarray to pass MaskedArrays through; that might be better since it's less constraining from an implementation side too.
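This is in fact how things later worked out: in today's numpy, MaskedArray does inherit from ndarray, so asanyarray passes masked arrays through while asarray strips the mask. A quick sketch:

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[0, 1, 0])

print(type(np.asarray(a)).__name__)     # ndarray -- the mask is munged away
print(type(np.asanyarray(a)).__name__)  # MaskedArray -- subclass preserved
```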
This may be an opportune time to propose something that's been cooking in the back of my head for a week or so now: a stripped down array superclass. The details of this are not at all locked down, but here's a strawman proposal.

We add an array superclass, call it basearray, that has the same C-structure as the existing ndarray. However, it has *no* methods or attributes. It's simply a big blob of data. Functions that work on the C-structure of arrays (ufuncs, etc.) would still work on these arrays, as would asarray, so it could be converted to an ndarray as necessary. In addition, we would supply a minimal set of functions that would operate on this object. These functions would be chosen so that the current array interface could be implemented on top of them and the basearray object in pure python. These functions would be things like set_shape(a, shape), etc. They would be segregated off in their own namespace, not in the numpy core. [Note that I'm not proposing we actually implement ndarray this way, just that we make it possible.]

This leads to several useful outcomes.

1. If we're careful, this could be the basic array object that we propose, at least for the first round, for inclusion in the Python core. It's not useful for anything but passing data between various applications that understand the data structure, but that in itself could be a huge win. And the fact that it's dirt simple would probably be an advantage to getting it into the core.

2. It provides a useful marker class. MA could inherit from it (and use it for its data attribute) and then asanyarray would behave properly. MA could also use this, or a subclass, as the mask object, preventing anyone from accidentally using it as data (they could always use it on purpose with asarray).

3. It provides a platform for people to build other, ndarray-like classes in pure python. This is my main interest.
I've put together a thin shell over numpy that strips it down to its absolute essentials, including a stripped down version of ndarray that removes most of the methods. All of the __array_wrap__[1] stuff works quite well most of the time, but there are still some issues with being a subclass when this particular class is conceptually a superclass. If we had an array superclass of some sort, I believe that these would be resolved. In principle at least, this shouldn't be that hard. I think it should mostly be rearranging some code and adding some wrappers to existing functions. That's in principle. In practice, I'm not certain yet, as I haven't investigated the code in question in much depth. I've been meaning to write this up into a more fleshed out proposal, but I got distracted by the whole Protocol discussion on python-dev3000. This writeup is pretty weak, but hopefully you get the idea. Anyway, this is something that I would be willing to put some time into that would benefit both me and probably the MA folks as well. Regards, -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
Just for the record: currently MA does not inherit from ndarray. There are some benefits to be gained from changing MA's design from containment to inheritance, but I am very skeptical about the use of inheritance in the array setting.
This is a very worthwhile idea and I hate to see it buried in a non-descriptive thread. I've copied your proposal to the wiki at <http://projects.scipy.org/scipy/numpy/wiki/ArraySuperClass>.
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
Right, I checked that. That's why asanyarray won't work now with MA (unless someone changed the implementation while I wasn't looking).
That's probably a sensible position. Still, it would be nice to have asanyarray pass masked arrays through somehow. I haven't thought this through very well, but I wonder if it would make sense for asanyarray to pass through any object that supplies __array__. I'm leery of special-casing asanyarray just for MA; somehow that seems the wrong approach.
Thanks for doing that. I'm glad you like the general idea. I do plan to write it up and try to get a better handle on what this would entail and what the consequences would be. However, I'm not sure exactly when I'll get around to it, so it's probably better that a rough draft be out there for people to think about in the interim. -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
It looks like we are getting close to a consensus on this one. I will remove fill_value attribute. [...]
I'll propose a patch.
+1
It is not feasible in pure python module like ma, but easy in ndarray. We can also reset the writeable flag to avoid various problems that zero strides may cause. I'll propose a patch.
The problem with the "mask" name is that ndarray already has unrelated "putmask" method. On the other hand putmask is redundant with fancy indexing. I have no other problem with "mask" name, so we may just decide to get rid of "putmask".
[...] How do you set the mask? I keep getting attribute errors when I try it.
a[i] = masked makes the i-th element masked. If mask is an array, you can just set its elements.
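A minimal sketch of this idiom against today's numpy.ma, where the default mask is soft, so assigning a value also un-masks the element:

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0])
a[1] = ma.masked        # mask the i-th element in place
print(a.mask.tolist())  # [False, True, False]

a[1] = 2.5              # with a soft mask, assigning a value unmasks it
print(a.mask.tolist())  # [False, False, False]
print(a[1])             # 2.5
```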
And unmask would be a noop on an ndarray.
Yes. [...]
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
Well, if 'mask' became a default attribute of ndarray, that wouldn't be a problem any longer. I'm quite for that.
tondarray(fill) ?
Yes !
The problem with mask is that it's not clear whether True means the data is useful or unuseful.
I have to think twice every time I want to create a mask: for MA, True means in fact that I don't want the data, whereas True selects the data for ndarray indexing...
"putmask" really seems overkill indeed. I wouldn't miss it.
How do you set the mask? I keep getting attribute errors when I try it. And unmask would be a noop on an ndarray.
I've implemented something like that for some classes (inheriting from MA.MaskedArray). Never really used it yet, though:

    #--------------------------------------------
    def applymask(self, m):
        if not MA.is_mask(m):
            raise MA.MAError, "Invalid mask !"
        elif self._data.shape != m.shape:
            raise MA.MAError, "Mask and data not compatible."
        else:
            self._dmask = m
That'd be great indeed, and may solve some problems reported on the list about subclassing ndarray. AAMOF, I gave up trying to use ndarray as a superclass, and rely only on MA.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Pierre GM wrote:
decide to get rid of "putmask".
"putmask" really seems overkill indeed. I wouldn't miss it.
I'm not opposed to getting rid of putmask either. Several of the newer methods are open for discussion before 1.0. I'd have to check to be sure, but .take and .put are not entirely replaced by fancy-indexing. Also, fancy indexing has enough overhead that a method doing exactly what you want is faster. -Travis
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
This is in essence what I've been proposing since SciPy 2005. I want what goes into Python to be essentially just this super-class. Look at this http://numeric.scipy.org/array_interface.html and check out this svn co http://svn.scipy.org/svn/PEP arrayPEP I've obviously been way over-booked to do this myself. Nick Coughlan expressed interest in this idea (he called it dimarray, but I like basearray better).
Why not give it the attributes corresponding to its C-structure? I'm happy with no methods, though.
The only extra thing I'm proposing is to add the data-descriptor object into the Python core as well --- other-wise what do you do with PyArray_Descr * part of the C-structure?
This is exactly what needs to be done to improve array-support in Python. This is the conclusion I came to and I'm glad to see that Tim is now basically having the same conclusion. There are obviously some details to work out. But, having a base structure to inherit from would be perfect. -Travis
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
I'll look these over. I suppose I should have been paying more attention before!
Mainly because I didn't want to argue too much about whether a given method or attribute was a good idea, and I was in a hurry when I tossed that proposal out. It seemed better to start with the most stripped down proposal I could come up with and see what people demanded I add. I'm actually sort of inclined to give it *read-only* attributes associated with the C-structure, but no methods. That way you can examine the shape, type, etc., but you can't set them [I'm specifically thinking of shape here, but there may be others]. I think that there are cases where you don't want the base array to be mutable at all, but I don't think introspection should be a problem. If the attributes were settable, you could always override them with read-only properties, but it'd be cleaner to just start with read-only functionality and add settability (is that a word?) only in those cases where it's needed.
Good point.
Hmm. This idea seems to have a fair bit of consensus behind it. I guess that means I better looking into exactly what it would take to make it work. The details of what attributes to expose, etc are probably not too important to work out immediately. Regards, -tim
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
Folks, I'm more or less in Eric's field (hydrology), and we do have to deal with missing values that we can't interpolate straightforwardly (that is, without some dark statistical magic). Purely discarding the data is not an option either. MA fills the need, most of it.

I think one of the issues is what is meant by 'masked data':
- a missing observation?
- a NaN?
- a data point we don't want to consider at one particular place?

For the last point, think about raster maps or bitmaps: calculations should be performed on a chunk of data, the initial data left untouched, and the result should both have the same size as the original and be valid only on the initial chunk. The current MA implementation, with its _data part and its _mask part, works nicely for the 3rd point.

- I wonder whether implementing a 'filled' method for ndarrays is really better than letting the user create a MaskedArray where the NaNs are masked. In any case, a 'filled' method should always return a copy, as it's no longer the initial data.

- I'm not sure what to do with the idea of making ndarray a subclass of MA. On one side, Tim pointed out rightly that an ndarray is just a MA with a 'False' mask. Actually, I'm a bit frustrated with the standard 'asarray' that shows up in many functions. I'd prefer something like "if the argument is a non-numpy sequence (tuples, lists), transform it into an ndarray, but if it's already an ndarray or a MA, leave it as it is. Don't touch the mask if present." That's how MA.asarray works, but unfortunately the std "asarray" gets rid of the mask (and you end up with something which is not what you'd expect). A 'mask=False' attribute in ndarray would be nice. On the other, some methods/functions make sense only on unmasked ndarrays (FFT, solving equations), and some others are a bit tricky to implement (diff? median...).
Some exception could be raised if the arguments of these functions return True with ismasked (cf. below), or that could be simplified if 'mask' was a default attribute of ndarrays. I regularly have to use an ismasked function:

    def ismasked(a):
        if hasattr(a, 'mask'):
            return a.mask.any()
        else:
            return False

We're going towards MA as the default object. But then again, what would be the behavior for dealing with missing values? Using R-like na.actions? That'd be great, but it's getting more complex. Oh, and another thing: if 'mask', or 'masked', becomes a default attribute of ndarrays, how do we define a mask? As a boolean ndarray whose 'mask' is always 'False'? How do you __repr__ it?

- I agree that 'fill_value' is not very useful. If I want to fill an array, I'm happy to specify what value I want it filled with. In fact, I'd be happier to specify 'values'. I often have to work with 2D arrays, each column representing a different variable. If this array has to be filled, I'd like each column to be filled with one particular value, not necessarily the same along all columns: something like

    column_stack([A[:,k].filled(filler[k]) for k in range(A.shape[1])])

with filler a 1 x A.shape[1] array of filling values. Of course, we could imagine the same thing for rows, or higher dimensions... Sorry for the rants...
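The column-wise fill above can be sketched in runnable form (the filler values here are made up for illustration; note that in current numpy.ma, filled also accepts a broadcastable array as the fill value, which makes the loop unnecessary):

```python
import numpy as np
import numpy.ma as ma

A = ma.array([[1.0, 2.0],
              [3.0, 4.0]], mask=[[0, 1], [1, 0]])
filler = [-1.0, -2.0]  # one fill value per column

# per-column fill via an explicit loop, as in the post
out = np.column_stack([A[:, k].filled(filler[k]) for k in range(A.shape[1])])
print(out)
# [[ 1. -2.]
#  [-1.  4.]]

# modern shortcut: a fill value that broadcasts over the array
print(A.filled(np.array(filler)))
```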
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
... We're going towards MA as the default object.
I will be against changing the array structure to handle missing values. Let's keep the discussion focused on the interface. Once we agree on the interface, it will be clear whether any structural changes are necessary.
But then again, what would be the behavior to deal with missing values ?
We can postpone this discussion as well. Adding a mask attribute that returns False and a filled method that returns a copy is an example of a minimalistic change.
Using R-like na.actions ? That'd be great, but it's getting more complex.
I don't like na.actions. I think missing values should behave like IEEE NaNs and in the floating point case should be represented by NaNs. The functionality provided by na.actions can always be achieved by calling an extra function (filled or compress).
See above. For ndarray mask is always False unless an add-on module is loaded that redefines arithmetic to recognize special bit-patterns such as NaN or INT_MIN.
![](https://secure.gravatar.com/avatar/38d5ac232150013cbf1a4639538204c0.jpg?s=120&d=mm&r=g)
Hi, On 4/7/06, Sasha <ndarray@mac.com> wrote:
I think that the usage of MA is important because this often dictates the interface. The other aspect is the penalty that is imposed by requiring masked features, especially in situations that don't need any of them.
I think the issue is related to how masked values should be handled in computation. Does it matter if the result of an operation is due to a masked value or to a numerical problem (like dividing by zero)? (I am presuming that it is possible to identify this difference.) If not, then I support the idea of treating masked values as NaN.
The functionality provided by na.actions can always be achieved by calling an extra function (filled or compress).
I am not clear on what you actually mean here. For example, if you are summing across a particular dimension, I would presume that any masked value would be ignored and that there would be some record of the fact that a masked value was encountered. This would allow that 'extra function' to handle the associated result. Alternatively, the 'extra function' would have to be included as an argument -- which is what the na.actions do. Regards Bruce
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Bruce Southey <bsouthey@gmail.com> wrote:
The IEEE standard provides plenty of spare bits in NaNs to represent pretty much everything, and some languages take advantage of that feature. (I believe NA and NaN are distinct in R.) In MA, however, mask elements are boolean and no distinction is made between the various reasons for not having a data element. For consistency, a non-trivial (not always false) implementation of ndarray.mask should return "not finite" and ignore the bits that distinguish NaNs and infinities.
If you sum along a particular dimension and encounter a masked value, the result is masked. The same is true if you encounter a NaN -- the result is NaN. If you would like to ignore masked values, you write a.filled(0).sum() instead of a.sum(). In the 1d case, you can also use a.compressed().sum(). In other words, what in R you achieve with a flag, such as in sum(a, na.rm=TRUE), in numpy you achieve with an explicit call to "filled". This is not quite the same as na.actions in R, but that is what I had in mind.
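For the record, a sketch of the two spellings in today's numpy.ma, which ended up keeping the skip-masked-by-default behaviour Pierre defends below; the explicit-fill spelling works as well, and dropping masked entries is spelled compressed():

```python
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0], mask=[0, 1, 0])

print(a.sum())               # 4.0 -- masked value ignored by default
print(a.filled(0).sum())     # 4.0 -- the explicit spelling
print(a.compressed().sum())  # 4.0 -- 1d: drop masked entries first
```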
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
If you sum along a particular dimension and encounter a masked value, the result is masked.
That's not how it currently works (still on 0.9.6):

    x = arange(12).reshape(3,4)
    MA.masked_where((x%5==0) | (x%3==0), x).sum(0)
    array(data = [12 1 2 18], mask = [False False False False], fill_value=999999)

and frankly, I'd be quite frustrated if it had to change:
- `filled` is not an ndarray method, which means that a.filled(0).sum() fails if a is not MA. Right now, I can use a.sum() without having to check the nature of a first.
- This behavior was already in Numeric.
- All my scripts rely on it (but I guess that's my problem).
- The current way reflects how masks are used in GIS or image processing.
Once again, Sasha, I'd agree with you if it wasn't a major difference
I kinda like the idea of a flag, though
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
    ma.array([1,1], mask=[0,1]).sum()
    1
This is exactly the point of the current discussion: make filled a method of ndarray. With the current behavior, how would you get a masked result (no fill) from a.sum()?
- this behavior was already in Numeric
That's true, but it makes the result of sum(a) different from __builtins__.sum(a). I believe consistency with the python conventions is more important than with legacy Numeric in the long run.
[...]
- The current way reflects how mask are used in GIS or image processing.
Can you elaborate on this? Note that in R na.rm is false by default in sum:
sum(c(1,NA)) [1] NA
So it looks like the convention is different in the field of statistics.
Array methods are a very recent addition to ma. We can still use this window of opportunity to get things right before too many people get used to the wrong behavior. (Note that I changed your implementation of cumsum and cumprod.)
With the flag approach, making the ndarray and ma.array interfaces consistent would require adding an extra argument to many methods. Instead, I propose to add one method, filled, to ndarray.
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
    MA.array([[1,1],[1,1]], mask=[[0,1],[1,0]]).sum()
    array(data = [1 1], mask = [False False], fill_value=999999)

    MA.array([[1,1],[1,1]], mask=[[0,1],[1,1]]).sum()
    array(data = [1 999999], mask = [False True], fill_value=999999)

With a.filled(0).sum(), how would you distinguish between the cases where (a) at least one value is not masked and (b) all values are masked? (OK, by querying the mask with something in the line of a._mask.all(axis), but it's longer... Oh well, I'll just have to adapt.)
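Pierre's two cases, re-run as a sketch against current numpy.ma, which kept this behaviour: a fully masked column sums to masked, and the mask.all(axis) query distinguishes the cases:

```python
import numpy.ma as ma

a = ma.array([[1, 1], [1, 1]], mask=[[0, 1], [1, 1]])
s = a.sum(axis=0)

print(s)                   # [1 --]: the second column is entirely masked
print(a.mask.all(axis=0))  # [False  True] -- the all-masked test
```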
Good points... We'll just have to put strong warnings everywhere.
MMh. *digs in his old GRASS scripts* OK, my bad. I had to fill missing values somehow, or at least check whether there were any before processing. I'll double check on that. Please temporarily forget that comment.
On a semantic aspect: while digging through these GRASS scripts I mentioned, I realized/remembered that masked values are called 'null' there -- when there's no data, a NaN, or just when you want to hide some values. What about 'null' instead of 'mask', 'missing', 'na'?
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Pierre GM wrote:
Any number of reasons, I would think. It depends on what you're using the data for. If the sum is the total amount that you spent in the month, and a masked value means you lost that check stub, then you don't know how much you actually spent, and that value should be masked. To choose a boring example.
Actually, I'm going to ask you the same question. Why would you care if all of the values are masked? I may be missing something, but either there's a sensible default value, in which case it doesn't matter how many values are masked, or you can't handle any masked values and the result should be masked if there are any masks in the input. Sasha's proposal handles those two cases well. Your behaviour handles them a little more clunkily, but I'd like to understand why you want that behaviour. Regards, -tim
![](https://secure.gravatar.com/avatar/6c32e3d6cb67dac69ef7a3504b187a7c.jpg?s=120&d=mm&r=g)
OK, now I get it :)
I understand that, and I eventually agree it should be the default.
Masked values are not necessarily NaNs or missing. I quite regularly mask values that do not satisfy a given condition. For various reasons, I can't compress the array; I need to preserve its shape. With the current behavior, a.sum() gives me the sum of the values that satisfy the condition. If there's no such value, the result is masked, and that way I know that the condition was never met. Here, I could use Sasha's method combined with a._mask.all, no problem. Another example: let x be a 2D array with missing values, to be normalized along one axis. Currently, x/x.sum() gives the result I want (provided it's true division). Sasha's method would give me a completely masked array.
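Pierre's normalization example as a small sketch against modern numpy.ma: the masked entry is excluded from the sum, so the valid entries are normalized over the valid total only.

```python
import numpy.ma as ma

x = ma.array([2.0, 2.0, 4.0], mask=[0, 1, 0])

w = x / x.sum()        # x.sum() is 6.0 -- the masked 2.0 is skipped
print(w)               # [0.333... -- 0.666...]
print(float(w.sum()))  # 1.0 over the valid entries
```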
Your points are quite valid. I'm just worried it's gonna break a lot of things in the near future. And where do we stop? So, if we follow Sasha's way, x.prod() should be the same, right? What about a.min(), a.max()? a.mean()?
![](https://secure.gravatar.com/avatar/38d5ac232150013cbf1a4639538204c0.jpg?s=120&d=mm&r=g)
Hi, My view is solely as a user, so I really do appreciate the thought that you all are putting into this! I am somewhat concerned that having to use filled() is an extra level of complexity and computational burden. For example, in computing the mean/average, using filled would require one pass to get the sum and another to count the non-masked elements. For summation at least, would it make more sense to add an optional flag(s) so that there appears little difference between a normal array and a masked array? For example: a.sum() is the current default; a.sum(filled_value=x), where x is some value such as zero or another user-defined value; a.sum(ignore_mask=True) or similar to address whether or not masked values should be used. I am also not clear on what happens with other operations or dimensions. Regards Bruce On 4/10/06, Pierre GM <pierregm@engr.uga.edu> wrote:
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
As I understand it, the goal that Sasha is pursuing here is to make masked arrays and normal arrays interchangeable as much as practical. I believe that there is reasonable consensus that this is desirable. Sasha has proposed a compromise solution that adds minimal attributes to ndarray while allowing a lot of interoperability between ma and ndarray. However, it has its clunky aspects, as evidenced by the pushback he's been getting from masked array users.

Here's one example. In the masked array context it seems perfectly reasonable to pass a fill value to sum. That is: x.sum(fill=0.0). But, if you want to preserve interoperability, that means you have to add fill arguments to all of the ndarray methods, and what do you have? A mess! Particularly if some *other* package comes along that we decide is important to support in the same manner as ma. Then we have another set of methods or keyword args that we need to tack on to ndarray. Ugh!

However, I know who, or rather what, to blame for our problems: the object-oriented hype industry in general and Java in particular <0.1 wink>. Why? Because the root of the problem here is the move from functions to methods in numpy. I appreciate a nice method as much as the next person, but they're not always better than the equivalent function, and in this case they're worse.

Let's fantasize for a minute that most of the methods of ndarray vanished and instead we went back to functions. Just to show that I'm not a total purist, I'll let the mask attribute stay on both MaskedArray and ndarray. However, filled bites the dust on *both* MaskedArray and ndarray, just like the rest. How would we deal with sum then?
Something like this:

    # ma.py
    def filled(x, fill):
        x = x.copy()
        if x.mask is not False:
            x[x.mask] = fill
            x.unmask()
        return x

    def sum(x, axis, fill=None):
        if fill is not None:
            x = filled(x, fill)
        # I'm blowing off the correct treatment of the fill=None case here
        # because I'm lazy
        return add.reduce(x, axis)

    # numpy.py (or __init__ or oldnumeric or something)
    def sum(x, axis):
        if x.mask is not False:
            raise ValueError("use ma.sum for masked arrays")
        return add.reduce(x, axis)

[Fixing the fill=None case and dealing correctly with dtype is left as an exercise for the reader.]

All of a sudden, all of the problems we're running into go away. Users of masked arrays simply use the functions from ma and can use ndarrays and masked arrays interchangeably. On the other hand, users of non-masked arrays aren't burdened with the extra interface, and if they accidentally get passed a masked array they quickly find out about it (you don't want to be accidentally using masked arrays in an application that doesn't expect them -- that way lies disaster). I realize that railing against methods is tilting at windmills, but somehow I can't help myself ;-| Regards, -tim
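Tim's function-based layering can be sketched in runnable form against today's numpy (the names filled, masked_sum and strict_sum here mirror his strawman and are not real numpy API):

```python
import numpy as np
import numpy.ma as ma

def filled(x, fill):
    """Always return a plain ndarray copy, masked slots replaced by fill."""
    if isinstance(x, ma.MaskedArray):
        return x.filled(fill)
    return np.array(x, copy=True)

def masked_sum(x, axis=None, fill=None):
    """ma-flavoured sum: fill first if asked, otherwise skip masked values."""
    if fill is not None:
        return filled(x, fill).sum(axis)
    return ma.asarray(x).sum(axis)

def strict_sum(x, axis=None):
    """ndarray-flavoured sum: refuse masked input outright."""
    if isinstance(x, ma.MaskedArray) and x.mask is not ma.nomask:
        raise ValueError("use masked_sum for masked arrays")
    return np.asarray(x).sum(axis)

a = ma.array([1.0, 2.0, 3.0], mask=[0, 1, 0])
print(masked_sum(a, fill=0.0))  # 4.0
print(masked_sum([1.0, 2.0]))   # 3.0 -- plain sequences work too
```

Users who never touch masked arrays use strict_sum and get a loud error on accidental masked input, which is the separation Tim is after.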
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
[Tim rants a lot] Just to be clear, I'm not advocating getting rid of methods. I'm not advocating anything, that just seems to get me into trouble ;-) I still blame Java though. Regards, -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
I am just making your point with a shorter example.
It looks like there is little opposition here. I'll submit a patch soon and unless better names are suggested, it will probably go in.
With the current behavior, how would you achieve masking (no fill) with a.sum()? Er, why would I want to get MA.masked along one axis if one value is masked?
Because if you don't know one of the addends you don't know the sum. Replacing missing values with zeros is not always the right strategy. If you know that your data has non-zero mean, for example, you might want to replace missing values with the mean instead of zero.
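The trade-off Sasha describes is easy to demonstrate with present-day numpy.ma (a sketch for illustration; the 2006 MA module spelled some names differently):

```python
import numpy as np

# Three observed values and one missing value (the -999.0 is masked out)
x = np.ma.array([1.0, 2.0, -999.0, 4.0], mask=[False, False, True, False])

# Filling with zero biases the total when the data have non-zero mean
sum_zero = x.filled(0.0).sum()

# Filling with the mean of the observed values is often less biased;
# x.mean() here is computed over the unmasked entries only, i.e. 7/3
sum_mean = x.filled(x.mean()).sum()
```

With this data, `sum_zero` is 7.0 while `sum_mean` is 7.0 plus the observed mean, which is exactly the kind of difference that makes an implicit fill strategy dangerous.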
I did not realize that, but it is really bad. What is the justification for this? In R:
```
> sum(c(NA, NA), na.rm=TRUE)
[1] 0
```
What does MATLAB do in this case?
Exactly. Explicit is better than implicit. The Zen of Python <http://www.python.org/dev/peps/pep-0020>.
Do you agree with my proposal as long as we have explicit warnings in the documentation that methods behave differently from legacy functions?
[... GIS comment snipped ...]
I don't think "null" returning an array of bools will create a lot of enthusiasm. It sounds more like ma.masked as in a[i] = ma.masked. Besides, there is probably a reason why python uses the name "None" instead of "Null" - I just don't know what it is :-).
![](https://secure.gravatar.com/avatar/25899bc1947e1ce40b16e55631c2c94a.jpg?s=120&d=mm&r=g)
On 4/11/06, Sasha <ndarray@mac.com> wrote:
I feel that in general implicitly replacing masked values will definitely lead to bugs in my code. Unless it is really obvious what the best way to deal with the masked values is for the particular function, then I would definitely prefer to be explicit about it. In most cases there are a number of reasonable options for what can be done. Masking the result when masked values are involved seems the most transparent default option. For example, it gives me a really bad feeling to think that sum will automatically return the sum of all non-masked values. When dealing with large datasets, I will not always know when I need to be careful of missing values. Summing over the non-masked values will often not be the appropriate course, and I fear that I will not notice that this has actually occurred. If masked values are returned it is pretty obvious what has happened, and easy to go back and explicitly handle the masked data in another way if appropriate. Mike
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Sasha wrote:
Supporting missing values is a useful thing (but not for every usage of arrays). Thus, ultimately, I see missing-value arrays as a solid sub-class of the basic array class. I'm glad Sasha is working on missing value arrays and have tried to be supportive. I'm a little hesitant to add a special-case method basically for one particular sub-class, though, unless it is the only workable solution. We are still exploring this whole sub-class space and have not really mastered it... -Travis
![](https://secure.gravatar.com/avatar/ccb440c822567bba3d49d0ea2894b8a1.jpg?s=120&d=mm&r=g)
In article <d38f5330604071219j6a5adbdw4a300ed10a26a445@mail.gmail.com>, Sasha <ndarray@mac.com> wrote:
I completely agree with this. I would really like to see proper native support for arrays with masked values in numpy (such that all ufuncs, functions, etc. work with masked arrays). I would be thrilled to be able to filter masked arrays, for instance. -- Russell
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Tim makes a good point here. Should the reshape method be fixed to always return a copy? The semantics a.shape = (...) could still be used to re-shape contiguous arrays where possible. However, whether or not reshape returns a copy is consistent (but perhaps not explicitly explained). We will still have .ravel() which sometimes copies and sometimes doesn't. -Travis
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 3/22/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Reshape never copies the data:
The only inconsistency is that
I agree that this is unnecessary, but don't see much of a problem. +0 here
However, whether or not reshape returns a copy is consistent (but perhaps not explicitly explained).
To me consistency means "is independent of the input." Whether or not reshape creates a new python object depends on the value of the argument. I would call it inconsistency.
We will still have .ravel() which sometimes copies and sometimes doesn't.
Ravel should be a shortcut for x.reshape((x.size,)), so it is really the same question. +0 (to make ravel always return a new python object)
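The copy-vs-view ambiguity under discussion is easy to see in today's numpy (a sketch; the point is that reshape's behavior depends on the input's memory layout):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Contiguous input: reshape returns a view, so writes propagate back
v = a.reshape(3, 2)
v[0, 0] = 99
assert a[0, 0] == 99

# Non-contiguous input (a transpose): reshape has to copy,
# so writes to the result leave the original untouched
c = a.T.reshape(6)
c[0] = -1
assert a[0, 0] == 99
```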
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
And so articulately stated too ;)
My opinion is that all methods and functions should either:

1. Always return a copy.
2. Always return a view.
3. Return a view if possible, otherwise raise an exception.

So, like Sasha, I'd like to see ravel changed as well. I don't really care if it's to 1 or 3 though.

-tim
![](https://secure.gravatar.com/avatar/5a7d8a4d756bb1f1b2ea729a7e5dcbce.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Well, but is copy/view the /only/ invariant worth guaranteeing? I think there is a valid need for functions which ensure other invariants, such as contiguity. There are applications (such as passing pointers to C/Fortran libraries which don't have striding mechanisms but will not modify their inputs) which require contiguous inputs, but where one would rather make a copy only if necessary. My take on this is that we should /document/ clearly what invariants any given function satisfies, but I think the 'always view/always copy' view excludes an important usage case. There may be others beyond contiguity, but that's the one that pops immediately to mind. Cheers, f
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
This is a different case, I think. The result of this copy is not user visible except in terms of performance. I'm only concerned with functions that *return* copies or views depending on the input. I don't care if a function sometimes makes a copy under the covers but doesn't return it.
I don't think we're in disagreement here, although I'm not sure. I will add, on the subject of contiguity, that I think there should be a function 'ascontiguous' that parallels asarray but ensures that the result is contiguous. Although this sometimes returns a copy, I think that's OK since that's its job. I would like to see all of the implicit copying pushed into functions like asarray and ascontiguous.

This also helps efficiency. Imagine I have some calls to functions that require contiguous arrays and do copies under the covers if their args are not contiguous. In that case:

```python
a = ascontiguous(a)
param1 = computeSomethingOnContiguousData(a)
param2 = computeSomethingElseOnContiguousData(a)
# etc.
```

will be much more efficient than the equivalent code without the ascontiguous when the initial a value is discontiguous.

Regards,

-tim
![](https://secure.gravatar.com/avatar/5a7d8a4d756bb1f1b2ea729a7e5dcbce.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
I think we agree: something like ascontiguous() is precisely what I had in mind (I think that's what ravel() does today, but I'm fine if it gets a new name, as long as the functionality exists). Obviously a function like this should explicitly (docstring) say that it does NOT make any guarantees about whether its return value is a view or a copy, just that it's contiguous. Cheers, f
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
a.ravel() seems to be equivalent to reshape(a, [-1]). That is, it returns a flattened, contiguous copy. ascontiguous(a) would be slightly different in that it would preserve the shape of a. In fact I think it would look a lot like:

```python
def ascontiguous(a):
    """ascontiguous(a) -> contiguous representation of a.

    If 'a' is already contiguous, it is returned unchanged.
    Otherwise, a contiguous copy is returned.
    """
    a = asarray(a)
    if not a.flags['CONTIGUOUS']:
        a = array(a)
    return a
```
I agree. Regards, -tim
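Modern numpy ships essentially the ascontiguous function sketched above as np.ascontiguousarray; a quick check of the pass-through behavior discussed here:

```python
import numpy as np

a = np.arange(6).reshape(2, 3).T      # a transpose, hence not C-contiguous
assert not a.flags['C_CONTIGUOUS']

b = np.ascontiguousarray(a)           # must copy to make it contiguous
assert b.flags['C_CONTIGUOUS']

c = np.ascontiguousarray(b)           # already contiguous: returned as-is
assert c is b
```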
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
I was just looking at the interface for array and asarray to see what other stuff should go in the interface of the hypothetical ascontiguous. There's 'dtype', which I knew about, and 'fortran', which I didn't, but which makes sense. However, there's also 'ndmin'. First off, it's not described in the docstring for asarray, but I was able to find it in the docstring for array without a problem. Second, is it really necessary? It seems to be useful in an awfully narrow set of circumstances, particularly since when you are padding axes not everyone wants to pad to the left. It would seem to be more useful to ditch the ndmin and have some sort of paddims function that was more full featured (padding to either the left or the right at a minimum). I'm not entirely sure what the best interface to such a beast would look like, but a simple tactic would be to just provide leftpaddims and rightpaddims. If it's not already clear by now ;), I prefer several narrow interfaces to one broad one. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Padding to the left is "default" behavior for broadcasting, and so it seems appropriate. This is how all lower-dimensional arrays are interpreted as "higher" dimensional arrays throughout the code. The ndmin is very handy, as attested to by the uses of atleast_1d or atleast_2d in numpy library code. It was added later as an optimization step because of the number of library routines that were using it. I've since used it several times to simplify code. I think an ascontiguous on the Python level is appropriate since such a beast exists on the C-level. On the other hand, while Tim prefers narrow interfaces, the array_from_object interface is traditionally pretty broad. Thus, in my mind, the array call should get another flag keyword that forces a contiguous result. This translates easily to the C-domain, in much the same way as the fortran keyword does. -Travis
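A brief illustration of the left-padding behavior Travis describes, using names that exist in numpy today:

```python
import numpy as np

# ndmin pads new axes on the left, matching broadcasting's rule for
# interpreting lower-dimensional arrays as higher-dimensional ones
a = np.array([1, 2, 3], ndmin=2)
assert a.shape == (1, 3)

# atleast_2d does the same left-padding
b = np.atleast_2d([1, 2, 3])
assert b.shape == (1, 3)
```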
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
That makes some sense.
OK, I'll take your word for it.
This doesn't bother me since I long ago gave up any hope that the array constructor would have a narrow interface.
This translates easily to the C-domain, in much the same way as the fortran keyword does.
I'll buy that. While I accept array() needs a wide interface, I still prefer to keep as many other interfaces as possible narrow. In particular, is ndmin widely used in asarray? Or do the library routines generally use array instead? Given the choice I'd sweep as much of the dust, AKA wideness, into array() as possible, since that's irredeemably wide anyway, and keep the other interfaces as narrowly focused as possible. Put another way, asarray and ascontiguous are about clarity of intent. With too much extra baggage, the intent becomes obscured. The coupling seems tight enough for dtype and fortran, but once you get to ndmin, it seems that you might as well go with the big guns and break out "array(x, copy=False, ndmin=n)". That's my $0.02 on this subject and I'll shut up about it now. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Not necessarily sane advice :-) --- I might be overstating things. I know that atleast_1d and atleast_2d are used all over the place in scipy. This makes sense and I can certainly understand it. I'm willing to modify things to give narrow interfaces. Right now, since requesting both fortran and contiguous does not make sense, setting the fortran flag to False enforces C-style contiguous while setting it to True enforces fortran-style. Setting it to None (the default) specifies you don't care, and the behavior will be to create C-style contiguous for new arrays and use the striding specified by the array if it's already an array. I admit that it is arguable whether or not the fortran flag should be overloaded like this. There are now ascontiguous and asfortran functions with fairly minimal interfaces to make it simpler. There is also a check to make sure that array is called with no more than 2 non-keyword arguments. Thus, you won't be able to confuse which flag is which. -Travis
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 3/23/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Thus, in my mind, the array call should get another flag keyword that forces a contiguous result.
Please don't! The fortran flag is bad enough, but has too much history behind it. Let's not breed boolean parameters. Sooner or later someone will use keyword arguments positionally and you will end up guessing what array([1,2], int8_, 1, 1, 0, 0) means.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Sasha wrote:
There are several boolean flags in the interface already. Adding another one won't change the current situation that you describe. There are several ways to handle this. For one, we could force the use of keyword arguments, so that the position problem does not arise. Sasha has mentioned in the past a strides array argument, but I think the default fortran and contiguous strides cases need better support than just one of many possible stridings, so I wouldn't go that direction here. I'm debating whether or not the fortran flag should be used to specify both contiguous and fortran cases. Right now, the fortran argument is a three-case flag with dont-care, True, and False arguments. It seems natural to have True mean force-fortran and False mean force-contiguous, with dont-care (the default) meaning take an array already given (or create a C-contiguous array if we are generating a new array from another object). At any rate, if the fortran flag is there, we need to specify the contiguous case as well. So, either propose a better interface (we could change it still --- the fortran flag doesn't have that much history) to handle the situation or accept what I do ;-) -Travis
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 3/23/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Let me try. I propose to eliminate the fortran flag in favor of a more general "strides" argument. This argument can be either a sequence of integers that becomes the strides, or a callable object that takes shape and dtype arguments and return a sequence that becomes the strides. For fortran and c order functions that generate appropriate stride sequences should be predefined to enable array(..., strides=fortran, ...) and array(..., strides=contiguous).
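The strides-callable proposal was never adopted, but custom strides can be sketched with today's numpy via np.lib.stride_tricks.as_strided (a later API, used here only to illustrate fortran-style strides; the 8-byte itemsize is assumed explicitly):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(6, dtype=np.int64)      # itemsize is 8 bytes

# Fortran-order strides for a 2x3 array: the row index varies fastest,
# so element (i, j) lives at byte offset i*8 + j*16
f = as_strided(a, shape=(2, 3), strides=(8, 16))
assert f[1, 0] == 1
assert f[0, 1] == 2
assert np.array_equal(f, np.arange(6).reshape(2, 3, order='F'))
```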
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
I like the idea of being able to create an array with custom strides. The applications aren't entirely clear yet, but it does seem like it could have some interesting and useful consequences. That said, I don't think this belongs in 'array'. Historically, array has been used for all sorts of array creation activities, which is why it always seems to have a wide, somewhat incoherent interface. However, most uses of array() boil down to one thing: creating a *new* array from a python object. My preference would be to focus on that functionality for array() and spin off its other historical uses, and new uses like this custom strided array stuff, into separate factory functions. For example (and just for example, I make no great claims for either this name or interface):

```python
a = array_from_data(a_buffer_object, dtype, dims, strides)
```

One thing that you do make clear is that contiguous and fortran should really be two values of the same flag. If you combine this with one other simplification (array() always copies), we end up with a nice thin interface:

```python
# Create a new array in 'order' order. Defaults to "C" order.
array(object, dtype=None, order="C"|"FORTRAN")
```

and

```python
# Returns an array. If object is an array and order is satisfied,
# return object; otherwise a new array. If order is set, the returned
# array will be contiguous and have that ordering.
asarray(object, dtype=None, order=None|"C"|"FORTRAN")

# Just the same, but allow subtypes.
asanyarray(object, dtype=None, order=None|"C"|"FORTRAN")
```

You could build asarray, asanyarray, etc. on top of the proposed array without problems by using type(object) == ndarray and isinstance(object, ndarray) respectively. Stuff like convenience functions for ndmin would also be easy to build on top of these. This looks great to me (pre-coffee). Embrace simplicity: you have nothing to lose but your clutter ;)

Regards,

-tim
![](https://secure.gravatar.com/avatar/b24e93182e89a519546baa7bafe054ed.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Please see the transpose example above.
I feel that [***] above is much cleaner than this. I suggest that string constants be deprecated.
If [***] above were adopted, it would still be helpful to adopt numarray's iscontiguous method, or better, use a property. colin W.
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Colin J. Williams wrote:
This is true, but irrelevant. To the best of my knowledge, the only reason to force an array to be in a specific order is to pass it to a C function that expects either FORTRAN- or C-ordered arrays. And, in that case, the array also needs to be contiguous. So, for the purpose of creating arrays (and for the purposes of ascontiguous), the only cases that matter are arrays that are both contiguous and the specified order. Thus, specifying contiguity and order separately to the constructor needlessly complicates the interface. Or, since I'm feeling jargon happy today, YAGNI.
I'm no huge fan of string constants myself, but I think you need to think this through more. First off, the interface I tossed off above doesn't cover the same ground as array, since it works off an already created buffer object. That means you'd have to go through all sorts of contortions and do at least one copy to get data into Fortran order. You could allow arbitrary, 1D, python sequences instead, but that doesn't help the common case of converting a 2D python object into a 2D array. You could allow N-D python objects, but then you have two ways of specifying the dims of the object and things become a big crufty mess. Compared to that, string constants are great.
-0. In my experience, 99% of my use cases would be covered by ascontiguous, and for the remaining 1% I'm happy to use a.flags.contiguous. Regards, -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Removing the copy flag will break a lot of code because it's been around for a long time. This is also not an "easy thing" to add to convertcode.py though I suppose 90% of the cases could be found. We would also have to re-write asarray to be an additional C-function to make it not copy but make array copy. So, for now I'm not as enthused about that idea. -Travis
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
Great!
I kinda figured on that. But I figured I'd propose my favorite and see what came of it.
We would also have to re-write asarray to be an additional C-function to make it not copy but make array copy.
I thought so too at first, but I don't think this is so. Untested, and could probably be cleaned up some:

```python
def asarray(obj, order=None):
    if type(obj) == ndarray:
        if order:
            if not obj.flags.contiguous:
                return array(obj, order)
            if order == "C" and obj.flags.fortran:
                return array(obj, order)
            if order == "FORTRAN" and not obj.flags.fortran:
                return array(obj, order)
        return obj
    else:
        if order:
            return array(obj, order)
        else:
            return array(obj)
```

For asanyarray, simply replace the type test with an isinstance test.
So, for now I'm not as enthused about that idea.
Yeah. Without backward compatibility constraints I'm convinced that it's the right thing to do, but I realize there is a need to balance making the transition manageable with making things "perfect". -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
And this is now done... So, thankfully, the fortran= keyword is gone and replaced with the more sensible order= keyword. Tests for numpy pass, but any other code that used fortran= will need changing. Sorry about that... Thanks, -Travis
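The order= keyword introduced here survives in numpy to this day; a quick sketch:

```python
import numpy as np

# order= selects the memory layout of the new array
a = np.array([[1, 2], [3, 4]], order='F')
assert a.flags['F_CONTIGUOUS']

b = np.array([[1, 2], [3, 4]], order='C')
assert b.flags['C_CONTIGUOUS']
```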
![](https://secure.gravatar.com/avatar/25ef0a6698317c91220d6a1a89543df3.jpg?s=120&d=mm&r=g)
On 3/24/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
This looks very similar to the current ndarray "new" constructor:

| ndarray.__new__(subtype, shape=, dtype=int_, buffer=None,
|                 offset=0, strides=None, fortran=False)
|
| There are two modes of creating an array using __new__:
| 1) If buffer is None, then only shape, dtype, and fortran are used.
| 2) If buffer is an object exporting the buffer interface, then all
|    keywords are interpreted.
| The dtype parameter can be any object that can be interpreted as a
| numpy.dtype object.

(see pydoc numpy.ndarray)

I would not mind leaving array() unchanged and moving the discussion to streamlining ndarray.__new__. For example, some time ago I suggested that strides should be interpreted even if buffer=None.
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Alexander Belopolsky wrote:
It sure does.
That does look like a good place to hang any arbitrary strides creating stuff. I'm of the opinion that arguments should *never* be ignored, so I'm all for interpreting strides even when buffer is None. I'd also contend that offset should either be respected (by overallocating) or since that's probably useless, raising a ValueError when it's nonzero. Regards, Tim
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
We could always force keyword parameters by using **kwd. That would mess up help() and other introspection tools. Then again, if we want to discourage overuse of this sort of stuff, perhaps that's not a bad thing ;)

But if we really need access to the C guts of an array, just allow a dictionary of flags to get plugged in. These would be the same as what somearray.flags puts out. The new interface for array would be:

```python
array(object, **flags)
```

where flags could contain:

- CONTIGUOUS: force a copy if object isn't contiguous if True. [default None]
- FORTRAN: force array to be fortran order if True, C order if False. [default None]
- OWNDATA: force a copy if True. [default True]
- WRITEABLE: force a copy if object isn't writeable if True. [default None]
- ALIGNED: force a copy if object isn't aligned if True. [default None]
- UPDATEIFCOPY: set the UPDATEIFCOPY flag? [default ???]

With the exception of FORTRAN, and possibly UPDATEIFCOPY, it would be an error to set any of these flags to False (forcing an array to be discontiguous, for instance, makes no sense). That's a thin interface and it ties together with the flags parameter nicely. On the downside, it's a little weird, particularly using OWNDATA for copy, although it is logical once you think about it.

It also drops 'ndmin' and 'subok'. I wouldn't miss them, but I expect someone would squawk. You could shoehorn them into flags, but then you lose one of the chief virtues of this scheme, which is that it makes a strong connection between the constructor and the flags parameter. 'subok' should be pretty well taken care of by 'asanyarray', and it would be easy enough to create an auxiliary function to replicate the 'ndmin' functionality.

With this kind of narrow interface, it might make sense to allow the flags parameter on all of the asX functions. That makes for a nice uniform, easy to remember interface. I'm not positive I like this idea yet, but I thought it was interesting enough to throw into the ring anyway.
Tangentially, I noticed that I can set ALIGNED to False. Isn't that going to break something somewhere? Should the ALIGNED flag be readonly? -tim
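The **flags interface proposed above was never adopted, but a hypothetical sketch on top of today's numpy shows roughly how such a dispatch could look (array_with_flags and the subset of flag semantics here are illustrative assumptions, not a real API):

```python
import numpy as np

def array_with_flags(obj, **flags):
    """Hypothetical sketch of the proposed array(object, **flags) call,
    honoring a subset of the flag names that somearray.flags exposes."""
    a = np.asarray(obj)
    if flags.get('FORTRAN') is True and not a.flags['F_CONTIGUOUS']:
        a = np.asfortranarray(a)      # force fortran order
    elif flags.get('FORTRAN') is False and not a.flags['C_CONTIGUOUS']:
        a = np.ascontiguousarray(a)   # force C order
    if flags.get('CONTIGUOUS') and not a.flags['C_CONTIGUOUS']:
        a = np.ascontiguousarray(a)
    if flags.get('OWNDATA') and not a.flags['OWNDATA']:
        a = a.copy()                  # the "OWNDATA means copy" reading
    return a

x = np.arange(6).reshape(2, 3)
f = array_with_flags(x, FORTRAN=True)
assert f.flags['F_CONTIGUOUS']
```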
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Tangentially, I noticed that I can set ALIGNED to False. Isn't that going to break something somewhere? Should the ALIGNED flag be readonly?
Responding to the tangential topic :-) The ALIGNED flag can be set False because it allows one to test those sections of code that deal with misaligned data. I don't think it would break anything, because thinking that data is misaligned when it really is aligned only costs you in copy time. You can't set it TRUE if it's really mis-aligned, though, because thinking that data is aligned when it's really mis-aligned can cause segfaults on some platforms and slow code on others. -Travis
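The asymmetric toggle Travis describes is visible through setflags in modern numpy (modern spelling; align=True is only accepted when the underlying data really are aligned):

```python
import numpy as np

a = np.arange(3)
assert a.flags['ALIGNED']       # freshly allocated data is aligned

a.setflags(align=False)         # pretend it's misaligned: always allowed
assert not a.flags['ALIGNED']

a.setflags(align=True)          # allowed here because the data is aligned
assert a.flags['ALIGNED']
```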
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
quite true.
there's asarray() of course. My feeling is that functions that may or may not return a copy should be functions, like asarray(), that ONLY exist to ensure a particular invariant:

- ascontiguous()
- asarray()

I imagine there are others. What concerns me is functions like reshape() and ravel() that you might have all sorts of other reasons to use, but then can't ever know for sure if your method is going to be working with a copy or not.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT
(206) 526-6959 voice, (206) 526-6329 fax, (206) 526-6317 main reception
7600 Sand Point Way NE, Seattle, WA 98115
Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
There is the flatten() method which exists for precisely this reason (it *always* returns a copy). -Travis
![](https://secure.gravatar.com/avatar/25899bc1947e1ce40b16e55631c2c94a.jpg?s=120&d=mm&r=g)
On 3/23/06, Sasha <ndarray@mac.com> wrote:
I am just starting to use ma.array and would like to get some idea from those in the know of how close this is to reality. What percentage of functions designed for nd_arrays would work on a ma.array with no masked elements? That is, if you have data with missing values, but then remove the missing values, is it necessary to convert back to a standard nd_array? The statistical language R deals with missing data fairly well. There are a number of functions for dealing with missing values (fail, omit, exclude, pass). Furthermore, there is a relatively standard way for a function to handle data with missing values, via an na.action parameter which indicates which function to call. http://spider.stat.umn.edu/R/library/stats/html/na.action.html http://spider.stat.umn.edu/R/library/stats/html/na.fail.html It would be nice to have a similar set of functions (including the fill function) for numpy. These functions could return the object without change if it is not a masked array, and if it is a masked array, make the appropriate changes to return an nd_array or raise an exception. A simple standard for indicating a function's ability to handle masked data would be to include a mask_action parameter which holds or indicates a function for processing missing data. Also, are there any current plans to allow record type arrays to be masked? Thanks, Mike
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
I am posting a reply to my own post in the hope of generating some discussion of the original proposal. I am proposing to add a "filled" method to ndarray. This can be a pass-through, an alias to "copy", or a method to replace nans or some other type-specific values. This will allow code that uses "filled" to work on ndarrays without changes. On 3/22/06, Sasha <ndarray@mac.com> wrote:
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
In general, I'm skeptical of adding more methods to the ndarray object -- there are plenty already. In addition, it appears that both the method and function versions of filled are "dangerous" in the sense that they sometimes return the array itself and sometimes a copy. Finally, changing ndarray to support masked arrays feels a bit like the tail wagging the dog.

Let me throw out an alternative proposal. I will admit up front that this proposal is based on exactly zero experience with masked arrays, so there may be some stupidities in it, but perhaps it will lead to an alternative solution:

```python
def asUnmaskedArray(obj, fill_value=None):
    mask = getattr(obj, 'mask', False)
    if mask is False:
        return obj
    if fill_value is None:
        fill_value = obj.get_fill_value()
    newobj = obj.data().copy()
    newobj[mask] = fill_value
    return newobj
```

Or something like that anyway. This particular version should work on any array-like object as long as, if it exports a mask attribute, it also exports get_fill_value and data -- at least once any bugs are ironed out; I haven't tested it. ma would have to be modified to use this instead of using filled everywhere, but that seems more appropriate than tacking on another method to ndarray, IMO.

One advantage of this approach is that most array-like objects that don't subclass ndarray will work with this automagically. If we keep expanding the methods of ndarray, it's harder and harder to implement other array-like objects, since they have to implement more and more methods, most of which are irrelevant to their particular case. The more we can implement stuff like this in terms of some relatively small set of core primitives, the happier we'll all be in the long run. This also builds on the idea of trying to push as much of the array/view ambiguity into the asXXXArray corner.

Regards,

-tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
I've also proposed to drop "fill" in favor of optimizing x[...] = <scalar>. Having both "fill" and "filled" in the interface is plain awkward. You may like the combined proposal better because it does not change the total number of methods :-) In addition, it appears that both the method and function versions of
filled are "dangerous" in the sense that they sometimes return the array itself and sometimes a copy.
This is true in ma, but may certainly be changed.
Finally, changing ndarray to support masked array feels a bit like the tail wagging the dog.
I disagree. Numpy is pretty much alone among the array languages because it does not have "native" support for missing values. For the floating point types some rudimentary support for nans exists, but it is not really usable. There is no missing-value mechanism for integer types. I believe adding "filled" and maybe "mask" to ndarray (not necessarily under these names) could be a meaningful step towards "native" support for missing values.
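[Editor's note: the "rudimentary" NaN support Sasha mentions can be illustrated with today's numpy; this is a sketch with the modern API, not code from the thread.]

```python
import numpy as np

# Floating point arrays can mark missing entries with NaN, but support is thin:
x = np.array([1.0, np.nan, 3.0])
plain = x.sum()          # NaN propagates through the plain reduction
skipped = np.nansum(x)   # the NaN-aware variant skips the missing value
has_missing = np.isnan(x).any()
# Integer dtypes have no NaN at all, which is Sasha's point about integers.
```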
![](https://secure.gravatar.com/avatar/7e9e53dbe9781722d56e308c32387078.jpg?s=120&d=mm&r=g)
Sasha wrote:
I agree strongly with you, Sasha. I get the impression that the world of numerical computation is divided into those who work with idealized "data", where nothing is missing, and those who work with real observations, where there is always something missing. As an oceanographer, I am solidly in the latter category. If good support for missing values is not built in, it has to be bolted on, and it becomes clunky and awkward. I was reluctant to speak up about this earlier because I thought it was too much to ask of Travis when he was in the midst of putting numpy on solid ground. But I am delighted that missing value support has a champion among numpy developers, and I agree that now is the time to change it from "bolted on" to "integrated". Eric
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Eric Firing wrote:
I think your experience is clouding your judgement here. Or at least this comes off as unnecessarily pejorative. There's a large class of people who work with data that doesn't have missing values, either because of the nature of data acquisition or because they're doing simulations. I take zillions of measurements with digital oscilloscopes and they *never* have missing values. Clipped values, yes, but even if I somehow could query the scope about which values were actually clipped, or simply make an educated guess based on their value, the facilities of ma would be useless to me. The clipped values are what I would want in any case. I also do a lot of work with simulations derived from this and other data. I don't come across missing values here, but again, if I did, the way ma works would not help me. I'd have to treat them either by rejecting the data outright or by some sort of interpolation.
This may be a false dichotomy. It's certainly not obvious to me that this is so. At least if "bolted on" means "not adding a filled method to ndarray".
I have no objection to ma support improving. In fact I think it would be great, although I don't foresee it helping me anytime soon. I also support Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably seamlessly. However, I do think the situation needs more thought. Slapping filled and mask onto ndarray is the path of least resistance, but it's not clear that it's the best one. If we do decide we are going to add both of these methods to ndarray (with filled returning a copy!), then it may be worth considering making ndarray a subclass of MaskedArray. Conceptually this makes sense, since at this point an ndarray would just be a MaskedArray where mask is always False. I think that they could share much of the implementation, except that ndarray would be set up to use methods that ignored the mask attribute since they would know that it's always False. Even that might not be worth it, since the check for whether mask is True/False is just a pointer compare. It may in fact be best just to do away with MaskedArray entirely, moving the functionality into ndarray. That may have performance implications, although I don't see them at the moment, and I don't know if there are other methods/attributes that this would imply need to be moved over, although it looks like just mask, filled and possibly fill_value, although the latter looks a little dubious to me. Either of the above two options would certainly improve the quality of MaskedArray. Copy, for instance, seems not to have been implemented, and who knows what other dark corners remain unexplored here. There's a whole spectrum of possibilities here, from ones that don't intrude on ndarray at all to ones that profoundly change it. Sasha's suggestion looks like it's probably the simplest thing in the short term, but I don't know that it's the best long term solution. I think it needs more thought and discussion, which is after all what Sasha asked for ;) Regards, -tim
![](https://secure.gravatar.com/avatar/7e9e53dbe9781722d56e308c32387078.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
Tim, The point is well-taken, and I apologize. I stated my case badly. (I would be delighted if I did not have to be concerned with missing values; they are a pain regardless of how well a numerical package handles them.)
I probably overstated it, but I think we actually agree. I intended to lend support to the priority of making missing-value support as seamless and painless as possible. It will help some people, and not others.
This is exactly the option that I was afraid to bring up because I thought it might be too disruptive, and because I am not contributing to numpy, and probably don't have the competence (or time) to do so.
Exactly! Thank you for broadening the discussion. Eric
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
Completely agree. I have many gripes about the current ma implementation of both "filled" and "mask".

filled:
1. I don't like the default fill value. It should be mandatory to supply a fill value.
2. It should return a masked array (with trivial mask), not ndarray.
3. The name conflicts with the "fill" method.
4. View/copy inconsistency. Does not provide a method to fill values in-place.

mask:
1. I've got rid of mask returning None in favor of False_ (boolean array scalar), but it is still not perfect. I would prefer a data.shape == mask.shape invariant and, if space saving/performance is deemed necessary, the use of zero-stride arrays.
2. I don't like the name. "Missing" or "na" would be better.
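[Editor's note: Sasha's first two gripes are easy to demonstrate with today's numpy.ma; the behavior shown is the modern one, which still matches his description.]

```python
import numpy as np

a = np.ma.array([1, 2, 3], mask=[False, True, False])
implicit = a.filled()   # gripe 1: falls back to a default fill_value (999999 for ints)
explicit = a.filled(0)  # Sasha would make supplying the value mandatory
# gripe 2: the result is a plain ndarray, not a masked array with a trivial mask
```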
The tail becoming the dog! Yet I agree, this makes sense from the implementation point of view. From OOP perspective this would make sense if arrays were immutable, but since mask is settable in MaskedArray, making it constant in the subclass will violate the substitution principle. I would not object making mask read only, however.
I think MA can coexist with ndarray and share the interface. Ndarray can use special bit-patterns like IEEE NaN to indicate missing floating point values. Add-on modules can redefine arithmetic to make INT_MIN behave as a missing marker for signed integers (R, K and J (I think) languages use this approach). Applications that need missing values support across the board will use MA.
More (corners) than you want to know about! Reimplementing MA in C would be a worthwhile goal (and what you suggest seems to require just that), but it is too big of a project. I suggest that we focus on the interface first. If existing MA interface is rejected (which is likely) for ndarray, we can easily experiment with the alternatives within MA, which is pure python.
Exactly!
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
3. The name conflicts with the "fill" method. fillmask ? clog ?
Er... How many of us are using MA on a regular basis ? Aren't we a minority ? It'd seem wiser to adapt MA to numpy, in Python (but maybe that's the XIXe French integration model I grew up with that makes me talk here...)
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
That makes perfect sense. If anything should have a default fill value, it's the function calling filled, not the arrays themselves.
2. It should return masked array (with trivial mask), not ndarray.
So, just with mask = False? In a follow-on message Pierre disagrees and claims that what you really want is the ndarray, since not everything will accept a masked array. Then I guess you'd need to call b.filled(fill).data. I agree with Sasha in principle but with Pierre, perhaps, in practice. I almost suggested it get renamed a.asndarray(fill), except that asXXX has the wrong connotations. I think this one needs to bounce around some more.
3. The name conflicts with the "fill" method.
I thought you wanted to kill that. I'd certainly support that. Can't we just special case __setitem__ for that one case so that the performance is just as good if performance is really the issue?
4. View/Copy inconsistency. Does not provide a method to fill values in-place.
b[b.mask] = fill_value; b.unmask() seems to work for this purpose. Can we just have filled return a copy?
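[Editor's note: today's numpy.ma has no `unmask()` method; a rough modern equivalent of the two-step recipe above, assuming that resetting `.mask` to `nomask` stands in for the old `unmask()` call.]

```python
import numpy as np

b = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
b[b.mask] = 99.0       # overwrite the masked slots in place
b.mask = np.ma.nomask  # stand-in for the old ma "unmask()" step
```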
Interesting idea. Is that feasible yet?
2. I don't like the name. "Missing" or "na" would be better.
I'm not on board here, although really I'd like to hear from other people who use the package. 'na' seems too cryptic to me and 'missing' too specific -- there might be other reasons to mask a value than it being missing. The problem with mask is that it's not clear whether True means the data is useful or unuseful. Keep throwing out names, maybe one will stick.
How do you set the mask? I keep getting attribute errors when I try it. And unmask would be a noop on an ndarray.
Perhaps MaskedArray should inherit from ndarray for the time being. Many of the methods would need to be reimplemented anyway, but it would make asanyarray work. Someone was just complaining about asarray munging his arrays. That's correct behaviour, but it would be nice if asanyarray did the right thing. I suppose we could just special case asanyarray to ignore MaskedArrays; that might be better since it's less constraining from an implementation side too.
This may be an opportune time to propose something that's been cooking in the back of my head for a week or so now: a stripped down array superclass. The details of this are not at all locked down, but here's a strawman proposal. We add an array superclass, call it basearray, that has the same C-structure as the existing ndarray. However, it has *no* methods or attributes. It's simply a big blob of data. Functions that work on the C structure of arrays (ufuncs, etc.) would still work on these arrays, as would asarray, so it could be converted to an ndarray as necessary. In addition, we would supply a minimal set of functions that would operate on this object. These functions would be chosen so that the current array interface could be implemented on top of them and the basearray object in pure python. These functions would be things like set_shape(a, shape), etc. They would be segregated off in their own namespace, not in the numpy core. [Note that I'm not proposing we actually implement ndarray this way, just that we make it possible]. This leads to several useful outcomes. 1. If we're careful, this could be the basic array object that we propose, at least for the first round, for inclusion in the Python core. It's not useful for anything but passing data between various applications that understand the data structure, but that in itself could be a huge win. And the fact that it's dirt simple would probably be an advantage to getting it into the core. 2. It provides a useful marker class. MA could inherit from it (and use it for its data attribute) and then asanyarray would behave properly. MA could also use this, or a subclass, as the mask object, preventing anyone from accidentally using it as data (they could always use it on purpose with asarray). 3. It provides a platform for people to build other, ndarray-like classes in pure python. This is my main interest.
I've put together a thin shell over numpy that strips it down to its absolute essentials, including a stripped down version of ndarray that removes most of the methods. All of the __array_wrap__[1] stuff works quite well most of the time, but there are still some issues with being a subclass when this particular class is conceptually a superclass. If we had an array superclass of some sort, I believe that these would be resolved. In principle at least, this shouldn't be that hard. I think it should mostly be rearranging some code and adding some wrappers to existing functions. That's in principle. In practice, I'm not certain yet, as I haven't investigated the code in question in much depth. I've been meaning to write this up into a more fleshed out proposal, but I got distracted by the whole Protocol discussion on python-dev3000. This writeup is pretty weak, but hopefully you get the idea. Anyway, this is something that I would be willing to put some time into that would benefit both me and probably the MA folks as well. Regards, -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
Just for the record. Currently MA does not inherit from ndarray. There are some benefits to be gained from changing MA design from containment to inheritance, but I am very skeptical about the use of inheritance in the array setting.
This is a very worthwhile idea and I hate to see it burried in a non-descriptive thread. I've copied your proposal to the wiki at <http://projects.scipy.org/scipy/numpy/wiki/ArraySuperClass>.
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Sasha wrote:
Right, I checked that. That's why asanyarray won't work now with MA (unless someone changed the implementation of that while I wasn't looking).
That's probably a sensible position. Still it would be nice to have asanyarray pass masked arrays through somehow. I haven't thought this through very well, but I wonder if it would make sense for asanyarray to pass any object that supplies __array__. I'm leery of special casing asanyarray just for MA; somehow that seems the wrong approach.
Thanks for doing that. I'm glad you like the general idea. I do plan to write it through and try to get a better handle on what this would entail and what the consequences would be. However, I'm not sure exactly when I'll get around to it so it's probably better that a rough draft be out there for people to think about in the interim. -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Tim Hochberg <tim.hochberg@cox.net> wrote:
It looks like we are getting close to a consensus on this one. I will remove fill_value attribute. [...]
I'll propose a patch.
+1
It is not feasible in a pure python module like ma, but easy in ndarray. We can also reset the writeable flag to avoid various problems that zero strides may cause. I'll propose a patch.
The problem with the "mask" name is that ndarray already has unrelated "putmask" method. On the other hand putmask is redundant with fancy indexing. I have no other problem with "mask" name, so we may just decide to get rid of "putmask".
[...] How do you set the mask? I keep getting attribute errors when I try it.
a[i] = masked makes i-th element masked. If mask is an array, you can just set its elements.
And unmask would be a noop on an ndarray.
Yes. [...]
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
Well, if 'mask' became a default argument of ndarray, that wouldn't be a problem any longer. I'm quite for that.
tondarray(fill) ?
Yes !
The problem with mask is that it's not clear whether True means the data is useful or unuseful.
I have to think twice every time I create a mask: for MA, True means in fact that I don't want the data, whereas True selects the data for ndarray...
"putmask" really seems overkill indeed. I wouldn't miss it.
How do you set the mask? I keep getting attribute errors when I try it. And unmask would be a noop on an ndarray.
I've implemented something like that for some classes (inheriting from MA.MaskedArray). Never really used it yet, though.

#--------------------------------------------
def applymask(self, m):
    if not MA.is_mask(m):
        raise MA.MAError, "Invalid mask !"
    elif self._data.shape != m.shape:
        raise MA.MAError, "Mask and data not compatible."
    else:
        self._dmask = m
That'd be great indeed, and may solve some problems reported on the list about subclassing ndarray. AAMOF, I gave up trying to use ndarray as a superclass, and rely only on MA.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Pierre GM wrote:
decide to get rid of "putmask".
"putmask" really seems overkill indeed. I wouldn't miss it.
I'm not opposed to getting rid of putmask either. Several of the newer methods are open for discussion before 1.0. I'd have to check to be sure, but .take and .put are not entirely replaced by fancy-indexing. Also, fancy indexing has enough overhead that a method doing exactly what you want is faster. -Travis
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
This is in essence what I've been proposing since SciPy 2005. I want what goes into Python to be essentially just this super-class. Look at this http://numeric.scipy.org/array_interface.html and check out this svn co http://svn.scipy.org/svn/PEP arrayPEP I've obviously been way over-booked to do this myself. Nick Coughlan expressed interest in this idea (he called it dimarray, but I like basearray better).
Why not give it the attributes corresponding to its C-structure? I'm happy with no methods though.
The only extra thing I'm proposing is to add the data-descriptor object into the Python core as well --- other-wise what do you do with PyArray_Descr * part of the C-structure?
This is exactly what needs to be done to improve array-support in Python. This is the conclusion I came to and I'm glad to see that Tim is now basically having the same conclusion. There are obviously some details to work out. But, having a base structure to inherit from would be perfect. -Travis
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
I'll look these over. I suppose I should have been paying more attention before!
Mainly because I didn't want to argue too much about whether a given method or attribute was a good idea, and I was in a hurry when I tossed that proposal out. It seemed better to start with the most stripped down proposal I could come up with and see what people demanded I add. I'm actually sort of inclined to give it *read-only* attributes corresponding to the C-structure, but no methods. That way you can examine the shape, type, etc., but you can't set them [I'm specifically thinking of shape here, but there may be others]. I think that there are cases where you don't want the base array to be mutable at all, but I don't think introspection should be a problem. If the attributes were settable, you could always override them with readonly properties, but it'd be cleaner to just start with readonly functionality and add settability (is that a word?) only in those cases where it's needed.
Good point.
Hmm. This idea seems to have a fair bit of consensus behind it. I guess that means I'd better look into exactly what it would take to make it work. The details of what attributes to expose, etc. are probably not too important to work out immediately. Regards, -tim
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
Folks, I'm more or less in Eric's field (hydrology), and we do have to deal with missing values that we can't interpolate straightforwardly (that is, without some dark statistical magic). Purely discarding the data is not an option either. MA fills the need, most of it. I think one of the issues is what is meant by 'masked data':
- a missing observation?
- a NaN?
- data we don't want to consider at one particular point?
For the last point, think about raster maps or bitmaps: calculations should be performed on a chunk of data, the initial data left untouched, and the result should both have the same size as the original and be valid only on the initial chunk. The current MA implementation, with its _data part and its _mask part, works nicely for the 3rd point.
- I wonder whether implementing a 'filled' method for ndarrays is really better than letting the user create a MaskedArray where the NaNs are masked. In any case, a 'filled' method should always return a copy, as it's no longer the initial data.
- I'm not sure what to do with the idea of making ndarray a subclass of MA. On one side, Tim pointed out rightly that an ndarray is just a MA with a 'False' mask. Actually, I'm a bit frustrated with the standard 'asarray' that shows up in many functions. I'd prefer something like "if the argument is a non-numpy sequence (tuples, lists), transform it into an ndarray, but if it's already an ndarray or a MA, leave it as it is. Don't touch the mask if present". That's how MA.asarray works, but unfortunately the std "asarray" gets rid of the mask (and you end up with something which is not what you'd expect). A 'mask=False' attribute in ndarray would be nice. On the other, some methods/functions make sense only on unmasked ndarrays (FFT, solving equations), and some others are a bit tricky to implement (diff? median...).
Some exception could be raised if the arguments of these functions return True with ismasked (cf. below), or that could be simplified if 'mask' was a default attribute of ndarrays. I regularly have to use an ismasked function (cf. below).

def ismasked(a):
    if hasattr(a, 'mask'):
        return a.mask.any()
    else:
        return False

We're going towards MA as the default object. But then again, what would be the behavior to deal with missing values? Using R-like na.actions? That'd be great, but it's getting more complex. Oh, and another thing: if 'mask', or 'masked', becomes a default attribute of ndarrays, how do we define a mask? As a boolean ndarray whose 'mask' is always 'False'? How do you __repr__ it?
- I agree that 'fill_value' is not very useful. If I want to fill an array, I'm happy to specify what value I want it filled with. In fact, I'd be happier to specify 'values'. I often have to work with 2D arrays, each column representing a different variable. If such an array has to be filled, I'd like each column to be filled with one particular value, not necessarily the same along all columns: something like
column_stack([A[:,k].filled(filler[k]) for k in range(A.shape[1])])
with filler a 1 x A.shape[1] array of filling values. Of course, we could imagine the same thing for rows, or higher dimensions... Sorry for the rants...
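[Editor's note: Pierre's helper works essentially unchanged against today's numpy.ma; a tested sketch, using `nomask` as the safe default for plain ndarrays.]

```python
import numpy as np

def ismasked(a):
    # True if `a` carries a mask with at least one masked element;
    # plain ndarrays fall through to the nomask default.
    mask = getattr(a, "mask", np.ma.nomask)
    return bool(np.any(mask))
```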
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/7/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
... We're going towards MA as the default object.
I will be against changing the array structure to handle missing values. Let's keep the discussion focused on the interface. Once we agree on the interface, it will be clear if any structural changes are necessary.
But then again, what would be the behavior to deal with missing values ?
We can postpone this discussion as well. Just add mask attribute that returns False and filled method that returns a copy is an example of a minimalistic change.
Using R-like na.actions ? That'd be great, but it's getting more complex.
I don't like na.actions. I think missing values should behave like IEEE NaNs and in the floating point case should be represented by NaNs. The functionality provided by na.actions can always be achieved by calling an extra function (filled or compress).
See above. For ndarray mask is always False unless an add-on module is loaded that redefines arithmetic to recognize special bit-patterns such as NaN or INT_MIN.
![](https://secure.gravatar.com/avatar/38d5ac232150013cbf1a4639538204c0.jpg?s=120&d=mm&r=g)
Hi, On 4/7/06, Sasha <ndarray@mac.com> wrote:
I think that the usage of MA is important because this often dictates the interface. The other aspect is the penalty that is imposed by requiring masked features, especially in situations that don't need any of these features.
I think the issue is related to how masked values should be handled in computation. Does it matter if the result of an operation is due to a masked value or a numerical problem (like dividing by zero)? (I am presuming that it is possible to identify this difference.) If not, then I support the idea of treating masked values as NaN.
The functionality provided by na.actions can always be achieved by calling an extra function (filled or compress).
I am not clear on what you actually mean here. For example, if you are summing across a particular dimension, I would presume that any masked value would be ignored and that there would be some record of the fact that a masked value was encountered. This would allow that 'extra function' to handle the associated result. Alternatively, the 'extra function' would have to be included as an argument - which is what the na.actions do. Regards Bruce
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Bruce Southey <bsouthey@gmail.com> wrote:
The IEEE standard provides plenty of spare bits in NaNs to represent pretty much everything, and some languages take advantage of that feature. (I believe NA and NaN are distinct in R.) In MA, however, mask elements are boolean and no distinction is made between various reasons for not having a data element. For consistency, a non-trivial (not always false) implementation of ndarray.mask should return "not finite" and ignore the bits that distinguish NaNs and infinities.
If you sum along a particular dimension and encounter a masked value, the result is masked. The same is true if you encounter a NaN - the result is NaN. If you would like to ignore masked values, you write a.filled(0).sum() instead of a.sum(). In 1d case, you can also use a.compress().sum(). In other words, what in R you achieve with a flag, such as in sum(a, na.rm=TRUE), in numpy you achieve by an explicit call to "fill". This is not quite the same as na.actions in R, but that is what I had in mind.
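[Editor's note: Sasha's recipe in today's spelling; numpy.ma's 1-d method is `compressed()`, which I'm assuming is what "a.compress()" refers to above.]

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 4.0], mask=[False, True, False])
# Explicitly decide what masked values mean before reducing:
total = a.filled(0).sum()        # treat masked entries as 0
total_1d = a.compressed().sum()  # 1-d alternative: drop masked entries entirely
```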
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
If you sum along a particular dimension and encounter a masked value, the result is masked.
That's not how it currently works (still on 0.9.6):

>>> x = arange(12).reshape(3,4)
>>> MA.masked_where((x%5==0) | (x%3==0), x).sum(0)
array(data = [12 1 2 18],
      mask = [False False False False],
      fill_value=999999)

and frankly, I'd be quite frustrated if it had to change:
- `filled` is not an ndarray method, which means that a.filled(0).sum() fails if a is not a MA. Right now, I can use a.sum() without having to check the nature of a first.
- this behavior was already in Numeric
- All my scripts rely on it (but I guess that's my problem)
- The current way reflects how masks are used in GIS or image processing.
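[Editor's note: Pierre's 0.9.6 example still reproduces under today's numpy.ma (the old `MA` module is now `numpy.ma`); a quick check.]

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
m = np.ma.masked_where((x % 5 == 0) | (x % 3 == 0), x)
col_sums = m.sum(0)  # masked entries are skipped per column, as in old MA
```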
Once again, Sasha, I'd agree with you if it wasn't a major difference
I kinda like the idea of a flag, though
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
>>> ma.array([1,1], mask=[0,1]).sum()
1
This is exactly the point of the current discussion: make filled a method of ndarray. With the current behavior, how would you achieve masking (no fill) a.sum()?
- this behavior was already in Numeric
That's true, but it makes the result of sum(a) different from __builtins__.sum(a). I believe consistency with the python conventions is more important than with legacy Numeric in the long run.
[...]
- The current way reflects how mask are used in GIS or image processing.
Can you elaborate on this? Note that in R na.rm is false by default in sum:
> sum(c(1,NA))
[1] NA
So it looks like the convention is different in the field of statistics.
Array methods are a very recent addition to ma. We can still use this window of opportunity to get things right before too many people get used to the wrong behavior. (Note that I changed your implementation of cumsum and cumprod.)
With the flag approach, making the ndarray and ma.array interfaces consistent would require adding an extra argument to many methods. Instead, I propose to add one method, filled, to ndarray.
![](https://secure.gravatar.com/avatar/7fdc9b298a3db0dda9ab44a306959baa.jpg?s=120&d=mm&r=g)
>>> MA.array([[1,1],[1,1]], mask=[[0,1],[1,0]]).sum()
array(data = [1 1],
      mask = [False False],
      fill_value=999999)
>>> MA.array([[1,1],[1,1]], mask=[[0,1],[1,1]]).sum()
array(data = [1 999999],
      mask = [False True],
      fill_value=999999)

With a.filled(0).sum(), how would you distinguish between the cases (a) at least one value is not masked and (b) all values are masked? (OK, by querying the mask with something along the lines of a._mask.all(axis), but it's longer... Oh well, I'll just have to adapt.)
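[Editor's note: Pierre's all-masked case and his mask-query workaround, sketched with today's numpy.ma API (`MA` is now `numpy.ma`, and the public `mask` attribute replaces `_mask`).]

```python
import numpy as np

a = np.ma.array([[1, 1], [1, 1]], mask=[[0, 1], [1, 1]])
col_sums = a.sum(0)              # second column is entirely masked
all_masked = a.mask.all(axis=0)  # Pierre's query: which columns had no data at all
```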
Good points... We'll just have to put strong warnings everywhere.
MMh. *digs in his old GRASS scripts* OK, my bad. I had to fill missing values somehow, or at least check whether there were any before processing. I'll double check on that. Please temporarily forget that comment.
On a semantic aspect: While digging these GRASS scripts I mentioned, I realized/remembered that masked values are called 'null', when there's no data, a NAN, or just when you want to hide some values. What about 'null' instead of 'mask','missing','na' ?
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
Pierre GM wrote:
Any number of reasons, I would think. It depends on what you're using the data for. If the sum is the total amount that you spent in the month, and a masked value means you lost that check stub, then you don't know how much you actually spent and that value should be masked. To choose a boring example.
Actually I'm going to ask you the same question. Why would you care if all of the values are masked? I may be missing something, but either there's a sensible default value, in which case it doesn't matter how many values are masked, or you can't handle any masked values and the result should be masked if there are any masks in the input. Sasha's proposal handles those two cases well. Your behaviour handles them a little more clunkily, but I'd like to understand why you want that behaviour. Regards, -tim
![](https://secure.gravatar.com/avatar/6c32e3d6cb67dac69ef7a3504b187a7c.jpg?s=120&d=mm&r=g)
OK, now I get it :)
I understand that, and I eventually agree it should be the default.
Masked values are not necessarily nans or missing. I quite regularly mask values that do not satisfy a given condition. For various reasons, I can't compress the array; I need to preserve its shape. With the current behavior, a.sum() gives me the sum of the values that satisfy the condition. If there's no such value, the result is masked, and that way I know that the condition was never met. Here, I could use Sasha's method combined with a._mask.all, no problem. Another example: let x be a 2D array with missing values, to be normalized along one axis. Currently, x/x.sum() gives the result I want (provided it's true division). Sasha's method would give me a completely masked array.
Your points are quite valid. I'm just worried it's gonna break a lot of things in the near future. And where do we stop? So, if we follow Sasha's way: x.prod() should be the same, right? What about a.min(), a.max()? a.mean()?
![](https://secure.gravatar.com/avatar/38d5ac232150013cbf1a4639538204c0.jpg?s=120&d=mm&r=g)
Hi, My view is solely as a user, so I really do appreciate the thought that you all are putting into this! I am somewhat concerned that having to use filled() is an extra level of complexity and computational burden. For example, computing the mean/average using filled would require one pass to get the sum and another to count the non-masked elements. For summation at least, would it make more sense to add an optional flag (or flags) so that there appears to be little difference between a normal array and a masked array? For example: a.sum() is the current default; a.sum(filled_value=x), where x is some value such as zero or another user-defined value; a.sum(ignore_mask=True) or similar to address whether or not masked values should be used. I am also not clear on what happens with other operations or dimensions. Regards Bruce On 4/10/06, Pierre GM <pierregm@engr.uga.edu> wrote:
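The two-pass mean Bruce describes looks like this in today's numpy.ma (a sketch with modern names; the 2006 API differed in details):

```python
import numpy as np

a = np.ma.masked_array([2.0, 4.0, 6.0, 8.0], mask=[False, True, False, False])

# Pass 1: sum with masked slots treated as zero.
total = a.filled(0.0).sum()

# Pass 2: count the non-masked elements.
n = a.count()

mean = total / n
```

For this input `total` is 16.0 and `n` is 3, so `mean` is 16/3; Bruce's point is that a keyword argument on sum() could collapse the two passes into one call.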
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
As I understand it, the goal that Sasha is pursuing here is to make masked arrays and normal arrays interchangeable as much as practical. I believe that there is reasonable consensus that this is desirable. Sasha has proposed a compromise solution that adds minimal attributes to ndarray while allowing a lot of interoperability between ma and ndarray. However, it has its clunky aspects, as evidenced by the pushback he's been getting from masked array users. Here's one example. In the masked array context it seems perfectly reasonable to pass a fill value to sum. That is: x.sum(fill=0.0) But if you want to preserve interoperability, that means you have to add fill arguments to all of the ndarray methods, and what do you have? A mess! Particularly if some *other* package comes along that we decide is important to support in the same manner as ma. Then we have another set of methods or keyword args that we need to tack onto ndarray. Ugh! However, I know who, or rather what, to blame for our problems: the object-oriented hype industry in general and Java in particular <0.1 wink>. Why? Because the root of the problem here is the move from functions to methods in numpy. I appreciate a nice method as much as the next person, but they're not always better than the equivalent function, and in this case they're worse. Let's fantasize for a minute that most of the methods of ndarray vanished and instead we went back to functions. Just to show that I'm not a total purist, I'll let the mask attribute stay on both MaskedArray and ndarray. However, filled bites the dust on *both* MaskedArray and ndarray, just like the rest. How would we deal with sum then?
Something like this:

```python
# ma.py
def filled(x, fill):
    x = x.copy()
    if x.mask is not False:
        x[x.mask] = fill
        x.umask()
    return x

def sum(x, axis, fill=None):
    # I'm blowing off the correct treatment of the fill=None case
    # here because I'm lazy
    if fill is not None:
        x = filled(x, fill)
    return add.reduce(x, axis)

# numpy.py (or __init__ or oldnumeric or something)
def sum(x, axis):
    if x.mask is not False:
        raise ValueError("use ma.sum for masked arrays")
    return add.reduce(x, axis)
```

[Fixing the fill=None case and dealing correctly with dtype is left as an exercise for the reader.] All of a sudden, all of the problems we're running into go away. Users of masked arrays simply use the functions from ma and can use ndarrays and masked arrays interchangeably. On the other hand, users of non-masked arrays aren't burdened with the extra interface, and if they accidentally get passed a masked array they quickly find out about it (you don't want to be accidentally using masked arrays in an application that doesn't expect them -- that way lies disaster). I realize that railing against methods is tilting at windmills, but somehow I can't help myself ;-| Regards, -tim
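Tim's sketch assumes a hypothetical mask attribute and umask() method on every array; a runnable approximation of the same function-based design, written against today's numpy.ma (my translation, not Tim's code), might look like:

```python
import numpy as np

def ma_filled(x, fill):
    # Copy the data, replacing masked slots with `fill`; the
    # result is a plain ndarray with no mask.
    return np.ma.filled(x, fill)

def ma_sum(x, axis=None, fill=None):
    # The ma-aware sum: an explicit fill value opts in to
    # treating masked slots as that value.
    if fill is not None:
        x = ma_filled(x, fill)
    return np.add.reduce(np.asanyarray(x), axis=axis)

def strict_sum(x, axis=None):
    # The non-masked variant refuses masked input outright,
    # so masked arrays can't sneak into unaware code.
    if np.ma.is_masked(x):
        raise ValueError("use ma_sum for masked arrays")
    return np.add.reduce(np.asarray(x), axis=axis)

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
```

With this split, `ma_sum(a, fill=0.0)` returns 4.0, while `strict_sum(a)` raises, which is exactly the fail-fast behavior Tim argues for.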
![](https://secure.gravatar.com/avatar/55f7acf47233a7a98f5eb9dfd0b2d763.jpg?s=120&d=mm&r=g)
[Tim rants a lot] Just to be clear, I'm not advocating getting rid of methods. I'm not advocating anything; that just seems to get me into trouble ;-) I still blame Java, though. Regards, -tim
![](https://secure.gravatar.com/avatar/837d314801b4f1400d6eabc767ca2cac.jpg?s=120&d=mm&r=g)
On 4/10/06, Pierre GM <pgmdevlist@mailcan.com> wrote:
I am just making your point with a shorter example.
It looks like there is little opposition here. I'll submit a patch soon and unless better names are suggested, it will probably go in.
With the current behavior, how would you achieve masking (no fill) with a.sum()? Er, why would I want to get MA.masked along one axis if one value is masked?
Because if you don't know one of the addends you don't know the sum. Replacing missing values with zeros is not always the right strategy. If you know that your data has non-zero mean, for example, you might want to replace missing values with the mean instead of zero.
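Sasha's point about the choice of fill value is easy to see with today's numpy.ma (a modern illustration; the variable names are mine):

```python
import numpy as np

a = np.ma.masked_array([2.0, 6.0, 4.0], mask=[False, True, False])

# Filling with zero drags the sum and mean toward zero; filling
# with the mean of the observed values is often the better choice.
filled_zero = a.filled(0.0)        # masked slot becomes 0.0
filled_mean = a.filled(a.mean())   # masked slot becomes the observed mean
```

Here the observed mean is 3.0, so `filled_mean` is [2., 3., 4.] while `filled_zero` is [2., 0., 4.]; which replacement is "right" depends entirely on the data, which is why the caller should choose explicitly.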
I did not realize that, but it is really bad. What is the justification for this? In R:
```
> sum(c(NA, NA), na.rm=TRUE)
[1] 0
```
What does MATLAB do in this case?
Exactly. Explicit is better than implicit. The Zen of Python <http://www.python.org/dev/peps/pep-0020>.
Do you agree with my proposal as long as we have explicit warnings in the documentation that methods behave differently from legacy functions?
[... GIS comment snipped ...]
I don't think "null" returning an array of bools will create a lot of enthusiasm. It sounds more like ma.masked as in a[i] = ma.masked. Besides, there is probably a reason why python uses the name "None" instead of "Null" - I just don't know what it is :-).
![](https://secure.gravatar.com/avatar/25899bc1947e1ce40b16e55631c2c94a.jpg?s=120&d=mm&r=g)
On 4/11/06, Sasha <ndarray@mac.com> wrote:
I feel that in general, implicitly replacing masked values will definitely lead to bugs in my code. Unless it is really obvious what the best way to deal with the masked values is for the particular function, I would definitely prefer to be explicit about it. In most cases there are a number of reasonable options for what can be done. Masking the result when masked values are involved seems the most transparent default option. For example, it gives me a really bad feeling to think that sum will automatically return the sum of all non-masked values. When dealing with large datasets, I will not always know when I need to be careful of missing values. Summing over only the non-masked values will often not be the appropriate course, and I fear that I will not notice that this has actually occurred. If masked values are returned, it is pretty obvious what has happened, and it is easy to go back and explicitly handle the masked data in another way if appropriate. Mike
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Sasha wrote:
Supporting missing values is a useful thing (but not for every usage of arrays). Thus, ultimately, I see missing-value arrays as a solid sub-class of the basic array class. I'm glad Sasha is working on missing value arrays and have tried to be supportive. I'm a little hesitant to add a special-case method basically for one particular sub-class, though, unless it is the only workable solution. We are still exploring this whole sub-class space and have not really mastered it... -Travis
![](https://secure.gravatar.com/avatar/ccb440c822567bba3d49d0ea2894b8a1.jpg?s=120&d=mm&r=g)
In article <d38f5330604071219j6a5adbdw4a300ed10a26a445@mail.gmail.com>, Sasha <ndarray@mac.com> wrote:
I completely agree with this. I would really like to see proper native support for arrays with masked values in numpy (such that all ufuncs, functions, etc. work with masked arrays). I would be thrilled to be able to filter masked arrays, for instance. -- Russell
participants (13)
-
Alexander Belopolsky
-
Bruce Southey
-
Christopher Barker
-
Colin J. Williams
-
Eric Firing
-
Fernando Perez
-
Michael Sorich
-
Pierre GM
-
Pierre GM
-
Russell E. Owen
-
Sasha
-
Tim Hochberg
-
Travis Oliphant