[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 17:09:16 EDT 2011

On Fri, Jun 24, 2011 at 10:40 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:

> On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>>
>>> Sorry y'all, I'm just commenting bits by bits:
>>>
>>> "One key problem is a lack of orthogonality with other features, for
>>> instance creating a masked array with physical quantities can't be done
>>> because both are separate subclasses of ndarray. The only reasonable way to
>>> deal with this is to move the mask into the core ndarray."
>>>
>>> Meh. I did try to make it easy to use masked arrays on top of subclasses.
>>> There's even some tests in the suite to that effect (test_subclassing). I'm
>>> not buying the argument.
>>> About moving mask in the core ndarray: I had suggested back in the days
>>> to have a mask flag/property built-in ndarrays (which would *really* have
>>> simplified the game), but this suggestion was dismissed very quickly as
>>> adding too much overload. I had to agree. I'm just a tad surprised the wind
>>> has changed on that matter.
>>>
>>>
>>> "In the current masked array, calculations are done for the whole array,
>>> then masks are patched up afterwards. This means that invalid calculations
>>> sitting in masked elements can raise warnings or exceptions even though they
>>> shouldn't, so the ufunc error handling mechanism can't be relied on."
>>>
>>> Well, there's a reason for that. Initially, I tried to guess what the
>>> mask of the output should be from the mask of the inputs, the objective
>>> being to avoid getting NaNs in the C array. That was easy in most cases,
>>> but it turned out it wasn't always possible (the `power` one caused me a
>>> lot of issues, if I recall correctly). So, for performance issues (to avoid
>>> a lot of expensive tests), I fell back on the old concept of "compute them
>>> all, they'll be sorted afterwards".
>>> Of course, that's a rather clumsy approach. But it works not too badly
>>> when in pure Python. No doubt that a proper C implementation would work
>>> faster.
>>> Oh, about using NaNs for invalid data? Well, that can't work with integers.
>>>
>>> `mask` property:
>>> Nothing to add to it. It's basically what we have now (except for the
>>> opposite convention).
>>>
>>> Working with masked values:
>>> I recall some strong points back in the days for not using None to
>>> represent missing values...
>>> Adding a maskedstr argument to array2string? Mmh... I prefer a global
>>> flag like we have now.
>>>
>>> Design questions:
>>> Adding `masked` or whatever we call it to a number/array should result in
>>> masked/a fully masked array, period. That way, we can have an idea that
>>> something was wrong with the initial dataset.
>>> hardmask: I never used the feature myself. I wonder if anyone did. Still,
>>> it's a nice idea...
>>>
>>
>> As a heavy masked_array user, I regret not being able to participate more
>> in this discussion as I am madly cranking out matplotlib code.  I would like
>> to say that I have always seen masked arrays as being the "next step up"
>> from using arrays with NaNs.  The hardmask/softmask/sharedmask concepts
>> are powerful, and I don't think they have yet been exploited to their
>> fullest potential.
>>
>
> Do you have some examples where hardmask or sharedmask are being used? I
> like the idea of using a hardmask array as the return value for boolean
> indexing, but some more use cases would be nice.
>
>

At one point I did have something for soft/hard masks, but I think my final
implementation went a different direction.  I would have to look around.  I
do have a good use-case for soft masks.  For a given data, I wanted to
produce several pcolors highlighting different regions.  A soft mask
provided me a quick-n-easy way to change the mask without having to produce
many copies of the original data.
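
A minimal sketch of that workflow with numpy.ma, assuming a small made-up
3x4 array stands in for the pcolor data (the array and the region thresholds
are illustrative, not taken from the original plotting code):

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-in for the field being plotted.
data = np.arange(12, dtype=float).reshape(3, 4)
original = data.copy()  # kept only to check that the data is never modified

# One masked wrapper around the existing buffer; masks are soft by default.
field = ma.array(data, copy=False)

# Highlight one region: hide everything below 3.
field.mask = data < 3
visible_first = field.count()

# Highlight another region: a soft mask can simply be replaced.
field.mask = data >= 8
visible_second = field.count()

data_untouched = np.array_equal(data, original)
```

Only the boolean mask changes between the two plots; the float data is
wrapped once and never duplicated.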

>> Masks are (relatively) easy when dealing with element-by-element operations
>> that produce an array of the same shape (or at least the same number of
>> elements in the case of reshape and transpose).  What gets difficult is for
>> reductions such as sum or max, etc.  Then you get into the weirder cases
>> such as unwrap and gradients that I brought up recently.  I am not sure how
>> to address this, but I am not a fan of the idea of adding yet another
>> parameter to the ufuncs to determine what to do for filling in a mask.
>>
>
> It looks like in R there is a parameter called na.rm=T/F, which basically
> means "remove NAs before doing the computation". This approach seems good to
> me for reduction operations.
>
>
Just to throw out some examples where these settings really do not make much
sense.  For gradients and unwrap, maybe you want to skip na's, but still
record the number of points you are skipping or maybe the points at
na-boundaries become na's themselves.  Are we going to have something for
each one of these possibilities?  Of course, this isn't even very well dealt
with in masked arrays right now.
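
To make the two policies concrete, here is a rough sketch for a first
difference on a 1-D masked array.  Both helper names (propagating_diff and
skipping_diff) are hypothetical and not part of NumPy; they only illustrate
"boundaries become NA" versus "skip NAs and record how many were skipped":

```python
import numpy as np
import numpy.ma as ma

def propagating_diff(a):
    """First difference; a result is masked if either neighbour is masked."""
    values = ma.getdata(a)
    bad = ma.getmaskarray(a)
    return ma.array(values[1:] - values[:-1], mask=bad[1:] | bad[:-1])

def skipping_diff(a):
    """Differences between consecutive valid points, plus the gap sizes."""
    idx = np.flatnonzero(~ma.getmaskarray(a))
    values = ma.getdata(a)[idx]
    skipped = np.diff(idx) - 1  # masked points between each valid pair
    return np.diff(values), skipped

a = ma.array([1.0, 2.0, 0.0, 5.0, 7.0], mask=[0, 0, 1, 0, 0])

propagated = propagating_diff(a)
diffs, skipped = skipping_diff(a)
```

The two functions return results of different lengths and meanings, which is
why a single na.rm-style switch does not obviously cover both.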

Another example of how we use masks in matplotlib is in pcolor().  We have
to combine the possible masks of X, Y, and V in both the x and y directions
to find the final mask to use for the final output result (because each
facet needs valid data at each corner).  Having a soft-mask implementation
allows one to create a temporary mask to use for the operation, and to share
that mask across all the input data, but then let the data structures retain
their original masks when done.
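
The corner rule can be sketched as follows.  This is an illustration under
the assumption that X and Y hold the (ny+1, nx+1) facet corners and V holds
the (ny, nx) facet values; facet_mask is a hypothetical helper, not the
actual matplotlib implementation:

```python
import numpy as np
import numpy.ma as ma

def facet_mask(X, Y, V):
    """Mask a facet if its value or any of its four corners is masked."""
    corner_bad = ma.getmaskarray(X) | ma.getmaskarray(Y)
    any_corner_bad = (corner_bad[:-1, :-1] | corner_bad[:-1, 1:] |
                      corner_bad[1:, :-1] | corner_bad[1:, 1:])
    return ma.getmaskarray(V) | any_corner_bad

xs, ys = np.meshgrid(np.arange(4.0), np.arange(3.0))  # 3x4 grid of corners
X = ma.array(xs)
Y = ma.array(ys)
X[1, 1] = ma.masked          # one bad corner
V = ma.array(np.ones((2, 3)))
V[0, 2] = ma.masked          # one bad facet value

combined = facet_mask(X, Y, V)

# Temporary masked wrapper for plotting; V keeps its own mask.
V_plot = ma.array(V.data, mask=combined, copy=False)
```

The combined mask lives only on the temporary V_plot; the masks of X, Y and
V are left as they were.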

>> Also, just to make things messier, there is an incomplete feature that was
>> made for record arrays with regards to masking.  The idea was to allow for
>> element-by-element masking, but also allow for row-by-row (or was it
>> column-by-column?) masking.  I thought it was a neat feature, and it is too
>> bad that it was not finished.
>>
>
> I put this in my design, I think this would be useful too. I would call it
> field by field, though many people like thinking of the struct dtype fields
> as columns.
>
>
Fields are fine.  I have found that there is no real consistency with how
professionals refer to their rows and columns as "records" and "fields".  I
learned data-handling from working on databases, but my naming convention
often clashes with that of some of my committee members who come from a stats
background.

>> Anyway, my opinion is that a mask should be True for a value that needs
>> to be hidden.  Do not change this convention.  People coming into python
>> already have to change code, a simple bit flip for them should be fine.
>> Breaking existing python code is worse.
>>
>
> I'm now thinking the mask needs to be pushed away into the background to
> where it becomes an unimportant implementation detail of the system. It
> deserves a long cumbersome name like "validitymask", and then the system can
> use something close to R's approach with an NA-like singleton for most
> operations.
>

Don't lose sight of the fact that we are really talking about two orthogonal
(albeit seemingly similar) concepts: "missing" data and "ambiguous" data.  Both of
these tools need to be at the forefront and the distinction needs to be made
clear to the users so that they know which one they need in what situation.
I think hiding masks is a bad idea.  I want numpy to be *better* than R by
offering both features in a clear, non-conflicting manner.

On a note somewhat similar to what I was pointing out earlier with regard to
soft masks: one thing that is very nice about masked_arrays is that I can
at any time turn a regular numpy array into a masked array without paying the
penalty of having to re-assign the data.  I just need to make a separate mask
object.
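
For instance, a plain array can be viewed as a masked array in place.  The
-999.0 sentinel below is a made-up example value, not something from the
discussion:

```python
import numpy as np
import numpy.ma as ma

raw = np.array([1.0, -999.0, 3.0, 4.0])

# Reinterpret the existing buffer as a MaskedArray; no data is copied.
wrapped = raw.view(ma.MaskedArray)

# The mask is a separate boolean array attached afterwards.
wrapped.mask = (raw == -999.0)

same_buffer = np.shares_memory(wrapped.data, raw)
mean_valid = wrapped.mean()  # averages only the unmasked entries
```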

This is different from how one would operate with a na-dtype approach, where
converting an array with a regular dtype into a na-dtype array would require
a copy.  However, with proper dtype-handling, this may not be of much
concern (non-na-dtype + na-dtype --> na-dtype, much like how int + float -->
float).  Also loading functions could be told to cast to a na-dtype, which
would then result in an array that is ready "out-of-the-box" as opposed to
casting the masked array after the creation of the regular ndarray from a
function like np.loadtxt().
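
The na-dtype is only a proposal, so it cannot be shown directly.  As a hedged
illustration using existing NumPy calls only: np.result_type shows the int +
float promotion mentioned above, and np.genfromtxt with usemask=True is the
closest current analogue of a loader that returns a ready-to-use array (here
a masked array rather than an na-dtype array).  The inline CSV text is made
up for the example:

```python
import io
import numpy as np
import numpy.ma as ma

# The existing promotion rule that na-dtype promotion would mirror.
promoted = np.result_type(np.int64, np.float64)

# A loader that marks missing fields itself, instead of requiring a
# separate masking step after the regular ndarray has been created.
text = io.StringIO("1,2,3\n4,,6")
table = np.genfromtxt(text, delimiter=",", usemask=True)

valid_total = table.sum()  # the empty field is masked and ignored
```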

Again, there are pros and cons either way and I see them as very orthogonal and
complementary.  Heck, I could even imagine situations where one might want a
mask over an array with a na-dtype.

Ben Root