[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe at gmail.com
Sat Jun 25 15:35:40 EDT 2011


On Fri, Jun 24, 2011 at 8:25 PM, Benjamin Root <ben.root at ou.edu> wrote:

> On Fri, Jun 24, 2011 at 8:00 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>
>> On Fri, Jun 24, 2011 at 6:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>>>  On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>> >
>>> >
>>> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett <
>>> matthew.brett at gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root <ben.root at ou.edu>
>>> wrote:
>>> >> ...
>>> >> > Again, there are pros and cons either way and I see them as very
>>> >> > orthogonal and complementary.
>>> >>
>>> >> That may be true, but I imagine only one of them will be implemented.
>>> >>
>>> >> @Mark - I don't have a clear idea whether you consider the nafloat64
>>> >> option to be still in play as the first thing to be implemented
>>> >> (before array.mask).   If it is, what kind of thing would persuade you
>>> >> either way?
>>> >>
>>> >
>>> > Mark can speak for himself, but I think things are tending towards masks.
>>> > They have the advantage of one implementation for all data types, current
>>> > and future, and they are more flexible since the masked data can be actual
>>> > valid data that you just choose to ignore for experimental reasons.
>>> >
>>> > What might be helpful is a routine to import/export R files, but that
>>> > shouldn't be too difficult to implement.
>>> >
>>> > Chuck
>>> >
>>> >
>>> > _______________________________________________
>>> > NumPy-Discussion mailing list
>>> > NumPy-Discussion at scipy.org
>>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>> >
>>> >
>>>
>>> Perhaps we should make a wiki page someplace summarizing pros and cons
>>> of the various implementation approaches? I worry very seriously about
>>> adding API functions relating to masks rather than having special NA
>>> values which propagate in algorithms. The question is: will Joe Blow
>>> Former R user have to understand what the mask is and how to work with
>>> it? If the answer is yes we have a problem. If it can be completely
>>> hidden as an implementation detail, that's great. In R, NAs are just
>>> sort of inherent: they propagate, and you deal with them when you have
>>> to via the na.rm flag in functions or is.na.
>>>
>>
>> I think the interface for how it looks in NumPy can be made to be pretty
>> close to the same with either design approach. I've updated the NEP to add
>> and emphasize using masked values with an np.NA singleton, with the
>> validitymask as the implementation mechanism, which remains accessible for
>> those who want to deal with the mask directly.
>>
>
> I think there are a lot of benefits to this idea, if I understand it
> correctly.  Essentially, if I were to assign np.NA to an element (or a
> slice) of a numpy array, rather than actually assigning that value to that
> spot in the array, it would set the mask to True for those elements?
>

That's correct.
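
As a rough sketch of that assignment behaviour: np.NA is only proposed in
the NEP, so the example below uses numpy.ma.masked from the existing
numpy.ma module as a stand-in. The name NA is my own alias for
illustration, not an existing NumPy name, and the final np.NA semantics
may differ.

```python
import numpy as np

# Assumption: numpy.ma.masked stands in for the proposed np.NA singleton.
NA = np.ma.masked

a = np.ma.array([1.0, 2.0, 3.0, 4.0, 5.0])

a[0] = NA      # assigning to a single element sets its mask to True
a[2:4] = NA    # assigning to a slice sets the mask for every element in it

# Full boolean mask as a plain list, one entry per element.
mask_after = np.ma.getmaskarray(a).tolist()

# Number of elements that are still unmasked.
valid_count = a.count()
```

Nothing is written into the data buffer by these assignments; only the
mask changes.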

> I see a lot of benefits to this idea.  Imagine in pylab mode (from pylab
> import *), users would have the NA name right in their namespace, just like
> they are used to with R.  And those who want could still mess around with
> masks much like we are used to.  Plus, I think we can still retain the good
> ol' C-pointer to regular data.  My question is this.  Will it be a soft or
> hard mask?  In other words, if I were to assign np.NA to a spot in an array,
> would it kill the value that was there (if it already was initialized)?
> Would I still be able to share masks?
>

Everything so far being discussed has been soft masks, from my understanding
of the soft/hard mask distinction. Whether to support hard masks is still an
open question; my only use case is a more reasonable return value
for boolean indexing than is currently used. I don't believe sharing masks
will be possible, because it would directly violate the missing value
abstraction. Can you describe an example where you are sharing masks?
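
To make the soft/hard distinction concrete, here is a small sketch using
the existing numpy.ma module, taking the numpy.ma meaning of the terms
(that this is the meaning intended in the question is an assumption on my
part). With a soft mask, assigning an ordinary value to a masked element
unmasks it; with a hard mask, the element stays masked.

```python
import numpy as np

initial_mask = [False, True, False]

# Soft mask is the numpy.ma default; hard_mask=True requests a hard mask.
soft = np.ma.array([1.0, 2.0, 3.0], mask=initial_mask)
hard = np.ma.array([1.0, 2.0, 3.0], mask=initial_mask, hard_mask=True)

soft[1] = 20.0   # soft mask: the assignment clears the mask for this element
hard[1] = 20.0   # hard mask: the element remains masked after the assignment

soft_mask_after = np.ma.getmaskarray(soft).tolist()
hard_mask_after = np.ma.getmaskarray(hard).tolist()
```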

> Admittedly, it is a munging of two distinct ideas, but, I think the end
> result would still be the same.
>

It seems to me that the only distinction between using a mask versus an NA
bit pattern is that the NA bit pattern causes the underlying value to be
destroyed when assigning np.NA. So I'm thinking of the mask as supporting
everything the NA bit pattern does, and a bit more.
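
A minimal sketch of that difference, under two assumptions: NaN plays the
role of the NA bit pattern (the proposed nafloat64 type does not exist),
and numpy.ma.masked plays the role of np.NA for the mask approach.

```python
import numpy as np

values = np.array([1.5, 2.5, 3.5])

# Bit-pattern approach (NaN as a stand-in): the NA marker is stored in the
# element itself, so the previous value 2.5 is overwritten.
bitpattern = values.copy()
bitpattern[1] = np.nan

# Mask approach (numpy.ma as a stand-in): only the mask entry changes and
# the stored value is left in place.
masked = np.ma.array(values.copy())
masked[1] = np.ma.masked
stored = masked.data[1]      # underlying value, still available

# Skipping the missing element gives the same result either way.
nan_sum = np.nansum(bitpattern)
mask_sum = masked.sum()
```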


>>
>>> The other problem I can think of with masks is the extra memory
>>> footprint, though maybe this is no cause for concern.
>>>
>>
>> The overhead is definitely worth considering, along with the extra memory
>> traffic it generates, and I've basically concluded that the increased
>> generality and flexibility are worth the added cost.
>>
>>
> If we go with the mask approach, one could later try and optimize the
> implementation of the masks to reduce the memory footprint.  Potentially,
> one could reduce the footprint by 7/8ths!  Maybe some sneaky striding tricks
> could help avoid too many cache misses (or go with the approach Pierre
> mentioned which was to calculate them all and let the masks sort them out
> afterwards).
>

This could be very tricky to implement, especially supporting views. I do
see this as a strong argument for completely hiding the mask memory from the
user of the system, so that it can only be accessed by getting a copy. Without
a strong data-hiding approach such as this, the 7/8ths memory footprint
optimization is basically precluded.
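
For reference, the 7/8ths figure comes from storing one bit per element
instead of one byte. The sketch below only illustrates that arithmetic
with np.packbits on a standalone boolean array; it is not a proposal for
how the mask would actually be stored.

```python
import numpy as np

n = 1024

# A byte-per-element boolean mask, as numpy stores bool arrays today.
mask = np.arange(n) % 3 == 0

# Pack eight mask entries into each byte.
packed = np.packbits(mask)

# Fraction of mask memory saved by the packed form.
saving = 1 - packed.nbytes / mask.nbytes

# Getting a usable boolean mask back requires unpacking into a copy.
restored = np.unpackbits(packed)[:n].astype(bool)
```

Note that a view such as mask[3:] has no direct equivalent on the packed
form, since it would not start on a byte boundary. That is the kind of
difficulty with views I have in mind.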


> As a complete side-thought, I wonder how sparse arrays could play into this
> discussion?
>

Sparse matrices usually mean that 0 values aren't stored. A sparse storage
mechanism for masked-array data in which most elements are masked is a
closely related and important problem, but it doesn't have much bearing on
the present discussion, since the usage patterns have to be dramatically
different to support efficiency.

-Mark


>
> Ben Root
>