[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe at gmail.com
Thu Jun 23 20:34:11 EDT 2011


On Thu, Jun 23, 2011 at 7:31 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:

> On Thu, Jun 23, 2011 at 6:21 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>
>> On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern <robert.kern at gmail.com>
>>> wrote:
>>> > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>> >> Enthought has asked me to look into the "missing data" problem and how
>>> >> NumPy could treat it better. I've considered the different ideas of
>>> >> adding dtype variants with a special signal value and masked arrays,
>>> >> and concluded that adding masks to the core ndarray appears to be the
>>> >> best way to deal with the problem in general.
>>> >> I've written a NEP that proposes a particular design, viewable here:
>>> >>
>>> >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>>> >> There are some questions at the bottom of the NEP which definitely need
>>> >> discussion to find the best design choices. Please read, and let me know
>>> >> of all the errors and gaps you find in the document.
>>> >
>>> > One thing that could use more explanation is how your proposal
>>> > improves on the status quo, i.e. numpy.ma. As far as I can see, you
>>> > are mostly just shuffling around the functionality that already
>>> > exists. There has been a continual desire for something like R's NA
>>> > values by people who are very familiar with both R and numpy's masked
>>> > arrays. Both have their uses, and as Nathaniel points out, R's
>>> > approach seems to be very well-liked by a lot of users. In essence,
>>> > *that's* the "missing data problem" that you were charged with: making
>>> > happy the users who are currently dissatisfied with masked arrays. It
>>> > doesn't seem to me that moving the functionality from numpy.ma to
>>> > numpy.ndarray resolves any of their issues.
>>>
>>> Speaking as a user who's avoided numpy.ma, it wasn't actually because
>>> of the behavior I pointed out (I never got far enough to notice it),
>>> but because I got the distinct impression that it was a "second-class
>>> citizen" in numpy-land. I don't know if that's true. But I wasn't sure
>>> how solidly things like interactions between numpy and masked arrays
>>> worked, and it seemed like it had more niche uses. So it just
>>> seemed like more hassle than it was worth for my purposes. Moving it
>>> into the core and making it really solid *would* address these
>>> issues...
>>>
>>
>> These are definitely things I'm trying to address.
>>
>>> It does have to be solid, though. It occurs to me on further thought
>>> that one major advantage of having first-class "NA" values is that it
>>> preserves the standard looping idioms:
>>>
>>> for i in xrange(len(x)):
>>>     x[i] = np.log(x[i])
>>>
>>> According to the current proposal, this will blow up, but np.log(x)
>>> will work. That seems suboptimal to me.
>>>
>>
>> This boils down to the choice between None and a zero-dimensional array as
>> the return value of 'x[i]'. This, and the desire that 'x[i] == x[i]' should
>> be False if it's a masked value, have convinced me that a zero-dimensional
>> array is the way to go, and your example will work with this choice.
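
For reference, numpy.ma already keeps that loop idiom working when the mask-aware
ufuncs are used, because indexing a masked element hands back a masked scalar that
survives the round-trip through the ufunc and the assignment. A minimal runnable
illustration with today's numpy.ma (under the proposal, the analogous 'x[i]' would
be a zero-dimensional masked array rather than the np.ma.masked constant):

import numpy as np

x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
for i in range(len(x)):
    x[i] = np.ma.log(x[i])     # the masked element stays masked through the loop

print(x)                       # [0.0 -- 1.0986...]
print(x[1] is np.ma.masked)    # True: a masked element comes back as a masked scalar
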
>>
>>
>>>
>>> I do find the argument that we want a general solution compelling. I
>>> suppose we could have a magic "NA" value in Python-land which
>>> magically triggers fiddling with the mask when assigned to numpy
>>> arrays.
>>>
>>> It should also be possible to accomplish a general solution at the
>>> dtype level. We could have a 'dtype factory' used like:
>>>  np.zeros(10, dtype=np.maybe(float))
>>> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
>>> + 1, where the extra byte is used to store missingness information.
>>> (There might be some annoying alignment issues to deal with.) Then for
>>> each ufunc we define a handler for the maybe dtype (or add a
>>> special-case to the ufunc dispatch machinery) that checks the
>>> missingness value and then dispatches to the ordinary ufunc handler
>>> for the wrapped dtype.
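
One way to picture that 'maybe' factory with today's NumPy is a structured dtype
that appends a one-byte missingness flag to the wrapped type. A rough sketch of
the intended layout (np.maybe itself is hypothetical; this is not part of the
proposal):

import numpy as np

def maybe(base):
    # Hypothetical factory: wrap 'base' in a struct of (value, missing-flag).
    base = np.dtype(base)
    return np.dtype([('value', base), ('isna', np.uint8)])

a = np.zeros(10, dtype=maybe(float))
a['isna'][3] = 1                  # mark element 3 as missing
print(a.dtype.itemsize)           # 9: the 'x.itemsize + 1' layout, with no padding
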
>>>
>>
>> The 'dtype factory' idea builds on the way I've structured datetime as a
>> parameterized type, but the thing that kills it for me is the alignment
>> problems of 'x.itemsize + 1'. Having the mask in a separate memory block is
>> a lot better than having to store 16 bytes for an 8-byte int to preserve the
>> alignment.
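
The padding cost is easy to see by asking NumPy for an aligned version of that
kind of layout, versus keeping the mask in its own block (a rough comparison
using the sketch above, not something from the NEP itself):

import numpy as np

packed  = np.dtype([('value', np.int64), ('isna', np.uint8)])
aligned = np.dtype([('value', np.int64), ('isna', np.uint8)], align=True)
print(packed.itemsize, aligned.itemsize)   # 9 vs 16: the 8-byte int gets padded

# Separate mask block: 8 bytes of data plus 1 byte of mask per element,
# with each buffer staying naturally aligned on its own.
data = np.zeros(10, dtype=np.int64)
mask = np.zeros(10, dtype=np.uint8)
print(data.itemsize + mask.itemsize)       # 9 bytes per element across two buffers
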
>>
>
> Yes, but that assumes it is appended to the existing types in the dtype
> individually instead of the dtype as a whole. The dtype with mask could just
> indicate a shadow array, an alpha channel if you will, which is essentially
> what you are already doing, just providing a different place to track it.
>

This would seem to change the definition of a dtype - currently it
represents a contiguous block of memory. It doesn't need to use all of that
memory, but the dtype conceptually owns it. I kind of like it that way, with
the whole strides idea of data spread all over memory space belonging to the
ndarray, not the dtype.
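
A rough hand-rolled sketch of that "shadow array" picture with today's NumPy: a
separate mask buffer mirrors the data buffer, and the same indexing applies to
both, which is why it sits naturally at the ndarray level, where strides live,
rather than inside the dtype (illustrative only, not the NEP's implementation):

import numpy as np

data  = np.arange(12, dtype=np.float64).reshape(3, 4)
valid = np.ones(data.shape, dtype=np.bool_)    # shadow buffer, the "alpha channel"
valid[1, 2] = False                            # mark one element as missing

# The same slicing applies to both buffers, keeping views and their masks in sync.
sub_data  = data[::2, 1:]
sub_valid = valid[::2, 1:]
print(sub_data.strides, sub_valid.strides)     # strides belong to the arrays, not the dtype
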

-Mark


> This would require fixing the issue where ufunc inner loops can't
>>> actually access the dtype object, but we should fix that anyway :-).
>>>
>>
>> Certainly true!
>>
>>
> Chuck