[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 09:57:04 EDT 2011

On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>
>> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> > Enthought has asked me to look into the "missing data" problem and how
>> > NumPy could treat it better. I've considered the different ideas of
>> > adding dtype variants with a special signal value and masked arrays, and
>> > concluded that adding masks to the core ndarray is the best way to deal
>> > with the problem in general.
>> > I've written a NEP that proposes a particular design, viewable here:
>> >
>> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>> > There are some questions at the bottom of the NEP which definitely need
>> > discussion to find the best design choices. Please read, and let me know
>> > of all the errors and gaps you find in the document.
>>
>> Wow, that is exciting.
>>
>> I wonder about the relative performance of the two possible
>> implementations (mask and NA) in the NEP.
>
> I've given that some thought, and I don't think there's a clear way to tell
> what the performance gap would be without implementations of both to
> benchmark against each other. I favor the mask primarily because it provides
> masking for all data types in one go with a single consistent interface to
> program against. For adding NA signal values, each new data type would need
> a lot of work to gain the same level of support.
>>
>> If you are, say, doing a calculation along the columns of a 2d array
>> one element at a time, then you will need to grab an element from the
>> array and grab the corresponding element from the mask. I assume the
>> corresponding data and mask elements are not stored together. That
>> would be slow since memory access is usually where time is spent. In
>> this regard NA would be faster.
>
> Yes, the masks add more memory traffic and some extra calculation, while the
> NA signal values just require some additional calculations.

I guess a better example would have been summing along rows instead of
columns of a large C order array. If one needs to look at both the
data and the mask then wouldn't summing along rows in cython be about
as slow as it is currently to sum along columns?
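
To make the memory-traffic question concrete, here is a minimal sketch of a row sum that consults a separate mask buffer for every element. The function name `masked_row_sums` and the convention that `True` means "missing" are assumptions for illustration (the latter borrowed from numpy.ma), not the design in the NEP.

```python
import numpy as np

def masked_row_sums(data, mask):
    """Sum each row of `data`, skipping elements where `mask` is True."""
    n_rows, n_cols = data.shape
    out = np.zeros(n_rows, dtype=np.float64)
    for i in range(n_rows):
        total = 0.0
        for j in range(n_cols):
            # Two loads per element from two separate buffers:
            # one from the mask, one from the data.
            if not mask[i, j]:
                total += data[i, j]
        out[i] = total
    return out

data = np.arange(6, dtype=np.float64).reshape(2, 3)
mask = np.array([[False, True, False],
                 [False, False, True]])
row_sums = masked_row_sums(data, mask)
```

Both arrays are walked in the same C order here, so each one is read contiguously; the extra cost is a second memory stream rather than a strided access pattern. How much that matters in practice would need a benchmark.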

>> I currently use NaN as a missing data marker. That adds things like
>> this to my cython code:
>>
>>    if a[i] == a[i]:
>>        asum += a[i]
>>
>> If NA also had the property NA == NA is False, then it would be easy
>> to use.
>
> That's what I believe it should do, and I guess this is a strike against the
> idea of returning None for a single missing value.
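
The self-comparison idiom above can be sketched in plain Python, using NaN as the missing-data marker. The function name `nan_skipping_sum` is made up for this example; it only illustrates why a marker with `x == x` being False is convenient and why `None` would not fit the same pattern.

```python
import math

def nan_skipping_sum(values):
    """Sum `values`, skipping NaN via the x == x test."""
    asum = 0.0
    for x in values:
        if x == x:  # False only when x is NaN
            asum += x
    return asum

total = nan_skipping_sum([1.0, math.nan, 2.5])

# None compares equal to itself, so the same test would not filter it out.
none_equals_itself = (None == None)
```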

If NA == NA is False then I wouldn't need to look at the mask in the
example above. Or would ndarray have to look at the mask in order to
return NA for a[i]? Which would mean __getitem__ would need to look at
the mask?

If the missing value is returned as a 0d array (so that NA == NA is
False), would that break cython in a fundamental way since it could
not always return a same-sized scalar when you index into an array?
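
For comparison, the existing numpy.ma module already has this property: indexing returns a plain scalar for a valid element and the 0d `np.ma.masked` constant for a masked one. The snippet below only shows current numpy.ma behavior; whether the proposed core masks would behave the same way is an open question, not something the NEP is assumed to specify.

```python
import numpy as np

a = np.ma.array([1.0, 2.0], mask=[False, True])

valid = a[0]    # unmasked element: an ordinary NumPy scalar
missing = a[1]  # masked element: the np.ma.masked singleton

valid_is_float = isinstance(valid, np.float64)
missing_is_masked = missing is np.ma.masked
masked_ndim = np.ma.masked.ndim  # the masked constant is a 0d array
```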

>> A mask, on the other hand, would be more difficult for third
>> party packages to support. You have to check if the mask is present
>> and if so do a mask-aware calculation; if it is not present then you
>> have to do a non-mask based calculation.
>
> I actually see the mask as being easier for third party packages to support,
> particularly from C. Having regular C-friendly values with a boolean mask is
> a lot friendlier than values that require a lot of special casing like the
> NA signal values would require.
>
>>
>> So you have two code paths.
>> You also need to check if any of the input arrays have masks and if so
>> apply masks to the other inputs, etc.
>
> Most of the time, the masks will transparently propagate or not along with
> the arrays, with no effort required. In Python, the code you write would be
> virtually the same between the two approaches.
> -Mark
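
As a rough illustration of masks propagating without extra effort, here is how the existing numpy.ma module behaves when a masked array is combined with a plain ndarray. This uses numpy.ma as a stand-in and is not the interface proposed in the NEP; the helper `total_of` is a hypothetical name for this sketch.

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.array([10.0, 20.0, 30.0])  # plain ndarray, no mask

c = a + b  # the mask of `a` carries over to the result
mask_of_c = np.ma.getmaskarray(c).tolist()
total = c.sum()  # masked elements are skipped

def total_of(x):
    """One code path for both masked and unmasked input."""
    return np.ma.asarray(x).sum()

plain_total = total_of(b)
masked_total = total_of(a)
```

Wrapping the input with `np.ma.asarray` is one way to avoid writing two separate code paths in Python, at the cost of always going through the masked machinery.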
>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion