[Numpy-discussion] Masked Array for NumPy 1.7

Nathaniel Smith njs at pobox.com
Sat May 19 14:23:16 EDT 2012


On Sat, May 19, 2012 at 5:45 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Sat, May 19, 2012 at 10:02 AM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>>
>> On Sat, May 19, 2012 at 9:21 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>>
>>> On Sat, May 19, 2012 at 10:00 AM, David Cournapeau <cournape at gmail.com>
>>> wrote:
>>>>
>>>> On Sat, May 19, 2012 at 3:17 PM, Charles R Harris
>>>> <charlesr.harris at gmail.com> wrote:
>>>>>
>>>>> On Fri, May 18, 2012 at 3:47 PM, Travis Oliphant <travis at continuum.io>
>>>>> wrote:
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> After reading all the discussion around masked arrays and getting
>>>>>> input from as many people as possible, it is clear that there is still
>>>>>> disagreement about what to do, but there have been some fruitful discussions
>>>>>> that ensued.
>>>>>>
>>>>>> This isn't really new as there was significant disagreement about what
>>>>>> to do when the masked array code was initially checked in to master. So,
>>>>>> in order to move forward, Mark and I are going to work together with
>>>>>> whomever else is willing to help with an effort that is in the spirit of my
>>>>>> third proposal but has a few adjustments.
>>>>>>
>>>>>> The idea will be fleshed out in more detail as it progresses, but the
>>>>>> basic concept is to create an (experimental) ndmasked object in NumPy 1.7
>>>>>> and leave the actual ndarray object unchanged. While the details need to
>>>>>> be worked out here, a goal is to have the C-API work with both ndmasked
>>>>>> arrays and arrayobjects (possibly by defining a base-class C-level structure
>>>>>> that both inherit from). This might also be a good way for Dag
>>>>>> to experiment with his ideas, but that is not an explicit goal.
>>>>>>
>>>>>> One way this could work, for example, is to have PyArrayObject * be the
>>>>>> base-class array (essentially the same C-structure we have now with a
>>>>>> HASMASK flag). Then, the ndmasked object could inherit from PyArrayObject *
>>>>>> as well but add more members to the C-structure. I think this is the
>>>>>> easiest thing to do and requires the least amount of code-change. It is
>>>>>> also possible to define an abstract base-class PyArrayObject * that both
>>>>>> ndarray and ndmasked inherit from. That way ndarray and ndmasked are
>>>>>> siblings even though the ndarray would essentially *be* the PyArrayObject *
>>>>>> --- just with a different type-hierarchy on the python side.
>>>>>>
>>>>>> This work will take some time and, therefore, I don't expect 1.7 to be
>>>>>> released prior to SciPy Austin with an end of June target date. The timing
>>>>>> will largely depend on what time is available from people interested in
>>>>>> resolving the situation. Mark and I will have some availability for this
>>>>>> work in June but not a great deal (about 2 man-weeks total between us).
>>>>>> If there are others who can step in and help, it will help accelerate the
>>>>>> process.
>>>>>>
>>>>>
>>>>> This will be a difficult thing for others to help with since the
>>>>> concept is vague, the design decisions seem to be in your and Mark's hands,
>>>>> and you say you don't have much time. It looks to me like 1.7 will keep
>>>>> slipping and I don't think that is a good thing. Why not go for option 2,
>>>>> which will get 1.7 out there and push the new masked array work in to 1.8?
>>>>> Breaking the flow of development and release has consequences, few of them
>>>>> good.
>>>>
>>>>
>>>> Agreed. 1.6.0 was released one year ago already, let's focus on
>>>> polishing what's in there *now*. I have not followed closely what the
>>>> decision was for a LTS release, but if 1.7 is supposed to be it, that's
>>>> another argument against changing anything there for 1.7.
>>>
>>>
>>> The motivation behind splitting the mask out into a separate ndmasked is
>>> primarily so that pre-existing code will not silently function on NA-masked
>>> arrays and produce incorrect results. This centres around using PyArray_DATA
>>> to get at the data after manually checking flags, instead of calling
>>> PyArray_FromAny. Maybe a reasonable solution is to tweak the behavior of
>>> PyArray_DATA? It could work as follows:
>>>
>>> - If an ndarray has no mask, PyArray_DATA returns the data pointer as it
>>> does currently.
>>> - If the ndarray has an NA-mask, PyArray_DATA sets an exception and
>>> returns NULL
>>> - Create a new accessor, PyArray_DATAPTR or PyArray_RAWDATA, which
>>> returns the array data under all circumstances.
>>>
>>> This way, code which currently uses the data pointer through PyArray_DATA
>>> will fail instead of silently working with the wrong interpretation of the
>>> data. What do people feel about this idea?
>>>
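For concreteness, here is a toy Python model of the accessor split proposed
above. It is only a sketch under stated assumptions: ToyArray, data(),
raw_data() and MaskedDataError are made-up names standing in for the ndarray,
PyArray_DATA, the proposed PyArray_RAWDATA (or PyArray_DATAPTR), and the
exception that would be set. The real change would live in the C API, and
nothing below exists in NumPy.

```python
class MaskedDataError(RuntimeError):
    """Hypothetical error for plain data access on an NA-masked array."""


class ToyArray:
    """Hypothetical stand-in for an ndarray that may carry an NA mask."""

    def __init__(self, values, mask=None):
        self._values = list(values)
        # mask is None for an ordinary array, otherwise one bool per element.
        self._mask = None if mask is None else [bool(m) for m in mask]

    def data(self):
        """Models the proposed PyArray_DATA: fail if a mask is present."""
        if self._mask is not None:
            raise MaskedDataError("array has an NA mask; use raw_data()")
        return self._values

    def raw_data(self):
        """Models the proposed raw accessor: always return the buffer."""
        return self._values


def legacy_sum(arr):
    """Old-style code that knows nothing about masks."""
    return sum(arr.data())
```

With this split, legacy_sum() keeps working on unmasked arrays and raises
on masked ones instead of silently summing the masked-out values; mask-aware
code would opt in through raw_data().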
>>
>> Code working with the wrong interpretation of the data doesn't bother me
>> much at this point in development. Long term it matters, but in the short
>> term we can't expect code not explicitly written to work with masked arrays
>> to do the right thing. I think we are looking at a period of several years
>> before things settle out and get accepted. First, the implementation and its
>> interface need to get close to final form, and then the long slow process
>> of adoption into things like matplotlib needs to take place. I'd guess three
>> to five years for that process.
>>
>> That said, my main concern is to move forward and not spend the next year
>> waiting. I see splitting the masked code out as rather like the python types
>> having pointers to sequence/numerical/etc methods, i.e., ndarray then looks
>> something like an abstract class. I don't have a problem with that and it
>> does avoid base object bloat. As to having PyArray_DATA fail for masked
>> arrays and provide new functions for unrestricted access, I'd be tempted to
>> have PyArray_DATA continue to behave as it does and let the new functions
>> return the error for masked arrays. Making third party applications fail for
>> masked arrays is going to make masked arrays very unpopular. Most likely no one
>> would use them and third party applications would feel no pressure to
>> support them. Another possibility might be to have a compile flag that
>> determines whether or not PyArray_DATA returns an error for masked arrays,
>> something like we do now for deprecating old macros.
>>
>
> My own plan for the near term would be as follows:
>
> 1) Put in the experimental option and get the 1.7 release out. This gets us
> through the next couple of months and keeps things moving.

+1 on not blocking the release while we invent+implement yet another
experimental API.

> 2) Look at what hooks/low level functions would let us reimplement np.ma.
> Because there are so many different mask uses out there, this would be a
> good way to discover what low level support is likely to provide a good
> basis for others to build on.
>
> 3) Revisit the idea of making all ndarrays masked by default, but do so with
> the experience and feedback from current mask users.

I like this plan.

-- Nathaniel


