[Numpy-discussion] Masked Array for NumPy 1.7

Charles R Harris charlesr.harris at gmail.com
Sat May 19 12:45:03 EDT 2012


On Sat, May 19, 2012 at 10:02 AM, Charles R Harris <
charlesr.harris at gmail.com> wrote:

>
>
> On Sat, May 19, 2012 at 9:21 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>
>> On Sat, May 19, 2012 at 10:00 AM, David Cournapeau <cournape at gmail.com> wrote:
>>
>>> On Sat, May 19, 2012 at 3:17 PM, Charles R Harris <
>>> charlesr.harris at gmail.com> wrote:
>>>
>>>> On Fri, May 18, 2012 at 3:47 PM, Travis Oliphant <travis at continuum.io> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> After reading all the discussion around masked arrays and getting
>>>>> input from as many people as possible, it is clear that there is still
>>>>> disagreement about what to do, but there have been some fruitful
>>>>> discussions that ensued.
>>>>>
>>>>> This isn't really new, as there was significant disagreement about what
>>>>> to do when the masked array code was initially checked in to master. So,
>>>>> in order to move forward, Mark and I are going to work together with
>>>>> whoever else is willing to help on an effort that is in the spirit of my
>>>>> third proposal but has a few adjustments.
>>>>>
>>>>> The idea will be fleshed out in more detail as it progresses, but the
>>>>> basic concept is to create an (experimental) ndmasked object in NumPy 1.7
>>>>> and leave the actual ndarray object unchanged. While the details need to
>>>>> be worked out here, a goal is to have the C-API work with both ndmasked
>>>>> arrays and arrayobjects (possibly by defining a base-class C-level
>>>>> structure that both array types inherit from). This might also be a good
>>>>> way for Dag to experiment with his ideas, but that is not an
>>>>> explicit goal.
>>>>>
>>>>> One way this could work, for example, is to have PyArrayObject * be the
>>>>> base-class array (essentially the same C-structure we have now with a
>>>>> HASMASK flag). Then, the ndmasked object could inherit from PyArrayObject *
>>>>> as well but add more members to the C-structure. I think this is the
>>>>> easiest thing to do and requires the least amount of code-change. It
>>>>> is also possible to define an abstract base-class PyArrayObject * that both
>>>>> ndarray and ndmasked inherit from. That way ndarray and ndmasked are
>>>>> siblings even though the ndarray would essentially *be* the PyArrayObject *
>>>>> --- just with a different type-hierarchy on the python side.
>>>>>
>>>>> This work will take some time and, therefore, I don't expect 1.7 to be
>>>>> released prior to SciPy Austin with an end of June target date. The
>>>>> timing will largely depend on what time is available from people interested
>>>>> in resolving the situation. Mark and I will have some availability for
>>>>> this work in June but not a great deal (about 2 man-weeks total between
>>>>> us). If there are others who can step in and help, it will help
>>>>> accelerate the process.
>>>>>
>>>>>
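The "base structure plus extra members" layout described above can be sketched with Python's ctypes module, which mimics C struct layout. This is only an illustration under stated assumptions: the field names, the reduced field list, the HASMASK value, and the maskdata member are all hypothetical and are not taken from the NumPy headers.

```python
import ctypes

HASMASK = 0x1  # hypothetical flag bit; the real value is not given in the thread


class ArrayBase(ctypes.Structure):
    """Toy stand-in for the PyArrayObject base structure (fields reduced)."""
    _fields_ = [
        ("data", ctypes.c_void_p),   # pointer to the element buffer
        ("nd", ctypes.c_int),        # number of dimensions
        ("flags", ctypes.c_int),     # bit flags, including HASMASK
    ]


class MaskedArray(ctypes.Structure):
    """Toy stand-in for ndmasked: the base struct first, then extra members."""
    _fields_ = [
        ("base", ArrayBase),             # first member, so it sits at offset 0
        ("maskdata", ctypes.c_void_p),   # hypothetical extra member
    ]


def has_mask(arr_ptr):
    """Check the flag through a base-struct pointer, as C-API code would."""
    return bool(arr_ptr.contents.flags & HASMASK)


plain = ArrayBase(nd=1)
masked = MaskedArray()
masked.base.nd = 2
masked.base.flags |= HASMASK

# View the derived struct through a pointer to the base struct.
as_base = ctypes.cast(ctypes.pointer(masked), ctypes.POINTER(ArrayBase))

print(has_mask(ctypes.pointer(plain)))
print(has_mask(as_base), as_base.contents.nd)
```

Because the base structure is the first member, code that only understands the base layout can still read the shared fields and the flag of the extended object, which is the property the proposal relies on.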
>>>> This will be a difficult thing for others to help with since the
>>>> concept is vague, the design decisions seem to be in your and Mark's hands,
>>>> and you say you don't have much time. It looks to me like 1.7 will keep
>>>> slipping and I don't think that is a good thing. Why not go for option 2,
>>>> which will get 1.7 out there and push the new masked array work into 1.8?
>>>> Breaking the flow of development and release has consequences, few of them
>>>> good.
>>>>
>>>
>>> Agreed. 1.6.0 was released one year ago already, let's focus on
>>> polishing what's in there *now*. I have not followed closely what the
>>> decision was for a LTS release, but if 1.7 is supposed to be it, that's
>>> another argument against changing anything there for 1.7.
>>>
>>
>> The motivation behind splitting the mask out into a separate ndmasked is
>> primarily so that pre-existing code will not silently function on NA-masked
>> arrays and produce incorrect results. This centres around using
>> PyArray_DATA to get at the data after manually checking flags, instead of
>> calling PyArray_FromAny. Maybe a reasonable solution is to tweak the
>> behavior of PyArray_DATA? It could work as follows:
>>
>> - If an ndarray has no mask, PyArray_DATA returns the data pointer as it
>> does currently.
>> - If the ndarray has an NA-mask, PyArray_DATA sets an exception and
>> returns NULL.
>> - Create a new accessor, PyArray_DATAPTR or PyArray_RAWDATA, which
>> returns the array data under all circumstances.
>>
>> This way, code which currently uses the data pointer through PyArray_DATA
>> will fail instead of silently working with the wrong interpretation of the
>> data. What do people feel about this idea?
>>
>>
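The three accessor rules proposed above can be modelled in a few lines of pure Python. This is a rough sketch, not NumPy code: ToyArray, array_data, array_rawdata, and MaskedDataError are hypothetical names, and raising a Python exception stands in for "sets an exception and returns NULL" at the C level.

```python
class MaskedDataError(Exception):
    """Raised where the C proposal would set an exception and return NULL."""


class ToyArray:
    """Minimal stand-in for an array: a data list plus an optional NA-mask."""

    def __init__(self, data, mask=None):
        self.data = list(data)
        self.mask = None if mask is None else list(mask)  # True marks NA


def array_data(arr):
    """Model of the proposed PyArray_DATA: refuse NA-masked arrays."""
    if arr.mask is not None:
        raise MaskedDataError("array has an NA-mask; use array_rawdata()")
    return arr.data


def array_rawdata(arr):
    """Model of the proposed new accessor: always return the data."""
    return arr.data


def legacy_total(arr):
    """Pre-existing style code that knows nothing about masks."""
    return sum(array_data(arr))


def mask_aware_total(arr):
    """New code that opts in to raw access and honours the mask itself."""
    values = array_rawdata(arr)
    if arr.mask is None:
        return sum(values)
    return sum(v for v, is_na in zip(values, arr.mask) if not is_na)


plain = ToyArray([1, 2, 3])
masked = ToyArray([1, 2, 3], mask=[False, True, False])

print(legacy_total(plain))
try:
    legacy_total(masked)
except MaskedDataError as exc:
    print("legacy code fails loudly:", exc)
print(mask_aware_total(masked))
```

In this model the mask-unaware function stops with an error on the masked input instead of summing the NA element, while code written against the raw accessor can skip it explicitly.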
> Code working with the wrong interpretation of the data doesn't bother me
> much at this point in development. Long term it matters, but in the short
> term we can't expect code not explicitly written to work with masked arrays
> to do the right thing. I think we are looking at a period of several years
> before things settle out and get accepted. First, the implementation and
> its interface need to get close to final form, and then the long slow
> process of adoption into things like matplotlib needs to take place. I'd
> guess three to five years for that process.
>
> That said, my main concern is to move forward and not spend the next year
> waiting. I see splitting the masked code out as rather like the python
> types having pointers to sequence/numerical/etc methods, i.e., ndarray then
> looks something like an abstract class. I don't have a problem with that
> and it does avoid base object bloat. As to having PyArray_DATA fail for
> masked arrays and provide new functions for unrestricted access, I'd be
> tempted to have PyArray_DATA continue to behave as it does and let the new
> functions return the error for masked arrays. Making third party
> applications fail for masked arrays is going to make masked arrays very
> unpopular. Most likely no one would use them and third party applications
> would feel no pressure to support them. Another possibility might be to
> have a compile flag that determines whether or not PyArray_DATA returns an
> error for masked arrays, something like we do now for deprecating old
> macros.
>
>
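The alternative described above (PyArray_DATA stays permissive, a new accessor reports the error, and a build-time switch can make PyArray_DATA strict) can be modelled roughly as follows. Everything here is a hypothetical pure-Python stand-in: the strict_data argument plays the role of the compile flag, and none of the names come from NumPy.

```python
class MaskedDataError(Exception):
    """Stands in for the error a checked C accessor would report."""


class ToyArray:
    """Minimal stand-in for an array: a data list plus an optional NA-mask."""

    def __init__(self, data, mask=None):
        self.data = list(data)
        self.mask = None if mask is None else list(mask)  # True marks NA


def make_accessors(strict_data=False):
    """Return (data, checked_data); strict_data models the compile flag."""

    def checked_data(arr):
        # New accessor: always errors out on masked arrays.
        if arr.mask is not None:
            raise MaskedDataError("array has an NA-mask")
        return arr.data

    def permissive_data(arr):
        # Existing behaviour: hand back the data regardless of any mask.
        return arr.data

    data = checked_data if strict_data else permissive_data
    return data, checked_data


masked = ToyArray([1, 2, 3], mask=[False, True, False])

# Default "build": existing callers keep working, new callers can opt in.
data, checked_data = make_accessors(strict_data=False)
print(data(masked))

# Strict "build": the old accessor now fails on masked input as well.
strict_data_accessor, _ = make_accessors(strict_data=True)
try:
    strict_data_accessor(masked)
except MaskedDataError as exc:
    print("strict build rejects masked input:", exc)
```

The trade-off is visible in the default case: third-party code keeps running on masked arrays, at the cost of possibly reading elements that the mask marks as NA.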
My own plan for the near term would be as follows:

1) Put in the experimental option and get the 1.7 release out. This gets us
through the next couple of months and keeps things moving.

2) Look at what hooks/low level functions would let us reimplement np.ma.
Because there are so many different mask uses out there, this would be a
good way to discover what low level support is likely to provide a good
basis for others to build on.

3) Revisit the idea of making all ndarrays masked by default, but do so
with the experience and feedback from current mask users.

Chuck