[Numpy-discussion] Masked Array for NumPy 1.7
Charles R Harris
charlesr.harris at gmail.com
Sat May 19 12:02:23 EDT 2012
On Sat, May 19, 2012 at 9:21 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Sat, May 19, 2012 at 10:00 AM, David Cournapeau <cournape at gmail.com> wrote:
>> On Sat, May 19, 2012 at 3:17 PM, Charles R Harris <
>> charlesr.harris at gmail.com> wrote:
>>> On Fri, May 18, 2012 at 3:47 PM, Travis Oliphant <travis at continuum.io> wrote:
>>>> Hey all,
>>>> After reading all the discussion around masked arrays and getting input
>>>> from as many people as possible, it is clear that there is still
>>>> disagreement about what to do, but there have been some fruitful
>>>> discussions that ensued.
>>>> This isn't really new as there was significant disagreement about what
>>>> to do when the masked array code was initially checked in to master. So,
>>>> in order to move forward, Mark and I are going to work together with
>>>> whomever else is willing to help with an effort that is in the spirit of my
>>>> third proposal but has a few adjustments.
>>>> The idea will be fleshed out in more detail as it progresses, but the
>>>> basic concept is to create an (experimental) ndmasked object in NumPy 1.7
>>>> and leave the actual ndarray object unchanged. While the details need to
>>>> be worked out here, a goal is to have the C-API work with both ndmasked
>>>> arrays and arrayobjects (possibly by defining a base-class C-level
>>>> structure that both ndarrays inherit from). This might also be a good
>>>> way for Dag to experiment with his ideas as well but that is not an
>>>> explicit goal.
>>>> One way this could work, for example is to have PyArrayObject * be the
>>>> base-class array (essentially the same C-structure we have now with a
>>>> HASMASK flag). Then, the ndmasked object could inherit from PyArrayObject *
>>>> as well but add more members to the C-structure. I think this is the
>>>> easiest thing to do and requires the least amount of code-change. It
>>>> is also possible to define an abstract base-class PyArrayObject * that both
>>>> ndarray and ndmasked inherit from. That way ndarray and ndmasked are
>>>> siblings even though the ndarray would essentially *be* the PyArrayObject *
>>>> --- just with a different type-hierarchy on the python side.
>>>> This work will take some time and, therefore, I don't expect 1.7 to be
>>>> released prior to SciPy Austin with an end of June target date. The
>>>> timing will largely depend on what time is available from people interested
>>>> in resolving the situation. Mark and I will have some availability for
>>>> this work in June but not a great deal (about 2 man-weeks total between
>>>> us). If there are others who can step in and help, it will help
>>>> accelerate the process.
>>> This will be a difficult thing for others to help with since the concept
>>> is vague, the design decisions seem to be in your and Mark's hands, and you
>>> say you don't have much time. It looks to me like 1.7 will keep slipping
>>> and I don't think that is a good thing. Why not go for option 2, which will
>>> get 1.7 out there and push the new masked array work in to 1.8? Breaking
>>> the flow of development and release has consequences, few of them good.
>> Agreed. 1.6.0 was released one year ago already, let's focus on polishing
>> what's in there *now*. I have not followed closely what the decision was
>> for a LTS release, but if 1.7 is supposed to be it, that's another argument
>> about changing anything there for 1.7.
> The motivation behind splitting the mask out into a separate ndmasked is
> primarily so that pre-existing code will not silently function on NA-masked
> arrays and produce incorrect results. This centres around using
> PyArray_DATA to get at the data after manually checking flags, instead of
> calling PyArray_FromAny. Maybe a reasonable solution is to tweak the
> behavior of PyArray_DATA? It could work as follows:
> - If an ndarray has no mask, PyArray_DATA returns the data pointer as it
> does currently.
> - If the ndarray has an NA-mask, PyArray_DATA sets an exception and
> returns NULL
> - Create a new accessor, PyArray_DATAPTR or PyArray_RAWDATA, which returns
> the array data under all circumstances.
> This way, code which currently uses the data pointer through PyArray_DATA
> will fail instead of silently working with the wrong interpretation of the
> data. What do people feel about this idea?
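
For illustration, the accessor behaviour proposed in the quoted message can be modelled in a few lines of Python. This is only a toy sketch, not NumPy's C API: `ToyArray`, `array_data` and `array_rawdata` are hypothetical names standing in for an ndarray, the modified PyArray_DATA, and the new PyArray_DATAPTR/PyArray_RAWDATA accessor, and raising `TypeError` stands in for "sets an exception and returns NULL".

```python
class ToyArray:
    """Hypothetical stand-in for an ndarray: a data buffer plus an optional NA-mask."""

    def __init__(self, data, mask=None):
        self.data = data
        self.mask = mask  # None means the array has no NA-mask


def array_data(arr):
    """Models the proposed PyArray_DATA: refuse to hand out data for NA-masked arrays."""
    if arr.mask is not None:
        # The C accessor would set an exception and return NULL; here we raise.
        raise TypeError("array has an NA-mask; use array_rawdata() instead")
    return arr.data


def array_rawdata(arr):
    """Models the proposed raw accessor: return the data under all circumstances."""
    return arr.data


plain = ToyArray([1.0, 2.0, 3.0])
masked = ToyArray([1.0, 2.0, 3.0], mask=[False, True, False])

# Pre-existing code that only knows about array_data keeps working on plain arrays.
print(sum(array_data(plain)))

# The same code fails loudly on a masked array instead of silently using masked values.
try:
    sum(array_data(masked))
except TypeError as exc:
    print("refused:", exc)

# Code that has been updated to handle masks opts in to the raw buffer explicitly.
print(sum(array_rawdata(masked)))
```

The point of the sketch is the trade-off under discussion: unmodified callers get an error rather than a wrong answer, at the cost of every caller having to be updated before it can accept masked arrays at all.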
Code working with the wrong interpretation of the data doesn't bother me
much at this point in development. Long term it matters, but in the short
term we can't expect code not explicitly written to work with masked arrays
to do the right thing. I think we are looking at a period of several years
before things settle out and get accepted. First, the implementation and
its interface need to get close to final form, and then the long slow
process of adoption into things like matplotlib needs to take place. I'd
guess three to five years for that process.
That said, my main concern is to move forward and not spend the next year
waiting. I see splitting the masked code out as rather like the python
types having pointers to sequence/numerical/etc methods, i.e., ndarray then
looks something like an abstract class. I don't have a problem with that
and it does avoid base object bloat. As to having PyArray_DATA fail for
masked arrays and provide new functions for unrestricted access, I'd be
tempted to have PyArray_DATA continue to behave as it does and let the new
functions return the error for masked arrays. Making third party
applications fail for masked arrays is going to make masked arrays very
unpopular. Most likely no one would use them and third party applications
would feel no pressure to support them. Another possibility might be to
have a compile flag that determines whether or not PyArray_DATA returns an
error for masked arrays, something like we do now for deprecating old