[Numpy-discussion] Missing data wrap-up and request for comments

Matthew Brett matthew.brett at gmail.com
Wed May 9 19:01:25 EDT 2012


On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
>> Hey all,
>> Nathaniel and Mark have worked very hard on a joint document to try and
>> explain the current status of the missing-data debate. I think they've
>> done an amazing job at providing some context, articulating their views
>> and suggesting ways forward in a mutually respectful manner. This is an
>> exemplary collaboration and is at the core of why open source is valuable.
>> The document is available here:
>> https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>> After reading that document, it appears to me that there are some
>> fundamentally different views on how things should move forward. I'm
>> also reading the document in light of my understanding of the history
>> of NumPy, as well as all of the users I've met and interacted with, which
>> means I have my own perspective that is not necessarily incorporated
>> into that document but informs my recommendations. I'm not sure we can
>> reach full consensus on this. We are also well past time for moving
>> forward with a resolution on this (perhaps we can all agree on that).
>> I would like one more discussion thread where the technical discussion
>> can take place. I will make a plea that we keep this discussion as free
>> from logical fallacies (http://en.wikipedia.org/wiki/Logical_fallacy) as
>> we can. I can't guarantee that I personally will succeed at that, but I
>> can tell you that I will try. That's all I'm asking of anyone else. I
>> recognize that there are a lot of other issues at play here besides
>> *just* the technical questions, but we are not going to resolve every
>> community issue in this technical thread.
>> We need concrete proposals and so I will start with three. Please feel
>> free to comment on these proposals or add your own during the
>> discussion. I will stop paying attention to this thread next Wednesday
>> (May 16th) (or earlier if the thread dies) and hope that by that time we
>> can agree on a way forward. If we don't have agreement, then I will move
>> forward with what I think is the right approach. I will either write the
>> code myself or convince someone else to write it.
>> In all cases, we have agreement that bit-pattern dtypes should be added
>> to NumPy. We should work on these (int32, float64, complex64, str, bool)
>> to start. So, the three proposals are independent of this way forward.
>> The proposals are all about the extra mask part:
>> My three proposals:
>> * do nothing and leave things as is
>> * add a global flag that turns off masked array support by default but
>> otherwise leaves things unchanged (I'm still unclear how this would work
>> exactly)
>> * move Mark's "masked ndarray objects" into a new fundamental type
>> (ndmasked), leaving the actual ndarray type unchanged. The
>> array_interface keeps the masked array notions and the ufuncs keep the
>> ability to handle arrays like ndmasked. Ideally, numpy.ma
>> would be changed to use ndmasked objects as its core.
>> For the record, I'm currently in favor of the third proposal. Feel free
>> to comment on these proposals (or provide your own).
> Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

Yes, it is very well written, my compliments to the chefs.

> The third proposal is certainly the best one from Cython's perspective;
> and I imagine for those writing C extensions against the C API too.
> Having PyType_Check fail for ndmasked is a very good way of having code
> fail that is not written to take masks into account.
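
To make the point about type checks concrete, here is a minimal pure-Python
sketch. Everything in it is a hypothetical stand-in (PlainArray,
MaskedSubclass, SeparateMasked and extension_sum are invented for
illustration and are not NumPy names), and isinstance is used only as a
rough Python-level analogue of a C-level type check. It assumes that the
proposed ndmasked would not be a subclass of ndarray.

```python
# Hypothetical stand-ins for illustration only; none of these are NumPy classes.
class PlainArray:
    """Stand-in for an unmasked array: just holds values."""
    def __init__(self, values):
        self.values = list(values)


class MaskedSubclass(PlainArray):
    """Mask added via subclassing, so type checks on PlainArray still pass."""
    def __init__(self, values, mask):
        super().__init__(values)
        self.mask = list(mask)


class SeparateMasked:
    """Mask carried by an unrelated type (assumed shape of the third proposal)."""
    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)


def extension_sum(arr):
    """Mimics mask-unaware extension code guarded by a type check."""
    if not isinstance(arr, PlainArray):
        raise TypeError("expected PlainArray, got " + type(arr).__name__)
    return sum(arr.values)


values = [1.0, 2.0, 4.0]
mask = [False, True, False]  # True marks a missing element

# Subclass case: the check passes and the masked element is silently included.
silent_result = extension_sum(MaskedSubclass(values, mask))

# Separate-type case: the same code refuses the input instead of ignoring the mask.
try:
    extension_sum(SeparateMasked(values, mask))
    failed_loudly = False
except TypeError:
    failed_loudly = True
```

In the subclass case the mask-unaware code returns a sum that includes the
masked element; in the separate-type case it raises TypeError, which is the
"fail loudly" behaviour described above.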

Mark, Nathaniel - can you comment on how your chosen approaches would
interact with extension code?

I'm guessing the bitpattern dtypes would be expected to cause
extension code to choke if the type is not supported?

Mark - in :


- do I understand correctly that you think that Cython and other
extension writers should use the numpy API to access the data rather
than accessing it directly via the data pointer and strides?
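
For anyone following along, the distinction I am asking about can be
sketched as below. This is only an illustration using the standard-library
buffer protocol (memoryview and struct) as a stand-in for an array's memory
layout; it does not use NumPy, and read_direct is a hypothetical helper
name. The point is that code walking the raw bytes with strides never sees
anything except the data buffer, so a separate mask would be invisible to it.

```python
import struct

ROWS, COLS = 2, 3

# Raw buffer holding six C doubles in row-major (C) order.
raw = bytearray(struct.pack("6d", 0.0, 1.0, 2.0, 10.0, 11.0, 12.0))

# "API" access: a 2-D view that knows its own shape and strides.
view = memoryview(raw).cast("d", shape=[ROWS, COLS])


def read_direct(buf, strides, i, j):
    """Read element (i, j) by computing a byte offset from the strides."""
    offset = i * strides[0] + j * strides[1]
    return struct.unpack_from("d", buf, offset)[0]


via_api = view[1, 2]                                # indexing through the view
via_pointer = read_direct(raw, view.strides, 1, 2)  # manual offset arithmetic
```

Both paths return the same element here, but only the first goes through an
object that could, in principle, consult extra state such as a mask.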


