[Numpy-discussion] Missing data wrap-up and request for comments

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 00:05:05 EDT 2012


On 05/10/2012 01:01 AM, Matthew Brett wrote:
> Hi,
>
> On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
>>> Hey all,
>>>
>>> Nathaniel and Mark have worked very hard on a joint document to try and
>>> explain the current status of the missing-data debate. I think they've
>>> done an amazing job at providing some context, articulating their views
>>> and suggesting ways forward in a mutually respectful manner. This is an
>>> exemplary collaboration and is at the core of why open source is valuable.
>>>
>>> The document is available here:
>>> https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>>
>>> After reading that document, it appears to me that there are some
>>> fundamentally different views on how things should move forward. I'm
>>> also reading the document incorporating my understanding of the history
>>> of NumPy as well as all of the users I've met and interacted with, which
>>> means I have my own perspective that is not necessarily incorporated
>>> into that document but informs my recommendations. I'm not sure we can
>>> reach full consensus on this. We are also well past time for moving
>>> forward with a resolution on this (perhaps we can all agree on that).
>>>
>>> I would like one more discussion thread where the technical discussion
>>> can take place. I will make a plea that we keep this discussion as free
>>> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
>>> we can. I can't guarantee that I personally will succeed at that, but I
>>> can tell you that I will try. That's all I'm asking of anyone else. I
>>> recognize that there are a lot of other issues at play here besides
>>> *just* the technical questions, but we are not going to resolve every
>>> community issue in this technical thread.
>>>
>>> We need concrete proposals and so I will start with three. Please feel
>>> free to comment on these proposals or add your own during the
>>> discussion. I will stop paying attention to this thread next Wednesday
>>> (May 16th) (or earlier if the thread dies) and hope that by that time we
>>> can agree on a way forward. If we don't have agreement, then I will move
>>> forward with what I think is the right approach. I will either write the
>>> code myself or convince someone else to write it.
>>>
>>> In all cases, we have agreement that bit-pattern dtypes should be added
>>> to NumPy. We should work on these (int32, float64, complex64, str, bool)
>>> to start. So, the three proposals are independent of this way forward.
>>> The proposals are all about the extra mask part:
>>>
>>> My three proposals:
>>>
>>> * do nothing and leave things as is
>>>
>>> * add a global flag that turns off masked array support by default but
>>> otherwise leaves things unchanged (I'm still unclear how this would work
>>> exactly)
>>>
>>> * move Mark's "masked ndarray objects" into a new fundamental type
>>> (ndmasked), leaving the actual ndarray type unchanged. The
>>> array_interface keeps the masked array notions and the ufuncs keep the
>>> ability to handle arrays like ndmasked. Ideally, numpy.ma would be
>>> changed to use ndmasked objects as their core.
>>>
>>> For the record, I'm currently in favor of the third proposal. Feel free
>>> to comment on these proposals (or provide your own).
>>>
>>
>> Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!
>
> Yes, it is very well written, my compliments to the chefs.
>
>> The third proposal is certainly the best one from Cython's perspective;
>> and I imagine for those writing C extensions against the C API too.
>> Having PyType_Check fail for ndmasked is a very good way of having code
>> fail that is not written to take masks into account.

I want to clarify something: there are two Cython cases. In the case of
"cdef np.ndarray[double]" there is no problem, as PEP 3118 access will
raise an exception for masked arrays.

But there's also the case where you do "cdef np.ndarray" and then proceed
to use PyArray_DATA. I do this more often than PEP 3118 access myself,
usually because I pass the data pointer to some C or C++ code.

It'd be great to have such code be forward-compatible in the sense that 
it raises an exception when it meets a masked array. Having PyType_Check 
fail seems like the only way? Am I wrong?
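
To illustrate what I mean by "forward-compatible", here is a minimal
sketch in plain Python rather than Cython. NDMasked is a hypothetical
stand-in for the proposed ndmasked type (it does not exist in NumPy), and
data_address is an illustrative helper name. The only point is that a
type which is not an ndarray subclass gets rejected before anyone touches
the raw data pointer:

```python
import numpy as np


class NDMasked:
    """Hypothetical stand-in for the proposed ndmasked type.

    Deliberately NOT a subclass of np.ndarray, so type checks fail.
    """

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)


def data_address(arr):
    # Reject anything that is not a plain ndarray (or subclass) before
    # exposing the address of the underlying buffer.
    if not isinstance(arr, np.ndarray):
        raise TypeError("expected ndarray, got %s" % type(arr).__name__)
    return arr.ctypes.data  # integer address of the first element


plain = np.zeros(3)
masked = NDMasked([1.0, 2.0], [False, True])

address = data_address(plain)  # fine
try:
    data_address(masked)
except TypeError as exc:
    print("rejected:", exc)
```

Code written today with such a check would keep raising on masked input
without knowing anything about masks.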


> Mark, Nathaniel - can you comment how your chosen approaches would
> interact with extension code?
>
> I'm guessing the bitpattern dtypes would be expected to cause
> extension code to choke if the type is not supported?

The proposal, as I understand it, is to implement the bit-patterns as new
dtypes (?). So extension code that checks the dtype before touching the
data will often be fine for that reason:

# Cython: check the dtype before handing the raw data pointer to C
if arr.dtype == np.float32:
    c_function_32bit(np.PyArray_DATA(arr), ...)
else:
    raise ValueError("need 32-bit float array")
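
For anyone who wants to try the pattern without compiling anything, here
is a rough runnable Python version. It is only a sketch: c_function_32bit
is a hypothetical stand-in that reads floats through a ctypes pointer
instead of calling real C code, call_with_dtype_guard is an illustrative
name, and float64 plays the role of "a dtype the extension does not know
about", since no bit-pattern NA dtype exists to test against. I also
added a contiguity check, which is my own assumption about what such C
code would need.

```python
import ctypes

import numpy as np


def c_function_32bit(ptr, n):
    # Hypothetical stand-in for a C routine: sums n floats read
    # directly from the raw pointer.
    return sum(ptr[i] for i in range(n))


def call_with_dtype_guard(arr):
    # Any dtype other than float32 (including a future NA dtype)
    # is rejected here and never reaches the pointer-based code.
    if arr.dtype != np.float32 or not arr.flags.c_contiguous:
        raise ValueError("need contiguous 32-bit float array")
    ptr = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    return c_function_32bit(ptr, arr.size)


ok = np.array([1.0, 2.0, 3.0], dtype=np.float32)
total = call_with_dtype_guard(ok)

try:
    call_with_dtype_guard(ok.astype(np.float64))
except ValueError as exc:
    print("rejected:", exc)
```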


>
> Mark - in :
>
> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython
>
> - do I understand correctly that you think that Cython and other
> extension writers should use the numpy API to access the data rather
> than accessing it directly via the data pointer and strides?

That's not really fleshed out (for all the different use cases etc.); I
read that as "let's discuss Cython later, when this is actively used in
NumPy", which sounds reasonable to me.

Dag


