[Numpy-discussion] Missing data wrap-up and request for comments
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Thu May 10 00:05:05 EDT 2012
On 05/10/2012 01:01 AM, Matthew Brett wrote:
> On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
>>> Hey all,
>>> Nathaniel and Mark have worked very hard on a joint document to try and
>>> explain the current status of the missing-data debate. I think they've
>>> done an amazing job at providing some context, articulating their views
>>> and suggesting ways forward in a mutually respectful manner. This is an
>>> exemplary collaboration and is at the core of why open source is valuable.
>>> The document is available here:
>>> After reading that document, it appears to me that there are some
>>> fundamentally different views on how things should move forward. I'm
>>> also reading the document incorporating my understanding of the history
>>> of NumPy as well as all of the users I've met and interacted with, which
>>> means I have my own perspective that is not necessarily incorporated
>>> into that document but informs my recommendations. I'm not sure we can
>>> reach full consensus on this. We are also well past time for moving
>>> forward with a resolution on this (perhaps we can all agree on that).
>>> I would like one more discussion thread where the technical discussion
>>> can take place. I will make a plea that we keep this discussion as free
>>> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
>>> we can. I can't guarantee that I personally will succeed at that, but I
>>> can tell you that I will try. That's all I'm asking of anyone else. I
>>> recognize that there are a lot of other issues at play here besides
>>> *just* the technical questions, but we are not going to resolve every
>>> community issue in this technical thread.
>>> We need concrete proposals and so I will start with three. Please feel
>>> free to comment on these proposals or add your own during the
>>> discussion. I will stop paying attention to this thread next Wednesday
>>> (May 16th) (or earlier if the thread dies) and hope that by that time we
>>> can agree on a way forward. If we don't have agreement, then I will move
>>> forward with what I think is the right approach. I will either write the
>>> code myself or convince someone else to write it.
>>> In all cases, we have agreement that bit-pattern dtypes should be added
>>> to NumPy. We should work on these (int32, float64, complex64, str, bool)
>>> to start. So, the three proposals are independent of this way forward.
>>> The proposals are all about the extra mask part:
>>> My three proposals:
>>> * do nothing and leave things as is
>>> * add a global flag that turns off masked array support by default but
>>> otherwise leaves things unchanged (I'm still unclear how this would work)
>>> * move Mark's "masked ndarray objects" into a new fundamental type
>>> (ndmasked), leaving the actual ndarray type unchanged. The
>>> array_interface keeps the masked array notions and the ufuncs keep the
>>> ability to handle arrays like ndmasked. Ideally, numpy.ma
>>> <http://numpy.ma> would be changed to use ndmasked objects as their core.
>>> For the record, I'm currently in favor of the third proposal. Feel free
>>> to comment on these proposals (or provide your own).
>> Bravo! NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!
> Yes, it is very well written, my compliments to the chefs.
>> The third proposal is certainly the best one from Cython's perspective;
>> and I imagine for those writing C extensions against the C API too.
>> Having PyType_Check fail for ndmasked is a very good way of having code
>> fail that is not written to take masks into account.
I want to clarify something: there are two Cython cases. In the case of
"cdef np.ndarray[double]" there is no problem, as PEP 3118 access will
raise an exception for masked arrays.
But there's also the case where you do "cdef np.ndarray" and then proceed
to use PyArray_DATA. Personally, I do this more often than PEP 3118 access,
usually because I pass the data pointer to some C or C++ code.
It'd be great to have such code be forward-compatible in the sense that
it raises an exception when it meets a masked array. Having PyType_Check
fail seems like the only way? Am I wrong?
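To make the argument above concrete, here is a rough pure-Python sketch of
why a separate type helps. NDArray, NDMasked and raw_data are toy names made
up for illustration; they are not NumPy's real types or API. The only point
is that when the masked type is not a subclass, a type check written before
masks existed rejects it instead of silently ignoring the mask.

```python
class NDArray:
    """Toy stand-in for a plain array type (not NumPy's real ndarray)."""

    def __init__(self, data):
        self.data = list(data)


class NDMasked:
    """Toy stand-in for a separate masked type; deliberately NOT a subclass."""

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)


def raw_data(arr):
    """Mimic extension code that type-checks before touching the raw buffer."""
    if not isinstance(arr, NDArray):
        raise TypeError(f"expected NDArray, got {type(arr).__name__}")
    return arr.data


print(raw_data(NDArray([1.0, 2.0])))  # a plain array passes the check

try:
    raw_data(NDMasked([1.0, 2.0], [False, True]))
except TypeError as exc:
    print("rejected:", exc)  # the masked object never reaches the buffer access
```

If NDMasked were instead a subclass of NDArray, the isinstance check would
pass and raw_data would return the data with the mask ignored, which is the
situation I would like to avoid.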
> Mark, Nathaniel - can you comment how your chosen approaches would
> interact with extension code?
> I'm guessing the bitpattern dtypes would be expected to cause
> extension code to choke if the type is not supported?
The proposal, as I understand it, is to use that with new dtypes (?). So
things will often be fine for that reason:
    if arr.dtype != np.float32:
        raise ValueError("need 32-bit float array")
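A slightly fuller sketch of that kind of guard, under some assumptions:
FakeArray is a made-up stand-in that only carries a dtype attribute (so the
snippet runs without NumPy), require_dtype is a hypothetical helper, and
"NA[float32]" is an invented name for an NA-aware float32 dtype, not
necessarily what such a dtype would actually be called.

```python
from dataclasses import dataclass


@dataclass
class FakeArray:
    """Minimal stand-in exposing only a ``dtype`` attribute (hypothetical)."""

    dtype: str


def require_dtype(arr, expected):
    """Raise unless the array has exactly the dtype this code was written for."""
    if arr.dtype != expected:
        raise ValueError(f"need {expected} array, got {arr.dtype}")
    return arr


require_dtype(FakeArray("float32"), "float32")  # accepted

try:
    # "NA[float32]" is a made-up name for an NA-aware float32 dtype.
    require_dtype(FakeArray("NA[float32]"), "float32")
except ValueError as exc:
    print("rejected:", exc)
```

Code that already checks for an exact dtype like this would reject the new
dtype without modification; code that never checks the dtype gets no such
protection.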
> Mark - in :
> - do I understand correctly that you think that Cython and other
> extension writers should use the numpy API to access the data rather
> than accessing it directly via the data pointer and strides?
That's not really fleshed out (for all the different use cases etc.); I
read that as "let's discuss Cython later, when this is actively used in
NumPy". Which sounds reasonable to me.