[Numpy-discussion] NA-mask interactions with existing C code

Travis Oliphant travis at continuum.io
Fri May 11 01:36:18 EDT 2012

>> I guess this mixture of Python-API and C-API is different from the way
>> the API tries to protect against incorrect access. From the Python API, it
>> should let everything through, because it's for Python code to use. From
>> the C API, it should default to not letting things through, because
>> special NA-mask aware code needs to be written. I'm not sure if there is
>> a reasonable approach here which works for everything.
> Does that mean you consider changing ob_type for masked arrays 
> unreasonable? They can still use the same object struct...
>>    But in general, I will often be lazy and just do
>>    def f(np.ndarray arr):
>>         c_func(np.PyArray_DATA(arr))
>>    It's an exception if you don't provide an array -- so who cares. (I
>>    guess the odds of somebody feeding a masked array to code like that,
>>    which doesn't try to be friendly, is relatively smaller though.)
>> This code would already fail with non-contiguous strides or byte-swapped
>> data, so the additional NA mask case seems to fit in an already-failing
>> category.
> Honestly! I hope you didn't think I provided a full-fledged example? 
> Perhaps you'd like to point out to me that "c_func" is a bad name for a 
> function as well?
> One would of course check that things are contiguous (or pass on the 
> strides), check the dtype and dispatch to different C functions in each 
> case, etc.
> But that isn't the point. Scientific code most of the time does fall in 
> the "already-failing" category. That doesn't mean it doesn't count. 
> Let's focus on the number of code lines written and developer hours that 
> will be spent cleaning up the mess -- not the "validity" of the code in 
> question.
>>    If you know the datatype, you can really do
>>    def f(np.ndarray[double] arr):
>>         c_func(&arr[0])
>>    which works with PEP 3118. But I use PyArray_DATA out of habit (and
>>    since it works in the cases without dtype).
>>    Frankly, I don't expect any Cython code to do the right thing here;
>>    calling PyArray_FromAny is much more typing. And really, nobody ever
>>    questioned that if we had an actual ndarray instance, we'd be allowed to
>>    call PyArray_DATA.
>>    I don't know how much Cython code is out there in the wild for which
>>    this is a problem. Either way, it would cause something of a reeducation
>>    challenge for Cython users.
>> Since this style of coding already has known problems, do you think the
>> case with NA-masks deserves more attention here? What will happen is
>> access to array element data without consideration of the mask, which
>> seems similar in nature to accessing array data with the wrong stride or
>> byte order.
> I don't agree with the premise of that paragraph. There's no reason to 
> assume that just because code doesn't call FromAny, it has problems. 
> (And I'll continue to assume that whatever array is returned from 
> "np.ascontiguousarray" is really contiguous...)
> Whether it requires attention or not is a different issue though. I'm 
> not sure. I think other people should weigh in on that -- I mostly write 
> code for my own consumption.
> One should at least check pandas, scikits-image, scikits-learn, mpi4py, 
> petsc4py, and so on. And ask on the Cython users list. Hopefully it will 
> usually be PEP 3118. But now I need to turn in.
> Travis, would such a survey be likely to affect the outcome of your 
> decision in any way? Or should we just leave this for now?

This dialog gets at the heart of the matter, I think. The NEP seems to want NumPy to have a "better" API that always shields downstream users from having to understand what is actually under the covers. It would prefer to push NumPy in the direction of an array object that is fundamentally more opaque. However, the world NumPy lives in is decidedly not opaque. There has been significant education and shared understanding of what a NumPy array actually *is* (a strided view of memory of a particular "dtype"). This shared understanding has even been pushed into Python as the buffer protocol. It is very common for extension modules to go directly to the data they want by using this understanding.
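
A minimal sketch of that shared understanding as seen from Python, assuming a reasonably recent NumPy. The variable names are illustrative only, and the "naive" read stands in for C code that takes the data pointer and ignores the strides.

```python
import numpy as np

# A 3x4 array of 32-bit integers: one 48-byte block plus shape/stride metadata.
a = np.arange(12, dtype=np.int32).reshape(3, 4)
row_major_strides = a.strides        # bytes to step along each axis

# Slicing and transposing create new views over the same memory;
# only the shape and strides change, no data is copied.
every_other_col = a[:, ::2]
transposed = a.T

# The buffer protocol hands the same description to any consumer.
m = memoryview(every_other_col)
exported = (m.itemsize, m.shape, m.strides, m.c_contiguous)

# A consumer that reads the underlying block linearly (as a bare data
# pointer would) sees different values from the ones the view describes.
naive = np.frombuffer(a, dtype=np.int32)[:6].tolist()
correct = every_other_col.ravel().tolist()

print(row_major_strides, transposed.strides, exported)
print(naive, correct)
```

The view and the original share one block of memory; everything else is metadata that the consumer is expected to honor.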

This is very different from the traditional "shield your users from how things are actually done" view of most object APIs. It was actually intentional. I'm not saying that different choices could not have been made or that some amount of shielding should never be contemplated. I'm just saying that NumPy has been used as a nice bridge between high-level code and the world of scientific computing codes that have chunks of memory allocated for processing. Part of the reason for this bridge has been the simple object model.

I just don't think the NEP fully appreciates just how fundamental a shift this is for the wider NumPy community. It is not something that can be done immediately or without careful attention.

Dag is an *active* member of that larger group of C consumers of NumPy arrays. I am a long-time member of that group myself, and that is where my concerns are coming from. So far I am not hearing anything to alleviate those concerns.

See my post in the other thread for my proposal to add a flag that lets users switch the Python-side default between ndarray and ndmasked, while the two remain different types at the C level. The proposal so far does not specify whether ndarray or ndmasked is a subclass of the other. Given the history of numpy.ma and the fact that it makes sense at the C level, I would lean toward ndmasked being a subclass of ndarray. A C user would then have to call PyArray_CheckExact to ensure they are getting a base Python array object, which they would have to do anyway because numpy.ma arrays also pass PyArray_Check.
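
The numpy.ma situation mentioned above can be sketched from Python, assuming the existing numpy.ma module. The helpers `check` and `check_exact` are hypothetical names that only roughly mirror what PyArray_Check and PyArray_CheckExact test at the C level; the proposed ndmasked type does not exist and is not modelled here.

```python
import numpy as np

plain = np.arange(4.0)
masked = np.ma.masked_array(plain, mask=[False, True, False, False])

def check(obj):
    # Rough analogue of PyArray_Check: ndarray or any subclass passes.
    return isinstance(obj, np.ndarray)

def check_exact(obj):
    # Rough analogue of PyArray_CheckExact: only the base type passes.
    return type(obj) is np.ndarray

# The masked array passes the loose check but not the exact one.
print(check(plain), check(masked))              # True True
print(check_exact(plain), check_exact(masked))  # True False

# Code that accepts the subclass and goes straight to the data
# silently includes the masked element in its result.
mask_aware_sum = float(masked.sum())
raw_data_sum = float(np.asarray(masked).sum())
print(mask_aware_sum, raw_data_sum)             # 5.0 6.0
```

The last two lines show why the exact check matters: the raw data still holds the masked value, so code that never looks at the mask computes a different answer.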

Best regards,


> Dag
>> Cheers,
>> Mark
>>    Dag
>>> Tutorial From Cython Website
>>> ----------------------------
>>> http://docs.cython.org/src/tutorial/numpy.html
>>> This tutorial gives a convolution example, and all the examples fail with
>>> Python exceptions when given inputs that contain NA values.
>>> Before any Cython type annotation is introduced, the code functions just
>>> as equivalent Python would in the interpreter.
>>> When the type information is introduced, it is done via numpy.pxd which
>>> defines a mapping between an ndarray declaration and PyArrayObject *.
>>> Under the hood, this maps to __Pyx_ArgTypeTest, which does a direct
>>> comparison of Py_TYPE(obj) against the PyTypeObject for the ndarray.
>>> Then the code does some dtype comparisons, and uses regular Python indexing
>>> to access the array elements. This Python indexing still goes through the
>>> Python API, so the NA handling and error checking in numpy still can work
>>> like normal and fail if the inputs have NAs which cannot fit in the output
>>> array. In this case it fails when trying to convert the NA into an integer
>>> to set it in the output.
>>> The next version of the code introduces more efficient indexing. This
>>> operates based on Python's buffer protocol. This causes Cython to call
>>> __Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which calls
>>> PyObject_GetBuffer. This call gives numpy the opportunity to raise an
>>> exception if the inputs are arrays with NA-masks, something not supported
>>> by the Python buffer protocol.
>>> Numerical Python - JPL website
>>> ------------------------------
>>> http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html
>>> This document is from 2001, so does not reflect recent numpy, but it is the
>>> second hit when searching for "numpy c api example" on google.
>>> Their first example, heading "A simple example", is in fact already
>>> invalid for recent numpy even without the NA support. In particular, if the
>>> data is misaligned or in a different byteorder, it may crash or produce
>>> incorrect results.
>>> The next thing the document does is introduce
>>> PyArray_ContiguousFromObject, which gives numpy an opportunity to raise an
>>> exception when NA-masked arrays are used, so the later code will raise
>>> exceptions as desired.
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org <mailto:NumPy-Discussion at scipy.org>
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
