[Numpy-discussion] NA-mask interactions with existing C code

Thu May 10 19:02:37 EDT 2012

On 05/11/2012 12:47 AM, Dag Sverre Seljebotn wrote:
> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>> I did some searching for typical Cython and C code which accesses numpy
>> arrays, and added a section to the NEP describing how they behave in the
>> current implementation. Cython code which uses either straight Python
>> access or the buffer protocol is fine (after a bugfix in numpy; the
>> PEP 3118 case was not failing as it should). C code which
>> follows the recommended practice of using PyArray_FromAny or one of the
>> related macros is also fine, because these functions have been made to
>> fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.
>>
>> In general, code which follows the recommended numpy practices will
>> raise exceptions when encountering NA-masked arrays. This means
>> programmers don't have to worry about the NA unless they want to support
>> it. Having things go through PyArray_FromAny also provides a place where
>> lazy evaluation arrays could be evaluated, and a hook that other
>> potential future extensions can use to provide compatibility.
>>
>> Here's the section I added to the NEP:
>>
>> Interaction With Pre-existing C API Usage
>> =========================================
>>
>> Making sure existing code using the C API, whether it's written in C, C++,
>> or Cython, does something reasonable is an important goal of this
>> implementation.
>> The general strategy is to make existing code which does not explicitly
>> tell numpy it supports NA masks fail with an exception saying so. There are
>> a few different access patterns people use to get hold of the numpy
>> array data; here we examine a few of them to see what numpy can do. These
>> examples were found by doing Google searches for numpy C API array access.
>>
>> Numpy Documentation - How to extend NumPy
>> -----------------------------------------
>>
>> http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>>
>> This page has a section "Dealing with array objects" which has some advice
>> for how to access numpy arrays from C. When accepting arrays, the first step
>> it suggests is to use PyArray_FromAny or a macro built on that function, so
>> code following this advice will properly fail when given an NA-masked array
>> it doesn't know how to handle.
>>
>> The way this is handled is that PyArray_FromAny requires a special flag,
>> NPY_ARRAY_ALLOWNA, before it will allow NA-masked arrays to flow through.
>>
>> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>>
>> Code which does not follow this advice, and instead just calls
>> PyArray_Check() to verify it's an ndarray and checks some flags, will
>> silently produce incorrect results. This style of code does not provide any
>> opportunity for numpy to say "hey, this array is special", so it is also not
>> compatible with future ideas of lazy evaluation, derived dtypes, etc.
>
> This doesn't really cover the Cython code I write that interfaces with C
> (and probably the code others write in Cython).
>
> Often I'd do:
>
> def f(arg):
>       cdef np.ndarray arr = np.asarray(arg)
>       c_func(np.PyArray_DATA(arr))
>
> So I mix Python np.asarray with C PyArray_DATA. In general, I think you
> use PyArray_FromAny if you're very concerned about performance or need
> some special flag, but it's certainly not the first thing you try.
>
> But in general, I will often be lazy and just do
>
> def f(np.ndarray arr):
>       c_func(np.PyArray_DATA(arr))
>
> It's an exception if you don't provide an array -- so who cares. (I
> guess the odds of somebody feeding a masked array to code like that,
> which doesn't try to be friendly, are relatively small though.)
>
> If you know the datatype, you can really do
>
> def f(np.ndarray[double] arr):
>       c_func(&arr[0])
>
> which works with PEP 3118. But I use PyArray_DATA out of habit (and
> since it works in the cases without dtype).
>
> Frankly, I don't expect any Cython code to do the right thing here;
> calling PyArray_FromAny is much more typing. And really, nobody ever
> questioned that if we had an actual ndarray instance, we'd be allowed to
> call PyArray_DATA.
>
> I don't know how much Cython code is out there in the wild for which
> this is a problem. Either way, it would cause something of a reeducation
> challenge for Cython users.
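
To make the concern concrete, here is a minimal pure-Python sketch of the
two access patterns discussed above. Everything in it is a hypothetical
stand-in and not NumPy's API: ToyArray models an ndarray with an optional
NA mask, from_any models a conversion gate in the spirit of
PyArray_FromAny (with allow_na playing the role of NPY_ARRAY_ALLOWNA), and
raw_data models grabbing the data pointer directly, as PyArray_DATA does.

```python
class ToyArray:
    """Hypothetical stand-in for an ndarray with an optional NA mask."""

    def __init__(self, values, mask=None):
        self.values = list(values)
        # mask[i] is True where element i is NA; None means "no NA mask".
        self.mask = None if mask is None else list(mask)


def from_any(obj, allow_na=False):
    """Toy conversion gate: rejects NA-masked input unless the caller opts in."""
    arr = obj if isinstance(obj, ToyArray) else ToyArray(obj)
    if arr.mask is not None and not allow_na:
        raise ValueError("this code does not support NA-masked arrays")
    return arr


def raw_data(arr):
    """Toy analogue of reading the data pointer: the mask is never consulted."""
    return arr.values


def total_via_gate(obj):
    # Pattern 1: everything goes through the gate before touching the data.
    return sum(raw_data(from_any(obj)))


def total_unchecked(arr):
    # Pattern 2: only a type check, as with a typed `np.ndarray arr` argument.
    if not isinstance(arr, ToyArray):
        raise TypeError("expected a ToyArray")
    return sum(raw_data(arr))


masked = ToyArray([1.0, 2.0, 3.0], mask=[False, True, False])

try:
    total_via_gate(masked)
    gate_result = "no error"
except ValueError as exc:
    gate_result = "rejected: " + str(exc)

# The NA slot is silently included in the sum.
unchecked_result = total_unchecked(masked)
```

In this toy model the gated version fails loudly on the masked input,
while the unchecked version returns a sum that includes the element marked
NA, which is the "silently produce incorrect results" case.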

Also note that Cython users are in the habit of accessing "arr.data" 
(which is the char*, not the buffer object) directly. Just in case you 
had the idea of grepping for PyArray_DATA in Cython code.

Our plan there is to eventually put out a Cython version which
special-cases np.ndarray and turns ".data" into a call to PyArray_DATA
(and the same for shape, strides, ...). Ugly hack, but it avoids breaking
existing Cython code if NumPy removes the field access.
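
A rough Python analogue of that rewrite, as a sketch only: the names
_get_data and ArrayView are hypothetical, and Cython would do this
substitution at compile time rather than with a property. The point is
just that the "arr.data" spelling keeps working while the access itself
goes through a function.

```python
def _get_data(arr):
    """Hypothetical accessor, standing in for a call like PyArray_DATA(arr)."""
    return arr._buffer


class ArrayView:
    """Hypothetical array type whose storage field is no longer public."""

    def __init__(self, buffer):
        self._buffer = buffer

    @property
    def data(self):
        # Attribute syntax is preserved, but every read goes through the accessor.
        return _get_data(self)


view = ArrayView(bytearray(b"\x01\x02\x03"))
first_byte = view.data[0]
```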

Dag

>
> Dag
>
>>
>> Tutorial From Cython Website
>> ----------------------------
>>
>> http://docs.cython.org/src/tutorial/numpy.html
>>
>> This tutorial gives a convolution example, and all the examples fail with
>> Python exceptions when given inputs that contain NA values.
>>
>> Before any Cython type annotation is introduced, the code functions just
>> as equivalent Python would in the interpreter.
>>
>> When the type information is introduced, it is done via numpy.pxd which
>> defines a mapping between an ndarray declaration and PyArrayObject \*.
>> Under the hood, this maps to __Pyx_ArgTypeTest, which does a direct
>> comparison of Py_TYPE(obj) against the PyTypeObject for the ndarray.
>>
>> Then the code does some dtype comparisons, and uses regular python indexing
>> to access the array elements. This python indexing still goes through the
>> Python API, so the NA handling and error checking in numpy still can work
>> like normal and fail if the inputs have NAs which cannot fit in the output
>> array. In this case it fails when trying to convert the NA into an integer
>> to set it in the output.
>>
>> The next version of the code introduces more efficient indexing. This
>> operates based on Python's buffer protocol. This causes Cython to call
>> __Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which calls
>> PyObject_GetBuffer. This call gives numpy the opportunity to raise an
>> exception if the inputs are arrays with NA-masks, something not supported
>> by the Python buffer protocol.
>>
>> Numerical Python - JPL website
>> ------------------------------
>>
>> http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html
>>
>> This document is from 2001, so does not reflect recent numpy, but it is the
>> second hit when searching for "numpy c api example" on google.
>>
>> The first example, heading "A simple example", is in fact already invalid
>> for recent numpy even without the NA support. In particular, if the data is
>> misaligned or in a different byteorder, it may crash or produce incorrect
>> results.
>>
>> The next thing the document does is introduce PyArray_ContiguousFromObject,
>> which gives numpy an opportunity to raise an exception when NA-masked arrays
>> are used, so the later code will raise exceptions as desired.
>>
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion