[Numpy-discussion] NA-mask interactions with existing C code

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 21:01:06 EDT 2012



Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:

>On 05/11/2012 01:06 AM, Mark Wiebe wrote:
>> On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>
>>     On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>>      > I did some searching for typical Cython and C code which accesses numpy
>>      > arrays, and added a section to the NEP describing how they behave in the
>>      > current implementation. Cython code which uses either straight Python
>>      > access or the buffer protocol is fine (after a bugfix in numpy, it
>>      > wasn't failing currently as it should in the pep3118 case). C code which
>>      > follows the recommended practice of using PyArray_FromAny or one of the
>>      > related macros is also fine, because these functions have been made to
>>      > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.
>>      >
>>      > In general, code which follows the recommended numpy practices will
>>      > raise exceptions when encountering NA-masked arrays. This means
>>      > programmers don't have to worry about the NA unless they want to support
>>      > it. Having things go through PyArray_FromAny also provides a place where
>>      > lazy evaluation arrays could be evaluated, and which other potential
>>      > future extensions can use to provide compatibility.
>>      >
>>      > Here's the section I added to the NEP:
>>      >
>>      > Interaction With Pre-existing C API Usage
>>      > =========================================
>>      >
>>      > Making sure existing code using the C API, whether it's written in C, C++,
>>      > or Cython, does something reasonable is an important goal of this
>>      > implementation.
>>      > The general strategy is to make existing code which does not explicitly
>>      > tell numpy it supports NA masks fail with an exception saying so. There are
>>      > a few different access patterns people use to get ahold of the numpy
>>      > array data; here we examine a few of them to see what numpy can do. These
>>      > examples were found by doing google searches for numpy C API array access.
>>      >
>>      > Numpy Documentation - How to extend NumPy
>>      > -----------------------------------------
>>      >
>>      >
>>      > http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>>      >
>>      > This page has a section "Dealing with array objects" which has some
>>      > advice for how to access numpy arrays from C. When accepting arrays,
>>      > the first step it suggests is to use PyArray_FromAny or a macro built
>>      > on that function, so code following this advice will properly fail
>>      > when given an NA-masked array it doesn't know how to handle.
>>      >
>>      > The way this is handled is that PyArray_FromAny requires a special flag,
>>      > NPY_ARRAY_ALLOWNA, before it will allow NA-masked arrays to flow through.
>>      >
>>      >
>>      > http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>>      >
>>      > Code which does not follow this advice, and instead just calls
>>      > PyArray_Check() to verify it's an ndarray and checks some flags, will
>>      > silently produce incorrect results. This style of code does not provide
>>      > any opportunity for numpy to say "hey, this array is special", so it is
>>      > also not compatible with future ideas of lazy evaluation, derived
>>      > dtypes, etc.
>>
>>     This doesn't really cover the Cython code I write that interfaces with C
>>     (and probably the code others write in Cython).
>>
>>     Often I'd do:
>>
>>     def f(arg):
>>          cdef np.ndarray arr = np.asarray(arg)
>>          c_func(np.PyArray_DATA(arr))
>>
>>     So I mix Python np.asarray with C PyArray_DATA. In general, I think you
>>     use PyArray_FromAny if you're very concerned about performance or need
>>     some special flag, but it's certainly not the first thing you try.
>>
>>
>> I guess this mixture of Python-API and C-API is different from the way
>> the API tries to protect against incorrect access. From the Python API, it
>> should let everything through, because it's for Python code to use. From
>> the C API, it should default to not letting things through, because
>> special NA-mask aware code needs to be written. I'm not sure if there is
>> a reasonable approach here which works for everything.
>
>Does that mean you consider changing ob_type for masked arrays 
>unreasonable? They can still use the same object struct...
>
>>
>>     But in general, I will often be lazy and just do
>>
>>     def f(np.ndarray arr):
>>          c_func(np.PyArray_DATA(arr))
>>
>>     It's an exception if you don't provide an array -- so who cares. (I
>>     guess the odds of somebody feeding a masked array to code like that,
>>     which doesn't try to be friendly, are relatively smaller though.)
>>
>>
>> This code would already fail with non-contiguous strides or byte-swapped
>> data, so the additional NA mask case seems to fit in an already-failing
>> category.
>
>Honestly! I hope you didn't think I provided a full-fledged example? 
>Perhaps you'd like to point out to me that "c_func" is a bad name for a
>function as well?

I keep having to apologise; I now see how you must have read my example, given that I called myself 'lazy'. Anyway, I just meant that I would be too lazy to deal with somebody passing anything but exactly the right array -- too lazy to deal with conversion. In particular for output arrays, checking flags and dtype is just faster to write than looking up the right flags in the FromAny docs.
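
As a rough illustration of that "check, don't convert" habit for output arrays, here is a minimal sketch. It is deliberately not numpy C API or Cython code: it uses the standard-library memoryview (the PEP 3118 buffer protocol) as a stand-in for the flag and dtype checks, and require_output_buffer is a hypothetical helper name, not something numpy or Cython provides.

```python
from array import array


def require_output_buffer(obj, fmt="d"):
    """Return a writable 1-D view of obj, or raise. Never converts or copies."""
    view = memoryview(obj)  # TypeError if obj does not export a buffer
    if view.readonly:
        raise ValueError("output buffer must be writable")
    if view.ndim != 1 or not view.c_contiguous:
        raise ValueError("output buffer must be 1-D and C-contiguous")
    if view.format != fmt:
        raise TypeError("expected item format %r, got %r" % (fmt, view.format))
    return view


out = array("d", [0.0, 0.0, 0.0])
view = require_output_buffer(out)
view[0] = 1.5  # writes straight into the caller's storage, no copy made
```

Anything that is not exactly the expected kind of buffer is rejected with an exception, which is the behaviour one wants for output arguments, since a silent conversion would write into a temporary copy.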

Dag

>
>One would of course check that things are contiguous (or pass on the 
>strides), check the dtype and dispatch to different C functions in each
>case, etc.
>
>But that isn't the point. Scientific code most of the time does fall in
>the "already-failing" category. That doesn't mean it doesn't count. 
>Let's focus on the number of code lines written and developer hours
>that will be spent cleaning up the mess -- not the "validity" of the
>code in question.
>
>>
>>
>>     If you know the datatype, you can really do
>>
>>     def f(np.ndarray[double] arr):
>>          c_func(&arr[0])
>>
>>     which works with PEP 3118. But I use PyArray_DATA out of habit (and
>>     since it works in the cases without dtype).
>>
>>     Frankly, I don't expect any Cython code to do the right thing here;
>>     calling PyArray_FromAny is much more typing. And really, nobody ever
>>     questioned that if we had an actual ndarray instance, we'd be allowed to
>>     call PyArray_DATA.
>>
>>     I don't know how much Cython code is out there in the wild for which
>>     this is a problem. Either way, it would cause something of a reeducation
>>     challenge for Cython users.
>>
>>
>> Since this style of coding already has known problems, do you think the
>> case with NA-masks deserves more attention here? What will happen is
>> access to array element data without consideration of the mask, which
>> seems similar in nature to accessing array data with the wrong stride or
>> byte order.
>
>I don't agree with the premise of that paragraph. There's no reason to 
>assume that just because code doesn't call FromAny, it has problems. 
>(And I'll continue to assume that whatever array is returned from 
>"np.ascontiguousarray" is really contiguous...)
>
>Whether it requires attention or not is a different issue though. I'm 
>not sure. I think other people should weigh in on that -- I mostly write 
>code for my own consumption.
>
>One should at least check pandas, scikits-image, scikits-learn, mpi4py,
>petsc4py, and so on. And ask on the Cython users list. Hopefully it will 
>usually be PEP 3118. But now I need to turn in.
>
>Travis, would such a survey be likely to affect the outcome of your 
>decision in any way? Or should we just leave this for now?
>
>Dag
>
>>
>> Cheers,
>> Mark
>>
>>     Dag
>>
>>      >
>>      > Tutorial From Cython Website
>>      > ----------------------------
>>      >
>>      > http://docs.cython.org/src/tutorial/numpy.html
>>      >
>>      > This tutorial gives a convolution example, and all the examples fail with
>>      > Python exceptions when given inputs that contain NA values.
>>      >
>>      > Before any Cython type annotation is introduced, the code functions just
>>      > as equivalent Python would in the interpreter.
>>      >
>>      > When the type information is introduced, it is done via numpy.pxd which
>>      > defines a mapping between an ndarray declaration and PyArrayObject \*.
>>      > Under the hood, this maps to __Pyx_ArgTypeTest, which does a direct
>>      > comparison of Py_TYPE(obj) against the PyTypeObject for the ndarray.
>>      >
>>      > Then the code does some dtype comparisons, and uses regular python indexing
>>      > to access the array elements. This python indexing still goes through the
>>      > Python API, so the NA handling and error checking in numpy still can work
>>      > like normal and fail if the inputs have NAs which cannot fit in the output
>>      > array. In this case it fails when trying to convert the NA into an integer
>>      > to set it in the output.
>>      >
>>      > The next version of the code introduces more efficient indexing. This
>>      > operates based on Python's buffer protocol. This causes Cython to call
>>      > __Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which calls
>>      > PyObject_GetBuffer. This call gives numpy the opportunity to raise an
>>      > exception if the inputs are arrays with NA-masks, something not supported
>>      > by the Python buffer protocol.
>>      >
>>      > Numerical Python - JPL website
>>      > ------------------------------
>>      >
>>      > http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html
>>      >
>>      > This document is from 2001, so does not reflect recent numpy, but it is the
>>      > second hit when searching for "numpy c api example" on google.
>>      >
>>      > The first example, heading "A simple example", is in fact already
>>      > invalid for recent numpy even without the NA support. In particular, if
>>      > the data is misaligned or in a different byteorder, it may crash or
>>      > produce incorrect results.
>>      >
>>      > The next thing the document does is introduce
>>      > PyArray_ContiguousFromObject, which gives numpy an opportunity to raise
>>      > an exception when NA-masked arrays are used, so the later code will
>>      > raise exceptions as desired.
>>      >
>>      >
>>      >
>>      > _______________________________________________
>>      > NumPy-Discussion mailing list
>>      > NumPy-Discussion at scipy.org
>>      > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


