[Numpy-discussion] NA-mask interactions with existing C code

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 20:35:57 EDT 2012



Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:

>On 05/11/2012 01:06 AM, Mark Wiebe wrote:
>> On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>
>wrote:
>>
>>     On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>>      > I did some searching for typical Cython and C code which
>accesses
>>     numpy
>>      > arrays, and added a section to the NEP describing how they
>behave
>>     in the
>>      > current implementation. Cython code which uses either straight
>Python
>>      > access or the buffer protocol is fine (after a bugfix in
>numpy, it
>>      > wasn't failing currently as it should in the pep3118 case). C
>>     code which
>>      > follows the recommended practice of using PyArray_FromAny or
>one
>>     of the
>>      > related macros is also fine, because these functions have been
>>     made to
>>      > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is
>>     provided.
>>      >
>>      > In general, code which follows the recommended numpy practices
>will
>>      > raise exceptions when encountering NA-masked arrays. This
>means
>>      > programmers don't have to worry about the NA unless they want
>to
>>     support
>>      > it. Having things go through PyArray_FromAny also provides a
>>     place where
>>      > lazy evaluation arrays could be evaluated, and other similar
>>     potential
>>      > future extensions can use to provide compatibility.
>>      >
>>      > Here's the section I added to the NEP:
>>      >
>>      > Interaction With Pre-existing C API Usage
>>      > =========================================
>>      >
>>      > Making sure existing code using the C API, whether it's
>written
>>     in C, C++,
>>      > or Cython, does something reasonable is an important goal of
>this
>>      > implementation.
>>      > The general strategy is to make existing code which does not
>>     explicitly
>>      > tell numpy it supports NA masks fail with an exception saying
>so.
>>     There are
>>      > a few different access patterns people use to get ahold of the
>numpy
>>      > array data,
>>      > here we examine a few of them to see what numpy can do. These
>>     examples are
>>      > found from doing google searches of numpy C API array access.
>>      >
>>      > Numpy Documentation - How to extend NumPy
>>      > -----------------------------------------
>>      >
>>      >
>>      > http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>>      >
>>      > This page has a section "Dealing with array objects" which has
>some
>>      > advice for how
>>      > to access numpy arrays from C. When accepting arrays, the
>first
>>     step it
>>      > suggests is
>>      > to use PyArray_FromAny or a macro built on that function, so
>code
>>      > following this
>>      > advice will properly fail when given an NA-masked array it
>>     doesn't know
>>      > how to handle.
>>      >
>>      > The way this is handled is that PyArray_FromAny requires a
>>     special flag,
>>      > NPY_ARRAY_ALLOWNA,
>>      > before it will allow NA-masked arrays to flow through.
>>      >
>>      >
>>      > http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>>      >
>>      > Code which does not follow this advice, and instead just calls
>>      > PyArray_Check() to verify it's an ndarray and checks some flags,
>>      > will silently produce incorrect results. This style of code does
>>      > not provide any opportunity for numpy to say "hey, this array is
>>      > special", so it is also not compatible with future ideas of lazy
>>      > evaluation, derived dtypes, etc.
>>
>>     This doesn't really cover the Cython code I write that interfaces
>with C
>>     (and probably the code others write in Cython).
>>
>>     Often I'd do:
>>
>>     def f(arg):
>>          cdef np.ndarray arr = np.asarray(arg)
>>          c_func(np.PyArray_DATA(arr))
>>
>>     So I mix Python np.asarray with C PyArray_DATA. In general, I think
>>     you use PyArray_FromAny if you're very concerned about performance
>>     or need some special flag, but it's certainly not the first thing
>>     you try.
>>
>>
>> I guess this mixture of Python-API and C-API is different from the way
>> the API tries to protect against incorrect access. From the Python API,
>> it should let everything through, because it's for Python code to use.
>> From the C API, it should default to not letting things through,
>> because special NA-mask aware code needs to be written. I'm not sure if
>> there is a reasonable approach here which works for everything.
>
>Does that mean you consider changing ob_type for masked arrays 
>unreasonable? They can still use the same object struct...
>
>>
>>     But in general, I will often be lazy and just do
>>
>>     def f(np.ndarray arr):
>>          c_func(np.PyArray_DATA(arr))
>>
>>     It's an exception if you don't provide an array -- so who cares.
>(I
>>     guess the odds of somebody feeding a masked array to code like
>that,
>>     which doesn't try to be friendly, is relatively smaller though.)
>>
>>
>> This code would already fail with non-contiguous strides or
>> byte-swapped data, so the additional NA mask case seems to fit in an
>> already-failing category.
>
>Honestly! I hope you didn't think I provided a full-fledged example?
>Perhaps you'd like to point out to me that "c_func" is a bad name for a
>function as well?
>
>One would of course check that things are contiguous (or pass on the
>strides), check the dtype and dispatch to different C functions in each
>case, etc.
>
>But that isn't the point. Scientific code most of the time does fall in
>the "already-failing" category. That doesn't mean it doesn't count.
>Let's focus on the number of code lines written and developer hours
>that will be spent cleaning up the mess -- not the "validity" of the
>code in question.
>
>>
>>
>>     If you know the datatype, you can really do
>>
>>     def f(np.ndarray[double] arr):
>>          c_func(&arr[0])
>>
>>     which works with PEP 3118. But I use PyArray_DATA out of habit
>(and
>>     since it works in the cases without dtype).
>>
>>     Frankly, I don't expect any Cython code to do the right thing
>here;
>>     calling PyArray_FromAny is much more typing. And really, nobody
>ever
>>     questioned that if we had an actual ndarray instance, we'd be
>allowed to
>>     call PyArray_DATA.
>>
>>     I don't know how much Cython code is out there in the wild for
>which
>>     this is a problem. Either way, it would cause something of a
>reeducation
>>     challenge for Cython users.
>>
>>
>> Since this style of coding already has known problems, do you think the
>> case with NA-masks deserves more attention here? What will happen is
>> access to array element data without consideration of the mask, which
>> seems similar in nature to accessing array data with the wrong stride
>> or byte order.


I realized something -- I think this is not the most important question to ask.

The question to ask is: what will create a nice, seamless NA experience for a NumPy user? Can he/she just call a function (which may call other functions, which may call...) with a masked array and trust that it either gives a correct result or barfs? It's not a question of how much code needs fixing, but of the uncertainty, and the delay in adoption, created by the fact that existing code needs to be verified. With ndmasked, you get a *guarantee* against old code silently ignoring the mask.
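
To make that guarantee concrete, here is a minimal pure-Python sketch. It assumes the masked array is a separate type that does not subclass the plain array type. PlainArray, MaskedArray and legacy_sum are hypothetical stand-ins for illustration, not NumPy APIs:

```python
class PlainArray:
    """Hypothetical stand-in for a plain ndarray: just holds values."""

    def __init__(self, data):
        self.data = list(data)


class MaskedArray:
    """Hypothetical stand-in for a separate masked type (not a subclass)."""

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)  # True marks an NA element


def legacy_sum(arr):
    """Old code that knows nothing about masks; it only type-checks its input."""
    if not isinstance(arr, PlainArray):
        raise TypeError("expected PlainArray, got %s" % type(arr).__name__)
    return sum(arr.data)


print(legacy_sum(PlainArray([1, 2, 3])))
try:
    legacy_sum(MaskedArray([1, 2, 3], [False, True, False]))
except TypeError as exc:
    print("rejected:", exc)
```

Because the type check itself rejects the masked object, the old function never gets as far as reading the data while ignoring the mask. If MaskedArray were a subclass of PlainArray, the same call would pass the check and return a sum that includes the NA element.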

(crazy thought: look into whether ob_type can be reassigned after object creation? I wouldn't put it past CPython to pull off a hack like that.)
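
For what it's worth, the Python-level analogue of that hack does exist for ordinary classes: assigning to __class__ retags a live object. The sketch below only shows that much. Plain and Masked are hypothetical names, and whether the same can be done safely to an ndarray's ob_type from C is an open question here, not something this demonstrates:

```python
class Plain:
    """Hypothetical plain container."""

    def __init__(self, data):
        self.data = list(data)


class Masked:
    """Hypothetical masked variant with a compatible instance layout."""

    def has_mask(self):
        return True


obj = Plain([1.0, 2.0])
obj.__class__ = Masked  # retag the existing object in place
print(type(obj).__name__, obj.data, obj.has_mask())

# Instances of built-in types such as list refuse the same assignment.
builtin_obj = [1.0, 2.0]
try:
    builtin_obj.__class__ = Masked
except TypeError as exc:
    print("refused:", exc)
```

The object keeps its identity and its data; only the type it reports changes. The refusal for the built-in list shows that CPython guards this at the Python level when the layouts are not compatible.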

Dag





>
>I don't agree with the premise of that paragraph. There's no reason to
>assume that just because code doesn't call FromAny, it has problems.
>(And I'll continue to assume that whatever array is returned from
>"np.ascontiguousarray" is really contiguous...)
>
>Whether it requires attention or not is a different issue though. I'm
>not sure. I think other people should weigh in on that -- I mostly
>write code for my own consumption.
>
>One should at least check pandas, scikits-image, scikits-learn, mpi4py,
>petsc4py, and so on. And ask on the Cython users list. Hopefully it
>will usually be PEP 3118. But now I need to turn in.
>
>Travis, would such a survey be likely to affect the outcome of your
>decision in any way? Or should we just leave this for now?
>
>Dag
>
>>
>> Cheers,
>> Mark
>>
>>     Dag
>>
>>      >
>>      > Tutorial From Cython Website
>>      > ----------------------------
>>      >
>>      > http://docs.cython.org/src/tutorial/numpy.html
>>      >
>>      > This tutorial gives a convolution example, and all the
>examples
>>     fail with
>>      > Python exceptions when given inputs that contain NA values.
>>      >
>>      > Before any Cython type annotation is introduced, the code
>>     functions just
>>      > as equivalent Python would in the interpreter.
>>      >
>>      > When the type information is introduced, it is done via
>numpy.pxd
>>     which
>>      > defines a mapping between an ndarray declaration and
>>     PyArrayObject \*.
>>      > Under the hood, this maps to __Pyx_ArgTypeTest, which does a
>direct
>>      > comparison of Py_TYPE(obj) against the PyTypeObject for the
>ndarray.
>>      >
>>      > Then the code does some dtype comparisons, and uses regular
>>      > python indexing to access the array elements. This python
>>      > indexing still goes through the Python API, so the NA handling
>>      > and error checking in numpy still can work like normal and fail
>>      > if the inputs have NAs which cannot fit in the output array. In
>>      > this case it fails when trying to convert the NA into an integer
>>      > to set it in the output.
>>      >
>>      > The next version of the code introduces more efficient
>indexing. This
>>      > operates based on Python's buffer protocol. This causes Cython
>to
>>     call
>>      > __Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which
>calls
>>      > PyObject_GetBuffer. This call gives numpy the opportunity to
>raise an
>>      > exception if the inputs are arrays with NA-masks, something
>not
>>     supported
>>      > by the Python buffer protocol.
>>      >
>>      > Numerical Python - JPL website
>>      > ------------------------------
>>      >
>>      >
>>      > http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html
>>      >
>>      > This document is from 2001, so does not reflect recent numpy,
>but
>>     it is the
>>      > second hit when searching for "numpy c api example" on google.
>>      >
>>      > The first example, heading "A simple example", is in fact
>>      > already invalid for recent numpy even without the NA support. In
>>      > particular, if the data is misaligned or in a different
>>      > byteorder, it may crash or produce incorrect results.
>>      >
>>      > The next thing the document does is introduce
>>      > PyArray_ContiguousFromObject, which
>>      > gives numpy an opportunity to raise an exception when
>NA-masked
>>     arrays
>>      > are used,
>>      > so the later code will raise exceptions as desired.
>>      >
>>      >
>>      >
>>      > _______________________________________________
>>      > NumPy-Discussion mailing list
>>      > NumPy-Discussion at scipy.org
>>      > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>>
>>
>>
>

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


