[Numpy-discussion] NA-mask interactions with existing C code
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Thu May 10 20:35:57 EDT 2012
Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>On 05/11/2012 01:06 AM, Mark Wiebe wrote:
>> On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no <mailto:d.s.seljebotn at astro.uio.no>>
>wrote:
>>
>> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>> > I did some searching for typical Cython and C code which
>accesses
>> numpy
>> > arrays, and added a section to the NEP describing how they
>behave
>> in the
>> > current implementation. Cython code which uses either straight
>Python
>> > access or the buffer protocol is fine (after a bugfix in
>numpy, it
>> > wasn't failing currently as it should in the pep3118 case). C
>> code which
>> > follows the recommended practice of using PyArray_FromAny or
>one
>> of the
>> > related macros is also fine, because these functions have been
>> made to
>> > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is
>> provided.
>> >
>> > In general, code which follows the recommended numpy practices
>will
>> > raise exceptions when encountering NA-masked arrays. This
>means
>> > programmers don't have to worry about the NA unless they want
>to
>> support
>> > it. Having things go through PyArray_FromAny also provides a
>> place where
>> > lazy evaluation arrays could be evaluated, and other similar
>> potential
>> > future extensions can use to provide compatibility.
>> >
>> > Here's the section I added to the NEP:
>> >
>> > Interaction With Pre-existing C API Usage
>> > =========================================
>> >
>> > Making sure existing code using the C API, whether it's
>written
>> in C, C++,
>> > or Cython, does something reasonable is an important goal of
>this
>> > implementation.
>> > The general strategy is to make existing code which does not
>> explicitly
>> > tell numpy it supports NA masks fail with an exception saying
>so.
>> There are
>> > a few different access patterns people use to get ahold of the
>numpy
>> > array data,
>> > here we examine a few of them to see what numpy can do. These
>> examples are
>> > found from doing google searches of numpy C API array access.
>> >
>> > Numpy Documentation - How to extend NumPy
>> > -----------------------------------------
>> >
>> >
>>
>http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>> >
>> > This page has a section "Dealing with array objects" which has
>some
>> > advice for how
>> > to access numpy arrays from C. When accepting arrays, the
>first
>> step it
>> > suggests is
>> > to use PyArray_FromAny or a macro built on that function, so
>code
>> > following this
>> > advice will properly fail when given an NA-masked array it
>> doesn't know
>> > how to handle.
>> >
>> > The way this is handled is that PyArray_FromAny requires a
>> special flag,
>> > NPY_ARRAY_ALLOWNA,
>> > before it will allow NA-masked arrays to flow through.
>> >
>> >
>>
>http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>> >
>> > Code which does not follow this advice, and instead just calls
>> > PyArray_Check() to verify
>> > it's an ndarray and checks some flags, will silently produce
>incorrect
>> > results. This style
>> > of code does not provide any opportunity for numpy to say
>"hey, this
>> > array is special",
>> > so also is not compatible with future ideas of lazy
>evaluation,
>> derived
>> > dtypes, etc.
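
The difference between the two access patterns described above can be sketched with a toy model in plain Python. Nothing here is NumPy's API: the Array class, the ALLOW_NA flag and from_any are hypothetical stand-ins modeled on ndarray, NPY_ARRAY_ALLOWNA and PyArray_FromAny, and exist only to show why a single conversion choke point lets old code fail loudly while a bare type check reads masked data silently.

```python
class Array:
    """Toy stand-in for an ndarray: values plus an optional NA mask."""

    def __init__(self, values, mask=None):
        self.values = list(values)
        self.mask = None if mask is None else list(mask)


ALLOW_NA = 0x1  # hypothetical flag, modeled on NPY_ARRAY_ALLOWNA


def from_any(obj, flags=0):
    """Single conversion choke point, modeled on PyArray_FromAny."""
    arr = obj if isinstance(obj, Array) else Array(obj)
    if arr.mask is not None and not (flags & ALLOW_NA):
        raise ValueError("this function does not support NA-masked arrays")
    return arr


def legacy_sum(obj):
    # Written before NA masks existed, so it never passes ALLOW_NA.
    return sum(from_any(obj).values)


def sum_skipping_na(obj):
    # NA-aware code opts in explicitly and then consults the mask.
    arr = from_any(obj, ALLOW_NA)
    if arr.mask is None:
        return sum(arr.values)
    return sum(v for v, is_na in zip(arr.values, arr.mask) if not is_na)


def unchecked_sum(arr):
    # Only a type check, as with a bare PyArray_Check(): the mask is ignored.
    if not isinstance(arr, Array):
        raise TypeError("expected an Array")
    return sum(arr.values)


masked = Array([1.0, 2.0, 3.0], mask=[False, True, False])
```

In this model legacy_sum(masked) raises ValueError, sum_skipping_na(masked) skips the masked element, and unchecked_sum(masked) quietly includes it, which is the silent-wrong-answer case the section warns about.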
>>
>> This doesn't really cover the Cython code I write that interfaces
>with C
>> (and probably the code others write in Cython).
>>
>> Often I'd do:
>>
>>     def f(arg):
>>         cdef np.ndarray arr = np.asarray(arg)
>>         c_func(np.PyArray_DATA(arr))
>>
>> So I mix Python np.asarray with C PyArray_DATA. In general, I
>think you
>> use PyArray_FromAny if you're very concerned about performance or
>need
>> some special flag, but it's certainly not the first thing you
>try.
>>
>>
>> I guess this mixture of Python-API and C-API is different from the
>way
>> the API tries to protect against incorrect access. From the Python API, it
>> should let everything through, because it's for Python code to use.
>From
>> the C API, it should default to not letting things through, because
>> special NA-mask aware code needs to be written. I'm not sure if there
>is
>> a reasonable approach here which works for everything.
>
>Does that mean you consider changing ob_type for masked arrays
>unreasonable? They can still use the same object struct...
>
>>
>> But in general, I will often be lazy and just do
>>
>>     def f(np.ndarray arr):
>>         c_func(np.PyArray_DATA(arr))
>>
>> It's an exception if you don't provide an array -- so who cares.
>(I
>> guess the odds of somebody feeding a masked array to code like
>that,
>> which doesn't try to be friendly, is relatively smaller though.)
>>
>>
>> This code would already fail with non-contiguous strides or
>byte-swapped
>> data, so the additional NA mask case seems to fit in an
>already-failing
>> category.
>
>Honestly! I hope you didn't think I provided a full-fledged example?
>Perhaps you'd like to point out to me that "c_func" is a bad name for a
>function as well?
>
>One would of course check that things are contiguous (or pass on the
>strides), check the dtype and dispatch to different C functions in each
>case, etc.
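
As a rough illustration of the checks listed above (contiguity, item type, dispatch), here is a minimal sketch using only the standard library's PEP 3118 support. It uses array.array and memoryview as stand-ins for an ndarray, so it is an analogy rather than NumPy code. The names as_checked_view, DISPATCH and total are hypothetical, and the byte-order check is omitted because array.array always holds native-order data.

```python
from array import array


def as_checked_view(obj):
    """Return a memoryview only if flat, pointer-style access would be safe."""
    view = memoryview(obj)  # raises TypeError if obj exports no buffer
    if not view.c_contiguous:
        raise ValueError("buffer must be C-contiguous")
    return view


def sum_double(view):
    # Stand-in for a C function that expects a contiguous double*.
    return float(sum(view))


def sum_long(view):
    # Stand-in for a C function that expects a contiguous long*.
    return int(sum(view))


# One kernel per supported item format, keyed by the buffer's format code.
DISPATCH = {"d": sum_double, "l": sum_long}


def total(obj):
    view = as_checked_view(obj)
    try:
        kernel = DISPATCH[view.format]
    except KeyError:
        raise TypeError(f"unsupported item format {view.format!r}") from None
    return kernel(view)


doubles = array("d", [1.0, 2.0, 3.0, 4.0])
strided = memoryview(doubles)[::2]  # every other element: not contiguous
```

Here total(doubles) dispatches to sum_double, total(strided) raises ValueError, and an array of C floats raises TypeError because no kernel is registered for it.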
>
>But that isn't the point. Scientific code most of the time does fall in
>the "already-failing" category. That doesn't mean it doesn't count.
>Let's focus on the number of code lines written and developer hours that
>will be spent cleaning up the mess -- not the "validity" of the code in
>question.
>
>>
>>
>> If you know the datatype, you can really do
>>
>> def f(np.ndarray[double] arr):
>> c_func(&arr[0])
>>
>> which works with PEP 3118. But I use PyArray_DATA out of habit
>(and
>> since it works in the cases without dtype).
>>
>> Frankly, I don't expect any Cython code to do the right thing
>here;
>> calling PyArray_FromAny is much more typing. And really, nobody
>ever
>> questioned that if we had an actual ndarray instance, we'd be
>allowed to
>> call PyArray_DATA.
>>
>> I don't know how much Cython code is out there in the wild for
>which
>> this is a problem. Either way, it would cause something of a
>reeducation
>> challenge for Cython users.
>>
>>
>> Since this style of coding already has known problems, do you think
>the
>> case with NA-masks deserves more attention here? What will happen is
>> access to array element data without consideration of the mask, which
>> seems similar in nature to accessing array data with the wrong stride
>or
>> byte order.
I realized something -- I think this is not the most important question to ask.
The question to ask is: what will create a nice, seamless NA experience for a NumPy user? Can he/she just call a function (which may call other functions, which may call...) with a masked array and trust that it either gives a correct result or barfs? It's not a question of how much code needs fixing, but of the uncertainty, and the delay in adoption, created by the need to verify existing code. With ndmasked, you get a *guarantee* against old code.
(crazy thought: look into whether ob_type can be reassigned after object creation? I wouldn't put it past CPython to pull off a hack like that.)
Dag
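
The guarantee argued for above, and the ob_type idea, can both be illustrated with a toy model in plain Python. The classes NDArray and NDMasked and the function old_total are hypothetical stand-ins, not NumPy's types. The last lines use Python-level `__class__` assignment, which CPython permits between layout-compatible classes defined in Python. Whether anything similar would be safe for a real ndarray at the C level is an open question that this sketch does not answer.

```python
class NDArray:
    """Toy stand-in for a plain array type."""

    def __init__(self, values):
        self.values = list(values)


class NDMasked:
    """Toy masked type; deliberately NOT a subclass of NDArray."""

    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)


def old_total(arr):
    """Legacy code: accepts only NDArray and knows nothing about masks."""
    if not isinstance(arr, NDArray):
        raise TypeError(f"expected NDArray, got {type(arr).__name__}")
    return sum(arr.values)


# A distinct type is rejected by every existing type check, with no opt-in
# needed from the old code.
masked = NDMasked([1.0, 2.0, 3.0], [False, True, False])

# Retagging an existing object after creation: attach a mask, then switch
# its type. The object's identity and its data are unchanged.
retagged = NDArray([1.0, 2.0])
retagged.mask = [False, True]
retagged.__class__ = NDMasked
```

After the retag, old_total(retagged) fails the same way old_total(masked) does, even though the object started life as a plain NDArray.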
>
>I don't agree with the premise of that paragraph. There's no reason to
>assume that just because code doesn't call FromAny, it has problems.
>(And I'll continue to assume that whatever array is returned from
>np.ascontiguousarray is really contiguous...)
>
>Whether it requires attention or not is a different issue though. I'm
>not sure. I think other people should weigh in on that -- I mostly write
>code for my own consumption.
>
>One should at least check pandas, scikits-image, scikits-learn, mpi4py,
>petsc4py, and so on. And ask on the Cython users list. Hopefully it will
>usually be PEP 3118. But now I need to turn in.
>
>Travis, would such a survey be likely to affect the outcome of your
>decision in any way? Or should we just leave this for now?
>
>Dag
>
>>
>> Cheers,
>> Mark
>>
>> Dag
>>
>> >
>> > Tutorial From Cython Website
>> > ----------------------------
>> >
>> > http://docs.cython.org/src/tutorial/numpy.html
>> >
>> > This tutorial gives a convolution example, and all the
>examples
>> fail with
>> > Python exceptions when given inputs that contain NA values.
>> >
>> > Before any Cython type annotation is introduced, the code
>> functions just
>> > as equivalent Python would in the interpreter.
>> >
>> > When the type information is introduced, it is done via
>numpy.pxd
>> which
>> > defines a mapping between an ndarray declaration and
>> PyArrayObject \*.
>> > Under the hood, this maps to __Pyx_ArgTypeTest, which does a
>direct
>> > comparison of Py_TYPE(obj) against the PyTypeObject for the
>ndarray.
>> >
>> > Then the code does some dtype comparisons, and uses regular
>> python indexing
>> > to access the array elements. This python indexing still goes
>> through the
>> > Python API, so the NA handling and error checking in numpy
>still
>> can work
>> > like normal and fail if the inputs have NAs which cannot fit
>in
>> the output
>> > array. In this case it fails when trying to convert the NA
>into
>> an integer
>> > to set it in the output.
>> >
>> > The next version of the code introduces more efficient
>indexing. This
>> > operates based on Python's buffer protocol. This causes Cython
>to
>> call
>> > __Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which
>calls
>> > PyObject_GetBuffer. This call gives numpy the opportunity to
>raise an
>> > exception if the inputs are arrays with NA-masks, something
>not
>> supported
>> > by the Python buffer protocol.
>> >
>> > Numerical Python - JPL website
>> > ------------------------------
>> >
>> >
>http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html
>> >
>> > This document is from 2001, so does not reflect recent numpy,
>but
>> it is the
>> > second hit when searching for "numpy c api example" on google.
>> >
>> > The first example, heading "A simple example", is in fact
>already
>> > invalid for
>> > recent numpy even without the NA support. In particular, if
>the
>> data is
>> > misaligned
>> > or in a different byteorder, it may crash or produce incorrect
>> results.
>> >
>> > The next thing the document does is introduce
>> > PyArray_ContiguousFromObject, which
>> > gives numpy an opportunity to raise an exception when
>NA-masked
>> arrays
>> > are used,
>> > so the later code will raise exceptions as desired.
>> >
>> >
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at scipy.org <mailto:NumPy-Discussion at scipy.org>
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.