Objects exposing the array interface
An issue was raised yesterday on GitHub regarding np.may_share_memory when run on a class exposing an array using the __array__ method. You can check the details here: https://github.com/numpy/numpy/issues/5604

Looking into it, I found out that NumPy doesn't really treat objects exposing __array__, __array_interface__, or __array_struct__ as if they were proper arrays:

1. When converting these objects to arrays using PyArray_Converter, if the array returned by any of the array interfaces is not C contiguous, aligned, and writeable, a copy that satisfies those requirements will be made. Proper arrays and subclasses are passed unchanged. This is the source of the error reported above.

2. When converting these objects using PyArray_OutputConverter, as well as in similar code in the ufunc machinery, anything other than a proper array or subclass raises an error. This means that, contrary to what the docs on subclassing say (see below), you cannot use an object exposing the array interface as an output parameter to a ufunc.

The following classes can be used to test this behavior:

class Foo:
    def __init__(self, arr):
        self.arr = arr
    def __array__(self):
        return self.arr

class Bar:
    def __init__(self, arr):
        self.arr = arr
        self.__array_interface__ = arr.__array_interface__

class Baz:
    def __init__(self, arr):
        self.arr = arr
        self.__array_struct__ = arr.__array_struct__

They all behave the same with these examples:
>>> a = Foo(np.ones(5))
>>> np.add(a, a)
array([ 2.,  2.,  2.,  2.,  2.])
>>> np.add.accumulate(a)
array([ 1.,  2.,  3.,  4.,  5.])
>>> np.add(a, a, out=a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: return arrays must be of ArrayType
>>> np.add.accumulate(a, out=a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: output must be an array
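To connect this back to the np.may_share_memory report above: the two conversion paths disagree whenever the wrapped array is not C contiguous. A rough sketch of the 1.9-era behaviour (the exact outputs are an assumption based on the description above):

>>> base = np.zeros((3, 4))[:, :3]        # non-contiguous view
>>> base.flags['C_CONTIGUOUS']
False
>>> # np.asarray imposes no requirements, so the array exposed through
>>> # __array__ is passed along and memory stays shared:
>>> np.may_share_memory(np.asarray(Foo(base)), base)
True
>>> # np.may_share_memory itself converts its arguments with
>>> # PyArray_Converter, which demands a C contiguous, aligned, writeable
>>> # array and therefore makes a copy of the non-contiguous one:
>>> np.may_share_memory(Foo(base), base)
False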
I think this should be changed, and whatever gets handed back by these methods/interfaces should be treated as if it were an array or a subclass of it. This is actually what the docs on subclassing say about __array__ here:

http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.__...

It does, however, seem to contradict a rather cryptic comment in the code of PyArray_GetArrayParamsFromObject, which is part of the call sequence of this whole mess, see here:

https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/core/src/multiar...

/*
 * If op supplies the __array__ function.
 * The documentation says this should produce a copy, so
 * we skip this method if writeable is true, because the intent
 * of writeable is to modify the operand.
 * XXX: If the implementation is wrong, and/or if actual
 * usage requires this behave differently,
 * this should be changed!
 */

There has already been some discussion in the issue linked above, but I would appreciate any other thoughts on the idea of treating objects with some form of array interface as if they were arrays. Does it need a deprecation cycle? Is there some case I am not considering where this could go horribly wrong?

Jaime
On Wed, Feb 25, 2015 at 1:24 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
1. When converting these objects to arrays using PyArray_Converter, if the array returned by any of the array interfaces is not C contiguous, aligned, and writeable, a copy that satisfies those requirements will be made. Proper arrays and subclasses are passed unchanged. This is the source of the error reported above.

I'm not entirely sure I understand this -- how is PyArray_Converter used in numpy? For example, if I pass a non-contiguous array to your class Foo, np.asarray does not do a copy:

In [25]: orig = np.zeros((3, 4))[:2, :3]

In [26]: orig.flags
Out[26]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [27]: subclass = Foo(orig)

In [28]: np.asarray(subclass)
Out[28]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [29]: np.asarray(subclass)[:] = 1

In [30]: np.asarray(subclass)
Out[30]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

But yes, this is probably a bug.

2. When converting these objects using PyArray_OutputConverter, as well as in similar code in the ufunc machinery, anything other than a proper array or subclass raises an error. This means that, contrary to what the docs on subclassing say (see below), you cannot use an object exposing the array interface as an output parameter to a ufunc.
Here it might be a good idea to distinguish between objects that define __array__ and objects that define __array_interface__/__array_struct__. A class that defines __array__ might not be very ndarray-like at all, but rather be something that can be *converted* to an ndarray. For example, objects in pandas define __array__, but updating the return value of df.__array__() in-place will not necessarily update the DataFrame (e.g., if the frame had inhomogeneous dtypes).
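As a concrete illustration of that distinction, here is a toy container in the spirit of what is described above (a sketch only, not pandas' actual implementation): its __array__ materializes a fresh array on every call, so the result is perfectly writeable, yet writes to it never reach the original object.

import numpy as np

class MixedTable:
    # Toy container with inhomogeneous column dtypes (hypothetical).
    def __init__(self):
        self.ints = np.array([1, 2, 3])
        self.floats = np.array([1.5, 2.5, 3.5])
    def __array__(self):
        # Build a *new* homogeneous float array each time; there is no
        # single buffer that writes could propagate back to.
        return np.column_stack([self.ints, self.floats])

table = MixedTable()
arr = np.asarray(table)
arr[:] = 0             # succeeds: arr is a normal, writeable ndarray
print(table.ints)      # [1 2 3] -- the original container is untouched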
On Wed, Feb 25, 2015 at 1:56 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Wed, Feb 25, 2015 at 1:24 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
1. When converting these objects to arrays using PyArray_Converter, if the array returned by any of the array interfaces is not C contiguous, aligned, and writeable, a copy that satisfies those requirements will be made. Proper arrays and subclasses are passed unchanged. This is the source of the error reported above.

I'm not entirely sure I understand this -- how is PyArray_Converter used in numpy? For example, if I pass a non-contiguous array to your class Foo, np.asarray does not do a copy:
It is used by many (all?) C functions that take an array as input. This follows a different path than what np.asarray or np.asanyarray do, which are calls to np.array, which maps to the C function _array_fromobject which can be found here: https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/core/src/multiar... And ufuncs have their own conversion code, which doesn't really help either. Not sure it would be possible to have them all use a common code base, but it is certainly well worth trying.
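Roughly, the difference between the two entry points could be paraphrased in Python like this (an approximate sketch of what the 1.9 sources do, not the actual code):

import numpy as np

def c_level_converter(obj):
    # Paraphrase of PyArray_Converter: proper arrays and subclasses pass
    # straight through...
    if isinstance(obj, np.ndarray):
        return obj
    # ...while anything else, including objects exposing __array__ and
    # friends, must end up C contiguous, aligned and writeable, which
    # forces a copy when the exposed array is not.
    return np.require(np.asarray(obj), requirements=['C', 'A', 'W'])

def python_level_asarray(obj):
    # Paraphrase of the np.array/_array_fromobject path: no forced
    # requirements, so the exposed array is passed along as is.
    return np.asarray(obj)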
In [25]: orig = np.zeros((3, 4))[:2, :3]

In [26]: orig.flags
Out[26]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [27]: subclass = Foo(orig)

In [28]: np.asarray(subclass)
Out[28]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [29]: np.asarray(subclass)[:] = 1

In [30]: np.asarray(subclass)
Out[30]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
But yes, this is probably a bug.
2. When converting these objects using PyArray_OutputConverter, as well as in similar code in the ufunc machinery, anything other than a proper array or subclass raises an error. This means that, contrary to what the docs on subclassing say (see below), you cannot use an object exposing the array interface as an output parameter to a ufunc.
Here it might be a good idea to distinguish between objects that define __array__ and objects that define __array_interface__/__array_struct__. A class that defines __array__ might not be very ndarray-like at all, but rather be something that can be *converted* to an ndarray. For example, objects in pandas define __array__, but updating the return value of df.__array__() in-place will not necessarily update the DataFrame (e.g., if the frame had inhomogeneous dtypes).
I am not really sure what the behavior of __array__ should be. The link to the subclassing docs I gave before indicates that it should be possible to write to the returned array if it is writeable (and pandas should probably set the writeable flag to False if it cannot reliably be written to), but the obscure comment I mentioned seems to point to the opposite: that it should never be written to. This is probably a good moment in time to figure out what the proper behavior should be and document it.

Jaime
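For concreteness, the "set the writeable flag to False" option could look something like the following. This is only a sketch of the idea; nothing here is existing pandas or NumPy behavior.

import numpy as np

class ReadOnlyFoo:
    # Wrapper standing in for an object that cannot reliably honor writes
    # through the array it exposes (hypothetical).
    def __init__(self, arr):
        self.arr = arr
    def __array__(self):
        # Hand out a read-only view so attempted writes fail loudly
        # instead of being silently lost.
        view = self.arr.view()
        view.flags.writeable = False
        return view

a = np.asarray(ReadOnlyFoo(np.ones(3)))
# a[:] = 0 now raises "ValueError: assignment destination is read-only".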
On Wed, Feb 25, 2015 at 2:48 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
I am not really sure what the behavior of __array__ should be. The link to the subclassing docs I gave before indicates that it should be possible to write to the returned array if it is writeable (and pandas should probably set the writeable flag to False if it cannot reliably be written to), but the obscure comment I mentioned seems to point to the opposite: that it should never be written to. This is probably a good moment in time to figure out what the proper behavior should be and document it.
It's one thing to rely on the result of __array__ being writeable. It's another thing to rely on writing to that array to modify the original array-like object. Presuming the latter would be a mistake. Let me give three categories of examples where I know this would fail:

- pandas: for DataFrame objects with inhomogeneous dtype
- netCDF4 and other IO libraries: the array's data may be read-only on disk or require a network call to access. The memory model may not even be able to be cleanly mapped to numpy's (e.g., it may use chunked storage)
- blaze.Data: Blaze arrays use lazy evaluation and don't support mutation

As far as I know, none of these libraries produce read-only ndarray objects from __array__. It can actually be highly convenient to return normal, writeable ndarrays even if they don't modify the original source, because this lets you do all the normal numpy stuff to the returned array, including operations that mutate it.