Bug in np.nonzero / Should index returning functions return ndarray subclasses?
There is a reported bug (issue #5837 https://github.com/numpy/numpy/issues/5837) regarding different returns from np.nonzero with 1-D vs higher dimensional arrays. A full summary of the differences can be seen from the following output:
>>> class C(np.ndarray): pass
...
>>> a = np.arange(6).view(C)
>>> b = np.arange(6).reshape(2, 3).view(C)
>>> anz = a.nonzero()
>>> bnz = b.nonzero()
>>> type(anz[0])
>>> anz[0].flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> anz[0].base
>>> type(bnz[0])
>>> bnz[0].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  UPDATEIFCOPY : False
>>> bnz[0].base
array([[0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])
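To repeat this comparison on whichever NumPy version is installed, a small script along these lines can be used. It is only a sketch: the helper name `describe` is made up for illustration, and the printed types and flags depend on the NumPy version, so no expected output is shown.

```python
import numpy as np


class C(np.ndarray):
    pass


def describe(label, index_arrays):
    """Print type, ownership and writeability of each index array (hypothetical helper)."""
    for axis, idx in enumerate(index_arrays):
        print(label, axis, type(idx).__name__,
              "owndata=%s" % idx.flags.owndata,
              "writeable=%s" % idx.flags.writeable,
              "has_base=%s" % (idx.base is not None))


a = np.arange(6).view(C)
b = np.arange(6).reshape(2, 3).view(C)
anz = a.nonzero()
bnz = b.nonzero()

describe("1-D", anz)
describe("2-D", bnz)
```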
The original bug report was only concerned with the non-writeability of higher dimensional array returns, but there are more differences: 1-D always returns an ndarray that owns its memory and is writeable, but higher dimensional arrays return views, of the type of the original array, that are non-writeable.

I have a branch that attempts to fix this by making both 1-D and n-D arrays:

1. return a view, never the base array,
2. return an ndarray, never a subclass, and
3. return a writeable view.

I guess the most controversial choice is #2, and in fact making that change breaks a few tests. I nevertheless think that all of the index returning functions (nonzero, argsort, argmin, argmax, argpartition) should always return a bare ndarray, not a subclass. I'd be happy to be corrected, but I can't think of any situation in which preserving the subclass would be needed for these functions.

Since we are changing the returns of a few other functions in 1.10 (diagonal, diag, ravel), it may be a good moment to revisit the behavior for these other functions. Any thoughts?

Jaime

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.
On May 9, 2015 10:48 AM, "Jaime Fernández del Río" wrote:
> I have a branch that attempts to fix this by making both 1-D and n-D arrays:
> 1. return a view, never the base array,

This doesn't matter, does it? "View" isn't a thing, only "view of" is meaningful. And in this case, none of the returned arrays share any memory with any other arrays that the user has access to... so whether they were created as a view or not should be an implementation detail that's transparent to the user?

> 2. return an ndarray, never a subclass, and
> 3. return a writeable view.
>
> I guess the most controversial choice is #2, and in fact making that change breaks a few tests. I nevertheless think that all of the index returning functions (nonzero, argsort, argmin, argmax, argpartition) should always return a bare ndarray, not a subclass. I'd be happy to be corrected, but I can't think of any situation in which preserving the subclass would be needed for these functions.

I also can't see any logical reason why the return type of these functions has anything to do with the type of the inputs. You can index me with my phone number but my phone number is not a person. OTOH logic and ndarray subclassing don't have much to do with each other; the practical effect is probably more important. Looking at the subclasses I know about (masked arrays, np.matrix, and astropy quantities), though, I also can't see much benefit in copying the subclass of the input, and the fact that we were never consistent about this suggests that people probably aren't depending on it too much.

So in summary my feeling is: +1 to making them writable, no objection to the view thing (though I don't see how it matters), and provisional +1 to consistently returning ndarray (to be revised if the people who use the subclassing functionality disagree).

-n
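The point about shared memory can be checked directly. The following is a minimal sketch, assuming a NumPy release that provides np.shares_memory (it was added after this thread); it checks whether the index arrays overlap the input or each other, regardless of whether they were created as views internally.

```python
import numpy as np


class C(np.ndarray):
    pass


b = np.arange(6).reshape(2, 3).view(C)
bnz = b.nonzero()

# Does any index array overlap the memory of the input array?
shares_with_input = [np.shares_memory(idx, b) for idx in bnz]

# Do the per-axis index arrays overlap each other?
shares_with_each_other = np.shares_memory(bnz[0], bnz[1])

print(shares_with_input, shares_with_each_other)
```

If both checks come out negative, a write to one index array cannot be observed through any other array the user holds, which is the sense in which "view or not" is an implementation detail.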
Absolutely, it should be writable. As for subclassing, that might be messy.
Consider the following:
inds = np.where(data > 5)
In that case, I'd expect a normal, bog-standard ndarray because that is
what you use for indexing (although pandas might have a good argument for
having it return one of their special indexing types if "data" was a pandas
array...). Next:
foobar = np.where(data > 5, 1, 2)
Again, I'd expect a normal, bog-standard ndarray because the scalar
elements are very simple. This question gets very complicated when
considering array arguments. Consider:
merged_data = np.where(data > 5, data, data2)
So, what should "merged_data" be? If both "data" and "data2" are the same
types, then it would be reasonable to return the same type, if possible.
But what if they aren't the same? Maybe use __array_priority__ to determine the
return type? Or, perhaps it does make sense to say "sod it all" and always
return an ndarray?
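The three call forms above can be tried side by side. This is only a sketch: the subclass C and the sample arrays are made up for illustration, and the printed result types depend on the NumPy version, so none are claimed here.

```python
import numpy as np


class C(np.ndarray):
    pass


data = np.arange(10).view(C)
data2 = np.zeros(10, dtype=data.dtype)  # plain ndarray, a different type from `data`

inds = np.where(data > 5)                      # one-argument form: tuple of index arrays
foobar = np.where(data > 5, 1, 2)              # scalar branches
merged_data = np.where(data > 5, data, data2)  # array branches of mixed type

for name, result in [("inds[0]", inds[0]),
                     ("foobar", foobar),
                     ("merged_data", merged_data)]:
    print(name, type(result).__name__)
```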
I don't know the answer. I do find it interesting that the result from a
multi-dimensional array is not writable. I don't know why I have never
encountered that.
Ben Root
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On May 9, 2015 12:54 PM, "Benjamin Root" wrote:
> Absolutely, it should be writable. As for subclassing, that might be messy. Consider the following:
>
> inds = np.where(data > 5)
>
> In that case, I'd expect a normal, bog-standard ndarray because that is what you use for indexing (although pandas might have a good argument for having it return one of their special indexing types if "data" was a pandas array...).

Pandas doesn't subclass ndarray (anymore), so they're irrelevant to this particular discussion :-). Of course they're an argument for having a cleaner more general way of allowing non-ndarray array-like objects, but the legacy subclassing system will never be that.

> Next:
>
> foobar = np.where(data > 5, 1, 2)
>
> Again, I'd expect a normal, bog-standard ndarray because the scalar elements are very simple. This question gets very complicated when considering array arguments. Consider:
>
> merged_data = np.where(data > 5, data, data2)
>
> So, what should "merged_data" be? If both "data" and "data2" are the same types, then it would be reasonable to return the same type, if possible. But what if they aren't the same? Maybe use __array_priority__ to determine the return type? Or, perhaps it does make sense to say "sod it all" and always return an ndarray?

Not sure what this has to do with Jaime's post about nonzero? There is indeed a potential question about what 3-argument where() should do with subclasses, but that's effectively a different operation entirely and to discuss it we'd need to know things like what it historically has done and why that was causing problems.

-n
On Sat, May 9, 2015 at 4:03 PM, Nathaniel Smith wrote:
> Not sure what this has to do with Jaime's post about nonzero? There is indeed a potential question about what 3-argument where() should do with subclasses, but that's effectively a different operation entirely and to discuss it we'd need to know things like what it historically has done and why that was causing problems.

Because my train of thought started at np.nonzero(), which I have always just mentally mapped to np.where(), and then... squirrel!

Indeed, np.where() has no bearing here.

Ben Root
On Sat, May 9, 2015 at 1:27 PM, Benjamin Root wrote:
> Because my train of thought started at np.nonzero(), which I have always just mentally mapped to np.where(), and then... squirrel!
>
> Indeed, np.where() has no bearing here.

Ah, gotcha :-). There is an argument that we should try to reduce this confusion by nudging people to use np.nonzero() consistently instead of np.where(), via the documentation and/or a warning message...

--
Nathaniel J. Smith -- http://vorpus.org
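For reference, the one-argument form of np.where() gives the same index arrays as np.nonzero(), which is why the two are easy to conflate. A minimal sketch with made-up sample data:

```python
import numpy as np

data = np.array([[1, 7, 3],
                 [9, 2, 8]])
cond = data > 5

via_where = np.where(cond)      # one-argument form
via_nonzero = np.nonzero(cond)  # the more explicit spelling

# Compare the per-axis index arrays pairwise.
same = all(np.array_equal(w, n) for w, n in zip(via_where, via_nonzero))
print(same)

# Either result can be used directly for indexing.
print(data[via_nonzero])
```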
With regards to np.where -- shouldn't where be a ufunc, so subclasses or other array-likes can control its behavior with __numpy_ufunc__?

As for the other indexing functions, I don't have a strong opinion about how they should handle subclasses. But it is certainly tricky to attempt to handle arbitrary subclasses. I would agree that the least error-prone thing to do is usually to return base ndarrays. Better to force subclasses to override methods explicitly.
Agreed that indexing functions should return bare `ndarray`. Note that in
Jaime's PR one can override it anyway by defining __nonzero__. -- Marten
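An explicit override of this kind could look roughly like the following. It is a sketch only: it assumes the hook is the ordinary `nonzero` method, the class name `Tagged` is hypothetical, and it does not show what Jaime's branch actually does.

```python
import numpy as np


class Tagged(np.ndarray):
    """Hypothetical subclass that opts in to subclass-typed index arrays."""

    def nonzero(self):
        # Compute the indices on a bare ndarray view, then re-wrap explicitly.
        plain = np.asarray(self).nonzero()
        return tuple(idx.view(Tagged) for idx in plain)


x = np.arange(6).reshape(2, 3).view(Tagged)
result = x.nonzero()
print([type(idx).__name__ for idx in result])
```

With this arrangement the default stays a bare ndarray, and only subclasses that ask for something else get it.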
participants (5)
- Benjamin Root
- Jaime Fernández del Río
- Marten van Kerkwijk
- Nathaniel Smith
- Stephan Hoyer