There's been some work going on recently on Py2 vs Py3 object comparisons. If you want all the background, see gh-6265 https://github.com/numpy/numpy/issues/6265 and follow the links there.
There is a half-baked PR in the works, gh-6269 https://github.com/numpy/numpy/pull/6269, that tries to unify behavior and fix some bugs along the way, by replacing all 2.x uses of PyObject_Compare with several calls to PyObject_RichCompareBool, which is available on 2.6, the oldest Python version we support.
The poster child for this problem is computing np.sign on an object array that has an np.nan entry. 2.x will just make up an answer for us:
cmp(np.nan, 0)
-1
even though none of the relevant compares succeeds:
np.nan < 0
False
np.nan > 0
False
np.nan == 0
False
The current 3.x implementation is buggy, so the fact that it produces the same made-up result as 2.x is accidental:
np.sign(np.array([np.nan], 'O'))
array([-1], dtype=object)
Looking at the code, it seems that the original intention was for the answer to be `0`, which is equally made up but perhaps makes a little more sense.
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent to choosing a default return for `cmp` for comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns nan for those values, return e.g. None for these cases. This is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types that 3.x produces for some comparisons.
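To make the three options concrete, here is a minimal pure-Python sketch. It is not the code in loops.c or in gh-6269; the function name `object_sign` and its `on_unordered` and `fallback` parameters are made up for illustration. It computes the sign from `<`, `>` and `==` only, and differs between the options solely in what happens when all three comparisons are False.

```python
def object_sign(x, on_unordered="raise", fallback=0):
    """Sign of a single Python object using only <, > and ==.

    on_unordered selects the behavior when none of the comparisons holds:
    "fallback" (option 1), "none" (option 2) or "raise" (option 3).
    All names here are hypothetical, chosen for this sketch only.
    """
    if x > 0:
        return 1
    if x < 0:
        return -1
    if x == 0:
        return 0
    # None of the comparisons held, e.g. x is float('nan').
    if on_unordered == "fallback":
        return fallback          # option 1: an arbitrary, made-up value
    if on_unordered == "none":
        return None              # option 2: an explicit "no answer" marker
    raise TypeError("unorderable value for sign: %r" % (x,))  # option 3


nan = float("nan")
print(object_sign(nan, "fallback", fallback=-1))
print(object_sign(nan, "none"))
```

Ordinary numbers never reach the last three branches, so the choice only affects objects that fail every comparison.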
Thoughts anyone?
Jaime
On Sun, 2015-08-30 at 21:09 -0700, Jaime Fernández del Río wrote:
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent to choosing a default return for `cmp` for comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns nan for those values, return e.g. None for these cases. This is my preferred option.
That would be my gut feeling as well. Returning `NaN` could also make sense, but I guess we run into problems since we do not know the input type. So `None` seems like the only option here I can think of right now.
- Sebastian
3. Raise an error, along the lines of the TypeError: unorderable types that 3.x produces for some comparisons.
Thoughts anyone?
Jaime
(__/)
( O.o)
( > <)
This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, Aug 31, 2015 at 1:23 AM, Sebastian Berg sebastian@sipsolutions.net wrote:
That would be my gut feeling as well. Returning `NaN` could also make sense, but I guess we run into problems since we do not know the input type. So `None` seems like the only option here I can think of right now.
My inclination is that returning NaN would be the appropriate choice. It's certainly consistent with the behavior for float dtypes -- my expectation for object dtype behavior is that it works exactly like applying the np.sign ufunc to each element of the array individually.
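As a rough illustration of that expectation, the following sketch applies np.sign to each element of an object array on its own, so every element goes through the loop for its own scalar type rather than the object loop. The helper name `sign_each` is hypothetical, and this is only a model of the expected behavior, not how the ufunc machinery works internally.

```python
import numpy as np


def sign_each(arr):
    """Apply np.sign to every element of an object array separately.

    Hypothetical helper for illustration; each element is handed to
    np.sign as a scalar, so a Python float NaN is handled by the float loop.
    """
    out = np.empty(arr.shape, dtype=object)
    for idx, item in np.ndenumerate(arr):
        out[idx] = np.sign(item)
    return out


a = np.array([float("nan"), 3, -2.5], dtype=object)
result = sign_each(a)
print(result)
```

Under this model the NaN entry comes back as NaN, while the numeric entries get their usual signs.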
On the other hand, I suppose there are other ways in which an object can fail all those comparisons (e.g., NaT?), so I suppose we could return None. But it would still be a weird outcome for the most common case. Ideally, I suppose, np.sign would return an array with int-NA dtype, but that's a whole different can of worms...
Stephan
On Mon, 31 Aug 2015 10:23:10 -0700 Stephan Hoyer shoyer@gmail.com wrote:
My inclination is that return NaN would be the appropriate choice. It's certainly consistent with the behavior for float dtypes -- my expectation for object dtype behavior is that it works exactly like applying the np.sign ufunc to each element of the array individually.
On the other hand, I suppose there are other ways in which an object can fail all those comparisons (e.g., NaT?), so I suppose we could return None.
Currently:
np.sign(np.timedelta64('nat'))
numpy.timedelta64(-1)
... probably because NaT is -2**63 under the hood. But in this case returning NaT would sound better.
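The guess about the underlying value can be checked with a short sketch that reinterprets a NaT timedelta as a 64-bit integer. The variable names are only illustrative, and the sketch deliberately does not assert what np.sign returns for NaT itself, since that may differ between NumPy versions.

```python
import numpy as np

# A one-element timedelta64 array holding NaT.
nat = np.array(["NaT"], dtype="timedelta64[s]")

# Reinterpret the same 8 bytes as a signed 64-bit integer.
raw = nat.view(np.int64)[0]

print(raw == np.iinfo(np.int64).min)  # is NaT stored as the minimum int64?
print(np.sign(raw))                   # sign of that raw integer
```

If the loop for timedelta64 simply takes the sign of the stored integer, a negative result for NaT follows directly.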
Regards
Antoine.
On Mon, Aug 31, 2015 at 10:31 AM, Antoine Pitrou solipsis@pitrou.net wrote:
On Mon, 31 Aug 2015 10:23:10 -0700 Stephan Hoyer shoyer@gmail.com wrote:
My inclination is that return NaN would be the appropriate choice. It's certainly consistent with the behavior for float dtypes -- my expectation for object dtype behavior is that it works exactly like applying the np.sign ufunc to each element of the array individually.
On the other hand, I suppose there are other ways in which an object can fail all those comparisons (e.g., NaT?), so I suppose we could return None.
Currently:
np.sign(np.timedelta64('nat'))
numpy.timedelta64(-1)
... probably because NaT is -2**63 under the hood. But in this case returning NaT would sound better.
I think this is going through the np.sign timedelta64 loop, and thus is an unrelated issue? It does look like a bug though.
-n
On Mon, 2015-08-31 at 10:23 -0700, Stephan Hoyer wrote:
On Mon, Aug 31, 2015 at 1:23 AM, Sebastian Berg sebastian@sipsolutions.net wrote:
That would be my gut feeling as well. Returning `NaN` could also make sense, but I guess we run into problems since we do not know the input type. So `None` seems like the only option here I can think of right now.
My inclination is that return NaN would be the appropriate choice. It's certainly consistent with the behavior for float dtypes -- my expectation for object dtype behavior is that it works exactly like applying the np.sign ufunc to each element of the array individually.
I was wondering a bit if returning the original object could make sense. It would work for NaN (and also decimal versions of NaN, etc.). But I am not sure in general.
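A minimal sketch of that idea, assuming that "unordered" is detected by the object not being equal to itself: the function name `sign_or_self` is hypothetical. The self-inequality test comes first because ordering comparisons on a `decimal.Decimal` NaN can raise instead of returning False.

```python
from decimal import Decimal


def sign_or_self(x):
    """Return 1, -1 or 0, or the object itself when it is unordered.

    Hypothetical helper. NaN-like objects are detected with x != x, which
    avoids ordering comparisons that may raise for Decimal NaN.
    """
    if x != x:
        return x          # NaN-like: hand the original object back
    if x > 0:
        return 1
    if x < 0:
        return -1
    if x == 0:
        return 0
    return x              # failed every comparison for some other reason


print(sign_or_self(float("nan")))
print(sign_or_self(Decimal("NaN")))
print(sign_or_self(Decimal("-2")))
```

The result keeps the type of the input for the unordered case, which is why it works for both float and decimal NaN. Whether it is sensible for arbitrary objects is the open question.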
- Sebastian
On the other hand, I suppose there are other ways in which an object can fail all those comparisons (e.g., NaT?), so I suppose we could return None. But it would still be a weird outcome for the most common case. Ideally, I suppose, np.sign would return an array with int-NA dtype, but that's a whole different can of worms...
Stephan
On Sun, Aug 30, 2015 at 9:09 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent to choosing a default return for `cmp` for comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns nan for those values, return e.g. None for these cases. This is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types that 3.x produces for some comparisons.
Having read the other replies so far -- given that no one seems to have any clear intuition or use cases, I guess I find option 3 somewhat tempting... it keeps our options open until someone who actually cares comes along with a use case to hone our intuition on, and is very safe in the meantime.
(This was noticed in the course of routine code cleanups, right, not an external bug report? For all we know right now, no actual user has ever even tried to apply np.sign to an object array?)
-n
On Mon, Aug 31, 2015 at 11:49 PM, Nathaniel Smith njs@pobox.com wrote:
On Sun, Aug 30, 2015 at 9:09 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent to choosing a default return for `cmp` for comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns nan for those values, return e.g. None for these cases. This is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types that 3.x produces for some comparisons.
Having read the other replies so far -- given that no one seems to have any clear intuition or use cases, I guess I find option 3 somewhat tempting... it keeps our options open until someone who actually cares comes along with a use case to hone our intuition on, and is very safe in the meantime.
(This was noticed in the course of routine code cleanups, right, not an external bug report? For all we know right now, no actual user has ever even tried to apply np.sign to an object array?)
We do have a user that tried np.sign on an object array, and discovered that our Py3K object comparison was crap: https://github.com/numpy/numpy/issues/6229
No report of anyone trying np.sign on anything other than numbers that we know of, though.
Given the lack of agreement, I'm starting to think I am going to agree with you that raising an error may be the better option, because it's the least likely to break people's code if we later find we need to change it.
Jaime
On 08/31/2015 12:09 AM, Jaime Fernández del Río wrote:
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent to choosing a default return for `cmp` for comparisons. This preserves behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns nan for those values, return e.g. None for these cases. This is my preferred option.
3. Raise an error, along the lines of the TypeError: unorderable types that 3.x produces for some comparisons.
I think np.sign on nan object arrays should raise the error
AttributeError: 'float' object has no attribute 'sign'
If I've understood correctly, currently object arrays work like this:
If a ufunc has an equivalent pure-Python function (e.g., PyNumber_Add for np.add, PyNumber_Absolute for np.abs, > for np.greater), then numpy calls that for objects. Otherwise, if the object defines a method with the same name as the ufunc, numpy calls that method. For example, arccos is a ufunc that has no pure-Python equivalent, so you get the following behavior:
>>> a = np.array([-1], dtype='O')
>>> np.abs(a)
array([1], dtype=object)
>>> np.arccos(a)
AttributeError: 'int' object has no attribute 'arccos'
>>> class MyClass:
...     def arccos(self):
...         return 1
>>> b = np.array([MyClass()], dtype='O')
>>> np.arccos(b)
array([1], dtype=object)
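The dispatch rule described above can be modeled in a few lines of pure Python. This is only a sketch of the described behavior, not numpy's implementation: the table `PYTHON_EQUIVALENTS` and the function `call_unary_object_ufunc` are hypothetical names, and the table lists just two entries for illustration.

```python
import operator

# Hypothetical table: ufunc names that have a plain Python operation.
PYTHON_EQUIVALENTS = {
    "absolute": operator.abs,
    "negative": operator.neg,
}


def call_unary_object_ufunc(name, obj):
    """Model of the object-dtype rule for a single element.

    Use the pure-Python operation if one exists; otherwise look up a
    method with the same name as the ufunc on the object and call it.
    """
    func = PYTHON_EQUIVALENTS.get(name)
    if func is not None:
        return func(obj)
    # No Python equivalent: fall back to a method named after the ufunc.
    return getattr(obj, name)()


class MyClass:
    def arccos(self):
        return 1


print(call_unary_object_ufunc("absolute", -1))
print(call_unary_object_ufunc("arccos", MyClass()))
```

Under this model, a ufunc name that is absent from the table and not defined as a method, such as "sign" on a plain float, ends in an AttributeError from the `getattr` call.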
Now, most comparison operators (e.g., greater) are treated a little specially in loops.c. For some reason, sign is treated just like the other comparison operators, even though technically there is no pure-Python equivalent to sign.
I think that because there is no pure-python 'sign', numpy should attempt to call obj.sign, and in most cases this should fail with the error above. See also http://stackoverflow.com/questions/1986152/why-doesnt-python-have-a-sign-fun...
I think the fix for sign is that the 'sign' ufunc in generate_umath.py should look more like the arccos one, and we should get rid of OBJECT_sign in loops.c. I'm not 100% sure about this since I haven't followed all of how generate_umath.py works yet.
-------
By the way, based on some comments I saw somewhere (apologies, I forget who by!) I wrote up a vision for how ufuncs could work for objects, here: https://gist.github.com/ahaldane/c3f9bcf1f62d898be7c7 I'm a little unsure whether the ideas there are good ones, since they might be made obsolete by the big dtype subclassing improvements being discussed in the numpy roadmap thread.
Allan