Mailman 3 Revert the return of a single NaN for `np.unique` with floating point numbers? - NumPy-Discussion

Revert the return of a single NaN for `np.unique` with floating point numbers?

Sebastian Berg

Aug. 2, 2021

10:03 a.m.

Hi all, In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique): a = np.array([1, 1, np.nan, np.nan, np.nan]) Before 1.21: >>> np.unique(a) array([ 1., nan, nan, nan]) After 1.21: array([ 1., nan]) This change was requested in an old issue: https://github.com/numpy/numpy/issues/2111 And happened here: https://github.com/numpy/numpy/pull/18070 While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone? Cheers, Sebastian PS: One additional note, is that this does not work for object arrays (it cannot reasonable): >>> np.unique(a.astype(object)) array([1.0, nan, nan, nan], dtype=object)

Attachments:

signature.asc (application/pgp-signature — 833 bytes)

Show replies by date

Ralf Gommers

August 2021

10:49 a.m.

On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...

Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

>>> np.unique(a) array([ 1., nan, nan, nan])

After 1.21:

array([ 1., nan])

This change was requested in an old issue:

https://github.com/numpy/numpy/issues/2111

And happened here:

https://github.com/numpy/numpy/pull/18070

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me. Cheers, Ralf

Matti Picus

1:44 a.m.

On 2/8/21 8:49 pm, Ralf Gommers wrote:

...

On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net <mailto:sebastian@sipsolutions.net>> wrote:

Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

>>> np.unique(a) array([ 1., nan, nan, nan])

After 1.21:

array([ 1., nan])

This change was requested in an old issue:

https://github.com/numpy/numpy/issues/2111 <https://github.com/numpy/numpy/issues/2111>

And happened here:

https://github.com/numpy/numpy/pull/18070 <https://github.com/numpy/numpy/pull/18070>

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me.

Cheers, Ralf

We were discussing this today (me and Matthew) and came up with an edge case when using set(a), it will return the old value. We should add this as a documented "feature" Matti

Ralf Gommers

November 2021

2:08 p.m.

On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:

...

On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...
Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

>>> np.unique(a) array([ 1., nan, nan, nan])

After 1.21:

array([ 1., nan])

This change was requested in an old issue:

https://github.com/numpy/numpy/issues/2111

And happened here:

https://github.com/numpy/numpy/pull/18070

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me.

It turns out there's a lot more to this story. The short summary of it is that: - use case wise it's still true that considering NaN's equal makes more sense. - scikit-learn has a custom implementation which removes duplicate NaN's: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode... - all other array/tensor libraries match NumPy's old behavior (return multiple NaN's) - there's a performance and implementation cost for CuPy et al. to match NumPy's changed behavior (multiple NaN's is the natural result, no extra checks needed) - There is also a significant performance cost for NumPy, ~25% for small/medium-size arrays with no/few NaN's (say the benchmarks in https://github.com/numpy/numpy/pull/18070 - which is a common case, and that's not negligible like the PR description claims. - the "single NaN" behavior is easy to get as a utility function on top of the "multiple NaN' (like scikit-learn does), the opposite is much harder - for the array API standard, the final decision was to go with the "multiple NaN' behavior, so we'll need that in `numpy.array_api`. - more discussion in https://github.com/data-apis/array-api/issues/249 It would be good to make up our minds before 1.22.0. Two choices: 1. We can leave it as it is now and work around the issue in `numpy.array_api`. It would also require CuPy and others to make changes, which probably cost performance. 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a release will be made. There is something to say for either option. I'd personally lean slightly towards reverting because of both consistency with other implementations and performance. But it seems like there's no optimal solution here, it's a fairly hairy issue. Thoughts? Ralf

Juan Nunez-Iglesias

4:44 p.m.

I agree with the argument to revert. Consistency among libraries should be near the top of the list of priorities, as should performance. Whether the new behaviour "makes more sense", meanwhile, is debatable. On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:

...

On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:

...
On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...
Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

>>> np.unique(a) array([ 1., nan, nan, nan])

After 1.21:

array([ 1., nan])

This change was requested in an old issue:

https://github.com/numpy/numpy/issues/2111

And happened here:

https://github.com/numpy/numpy/pull/18070

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me.

It turns out there's a lot more to this story. The short summary of it is that: - use case wise it's still true that considering NaN's equal makes more sense. - scikit-learn has a custom implementation which removes duplicate NaN's: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode... - all other array/tensor libraries match NumPy's old behavior (return multiple NaN's) - there's a performance and implementation cost for CuPy et al. to match NumPy's changed behavior (multiple NaN's is the natural result, no extra checks needed) - There is also a significant performance cost for NumPy, ~25% for small/medium-size arrays with no/few NaN's (say the benchmarks in https://github.com/numpy/numpy/pull/18070 - which is a common case, and that's not negligible like the PR description claims. - the "single NaN" behavior is easy to get as a utility function on top of the "multiple NaN' (like scikit-learn does), the opposite is much harder - for the array API standard, the final decision was to go with the "multiple NaN' behavior, so we'll need that in `numpy.array_api`. - more discussion in https://github.com/data-apis/array-api/issues/249

It would be good to make up our minds before 1.22.0. Two choices: 1. We can leave it as it is now and work around the issue in `numpy.array_api`. It would also require CuPy and others to make changes, which probably cost performance. 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a release will be made.

There is something to say for either option. I'd personally lean slightly towards reverting because of both consistency with other implementations and performance. But it seems like there's no optimal solution here, it's a fairly hairy issue.

Thoughts?

Ralf

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: jni@fastmail.com

Sebastian Berg

10:05 a.m.

On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:

...

I agree with the argument to revert. Consistency among libraries should be near the top of the list of priorities, as should performance. Whether the new behaviour "makes more sense", meanwhile, is debatable.

I don't have much of an opinion either way. But while correctness is debatable, is there any reason someone would prefer getting multiple (potentially very man) NaNs? I admit it feels more like a trap to me. About Performance ----------------- Note that at least on CPUs there is no reason why there even is an overhead [1]! (We simply miss an `np.islessgreater` function.) The overhead is a bit annoying admittedly, although pretty typical for NumPy, I guess. This one seems to be reducible to around 10% if you use `a != a` instead of `isnan` (the ufunc incurs a lot of overhead). Which is on par with 1.19.x, since NumPy got faster ;). About Reverting --------------- It feels like this should have a compat release note, which seems like a very strong flag for "not just a bug fix". But maybe there is a good reason to break with that? Cheers, Sebastian [1] C99 has `islessgreater` and both SSE2 and AVX have intrinsics which allow reformulating the whole thing to have exactly the same speed as before, these intrinsics effectively calculate things like: !(a < b) or !(a != b) Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a problem that we already have and is even now not really avoidable). So at least for all relevant CPUs, the implementation detail that we currently don't have `islessgreater`. (I have not found information of whether the C99 `islessgreater` has decent performance on CUDA) (Not surprisingly, MSVC is arguably buggy and does to not compile the C99 `islessgreate` to fast byte-code – although plausibly link-time optimizer may safe that. But since we have hand coded SSE2/AVX versions of comparisons, even that would probably not matter)

...

On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:

...
On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:

...
On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...
Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

    a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

    >>> np.unique(a)     array([ 1., nan, nan, nan])

After 1.21:

    array([ 1., nan])

This change was requested in an old issue:

     https://github.com/numpy/numpy/issues/2111

And happened here:

     https://github.com/numpy/numpy/pull/18070

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me.

It turns out there's a lot more to this story. The short summary of it is that: - use case wise it's still true that considering NaN's equal makes more sense. - scikit-learn has a custom implementation which removes duplicate NaN's: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode... - all other array/tensor libraries match NumPy's old behavior (return multiple NaN's) - there's a performance and implementation cost for CuPy et al. to match NumPy's changed behavior (multiple NaN's is the natural result, no extra checks needed) - There is also a significant performance cost for NumPy, ~25% for small/medium-size arrays with no/few NaN's (say the benchmarks in https://github.com/numpy/numpy/pull/18070 - which is a common case, and that's not negligible like the PR description claims. - the "single NaN" behavior is easy to get as a utility function on top of the "multiple NaN' (like scikit-learn does), the opposite is much harder - for the array API standard, the final decision was to go with the "multiple NaN' behavior, so we'll need that in `numpy.array_api`. - more discussion in https://github.com/data-apis/array-api/issues/249

It would be good to make up our minds before 1.22.0. Two choices: 1. We can leave it as it is now and work around the issue in `numpy.array_api`. It would also require CuPy and others to make changes, which probably cost performance. 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a release will be made.

There is something to say for either option. I'd personally lean slightly towards reverting because of both consistency with other implementations and performance. But it seems like there's no optimal solution here, it's a fairly hairy issue.

Thoughts?

Ralf

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: jni@fastmail.com

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: sebastian@sipsolutions.net

Ralf Gommers

1:56 p.m.

An update on this: after discussion in the community meeting the preferred direction is to add an `equal_nans` keyword and support both behaviors. Given that we won't get to this, I proposed on https://github.com/numpy/numpy/issues/20326 to make the default False, i.e. the same as for 1.21.x Cheers, Ralf On Mon, Nov 8, 2021 at 7:06 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...

On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:

...
I agree with the argument to revert. Consistency among libraries should be near the top of the list of priorities, as should performance. Whether the new behaviour "makes more sense", meanwhile, is debatable.

I don't have much of an opinion either way. But while correctness is debatable, is there any reason someone would prefer getting multiple (potentially very man) NaNs? I admit it feels more like a trap to me.

About Performance -----------------

Note that at least on CPUs there is no reason why there even is an overhead [1]! (We simply miss an `np.islessgreater` function.)

The overhead is a bit annoying admittedly, although pretty typical for NumPy, I guess. This one seems to be reducible to around 10% if you use `a != a` instead of `isnan` (the ufunc incurs a lot of overhead). Which is on par with 1.19.x, since NumPy got faster ;).

About Reverting ---------------

It feels like this should have a compat release note, which seems like a very strong flag for "not just a bug fix". But maybe there is a good reason to break with that?

Cheers,

Sebastian

[1] C99 has `islessgreater` and both SSE2 and AVX have intrinsics which allow reformulating the whole thing to have exactly the same speed as before, these intrinsics effectively calculate things like:

!(a < b) or !(a != b)

Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a problem that we already have and is even now not really avoidable).

So at least for all relevant CPUs, the implementation detail that we currently don't have `islessgreater`. (I have not found information of whether the C99 `islessgreater` has decent performance on CUDA)

(Not surprisingly, MSVC is arguably buggy and does to not compile the C99 `islessgreate` to fast byte-code – although plausibly link-time optimizer may safe that. But since we have hand coded SSE2/AVX versions of comparisons, even that would probably not matter)

...
On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:

...
On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:

...
On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:

...
Hi all,

In NumPy 1.21, the output of `np.unique` changed in the presence of multiple NaNs. Previously, all NaNs were returned when we now only return one (all NaNs were considered unique):

a = np.array([1, 1, np.nan, np.nan, np.nan])

Before 1.21:

>>> np.unique(a) array([ 1., nan, nan, nan])

After 1.21:

array([ 1., nan])

This change was requested in an old issue:

https://github.com/numpy/numpy/issues/2111

And happened here:

https://github.com/numpy/numpy/pull/18070

While, it has a release note. I am not sure the change got the attention it deserved. This would be especially worrying if it is a regression for anyone?

I think it's now the expected answer, not a regression. `unique` is not an elementwise function that needs to adhere to IEEE-754 where nan != nan. I can't remember reviewing this change, but it makes perfect sense to me.

It turns out there's a lot more to this story. The short summary of it is that: - use case wise it's still true that considering NaN's equal makes more sense. - scikit-learn has a custom implementation which removes duplicate NaN's:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode...

...
- all other array/tensor libraries match NumPy's old behavior (return multiple NaN's) - there's a performance and implementation cost for CuPy et al. to match NumPy's changed behavior (multiple NaN's is the natural result, no extra checks needed) - There is also a significant performance cost for NumPy, ~25% for small/medium-size arrays with no/few NaN's (say the benchmarks in https://github.com/numpy/numpy/pull/18070 - which is a common case, and that's not negligible like the PR description claims. - the "single NaN" behavior is easy to get as a utility function on top of the "multiple NaN' (like scikit-learn does), the opposite is much harder - for the array API standard, the final decision was to go with the "multiple NaN' behavior, so we'll need that in `numpy.array_api`. - more discussion in https://github.com/data-apis/array-api/issues/249

It would be good to make up our minds before 1.22.0. Two choices: 1. We can leave it as it is now and work around the issue in `numpy.array_api`. It would also require CuPy and others to make changes, which probably cost performance. 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a release will be made.

There is something to say for either option. I'd personally lean slightly towards reverting because of both consistency with other implementations and performance. But it seems like there's no optimal solution here, it's a fairly hairy issue.

Thoughts?

Ralf

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: jni@fastmail.com

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: sebastian@sipsolutions.net

_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com

1189

Age (days ago)

1293

Last active (days ago)

List overview

Download

6 comments

4 participants

participants (4)

Juan Nunez-Iglesias
Matti Picus
Ralf Gommers
Sebastian Berg

Revert the return of a single NaN for `np.unique` with floating point numbers?

Sebastian Berg

Ralf Gommers

Matti Picus

Ralf Gommers

Juan Nunez-Iglesias

Sebastian Berg

Ralf Gommers

tags

participants (4)