Mailman 3 sorting -inf, nan, inf - NumPy-Discussion

In numpy 1.0b1 nan is always left where it began:

EXAMPLE 1

...
...
x

matrix([[ inf], [ -inf], [ nan]])

...
...
x[x.argsort(0),:]

matrix([[ -inf], [ inf], [ nan]])

EXAMPLE 2

...
...
x

matrix([[ nan], [ inf], [ -inf]])

...
...
x[x.argsort(0),:]

matrix([[ nan], [ -inf], [ inf]])

I would like nan to be in the middle between -inf and inf.

------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

Charles R Harris

2:52 p.m.

On 9/19/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

Keith Goodman wrote:

...
In what order would you like argsort to sort the values -inf, nan, inf?

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

-tim

That sounds right. Nan can't be compared because, well, it is undefined. For instance: In [73]: a = arange(2) In [74]: a/0 Out[74]: array([0, 0]) In [75]: a/0.0 Out[75]: array([ nan, inf]) In [76]: a/(-0.0) Out[76]: array([ nan, -inf]) I.e., 0.0/0.0 is undefined. But unless the hardware generates the exception I would be loath to introduce checks in the code. Putting a check in the innermost loop of the sorts would cost significant time. But look at the integer division by zero where the IEEE stuff has no relevence. That sure looks like a bug to me. <snip> Chuck

Sasha

2:57 p.m.

On 9/19/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

I disagree. NumPy sort leaving nan's where they are is probably just a side-effect of nans unusual properties (both nan<x and nan>x is always false), but it is a logical choice. Checking for nan will inevitably slow the most common use case. If you want an exception, just check isnan(x).any() before sort.

A. M. Archibald

3:01 p.m.

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

Keith Goodman wrote:

...
In what order would you like argsort to sort the values -inf, nan, inf?

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

-tim

Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data. I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out. What's happening now is that NaN<a, NaN==a, and NaN>a are all false, which rather confuses the sorting algorithm. But as long as it doesn't actually *break* (that is, leave some of the non-NaNs incorrectly sorted) I don't care. A. M. Archibald

Keith Goodman

3:06 p.m.

On 9/19/06, A. M. Archibald <peridot.faceted@gmail.com> wrote:

...

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
Keith Goodman wrote:

...
In what order would you like argsort to sort the values -inf, nan, inf?

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

-tim

Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data. I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out.

Is there an easy way to use isnan to pull out the nans if the matrix I am sorting has more than one column?

Sasha

3:13 p.m.

On 9/19/06, Keith Goodman <kwgoodman@gmail.com> wrote:

...

Is there an easy way to use isnan to pull out the nans if the matrix I am sorting has more than one column?

There seems to be a "nan_to_num" function that converts nans to zeros, but I would suggest just using fancy indexing to fill the nans with appropriate values: x[isnan(x)] = inf # if you want nans on the right

Tim Hochberg

3:21 p.m.

A. M. Archibald wrote:

...

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
Keith Goodman wrote:

...
In what order would you like argsort to sort the values -inf, nan, inf?

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

-tim

Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data. Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

...

I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out.

Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or doing one or the other depending on the error mode settings would be even better if it is practical.

...

What's happening now is that NaN<a, NaN==a, and NaN>a are all false, which rather confuses the sorting algorithm. But as long as it doesn't actually *break* (that is, leave some of the non-NaNs incorrectly sorted) I don't care.

Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to be no. Try running this a couple of times to see what I mean: n = 10 for i in range(10): for kind in 'quicksort', 'heapsort', 'mergesort': a = rand(n) b = a.copy() a[n//2] = nan a.sort(kind=kind) b.sort(kind=kind) assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b) The values don't correctly cross the inserted NaN and the sort is incorrect. -tim

Charles R Harris

3:56 p.m.

On 9/19/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

A. M. Archibald wrote:

...
On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
Keith Goodman wrote:

...
In what order would you like argsort to sort the values -inf, nan, inf?

Ideally, -inf should sort first, inf should sort last and nan should raise an exception if present.

-tim

Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data. Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

...
I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out.

Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or doing one or the other depending on the error mode settings would be even better if it is practical.

...
What's happening now is that NaN<a, NaN==a, and NaN>a are all false, which rather confuses the sorting algorithm. But as long as it doesn't actually *break* (that is, leave some of the non-NaNs incorrectly sorted) I don't care.

Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to be no. Try running this a couple of times to see what I mean:

n = 10 for i in range(10): for kind in 'quicksort', 'heapsort', 'mergesort': a = rand(n) b = a.copy() a[n//2] = nan a.sort(kind=kind) b.sort(kind=kind) assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b)

The values don't correctly cross the inserted NaN and the sort is incorrect.

Except for heapsort those are doing insertion sorts because n is so small. Heapsort is trickier to understand because of the way the heap is formed and values pulled off. I'll look into that. I bet searchsorted gets messed up by nan's. Do you think it worthwhile to worry about it? Chuck

Tim Hochberg

4:04 p.m.

Charles R Harris wrote:

...

On 9/19/06, *Tim Hochberg* <tim.hochberg@ieee.org <mailto:tim.hochberg@ieee.org>> wrote:

A. M. Archibald wrote: > On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org <mailto:tim.hochberg@ieee.org>> wrote: > >> Keith Goodman wrote: >> >>> In what order would you like argsort to sort the values -inf, nan, inf? >>> >>> >> Ideally, -inf should sort first, inf should sort last and nan should >> raise an exception if present. >> >> -tim >> > > Mmm. Somebody who's working with NaNs has more or less already decided > they don't want to be pestered with exceptions for invalid data. Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

> I'd > be happy if they wound up at either end, but I'm not sure it's worth > hacking up the sort algorithm when a simple isnan() can pull them out. > Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or doing one or the other depending on the error mode settings would be even better if it is practical.

> What's happening now is that NaN<a, NaN==a, and NaN>a are all false, > which rather confuses the sorting algorithm. But as long as it doesn't > actually *break* (that is, leave some of the non-NaNs incorrectly > sorted) I don't care. > Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to be no. Try running this a couple of times to see what I mean:

n = 10 for i in range(10): for kind in 'quicksort', 'heapsort', 'mergesort': a = rand(n) b = a.copy() a[n//2] = nan a.sort(kind=kind) b.sort(kind=kind) assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b)

The values don't correctly cross the inserted NaN and the sort is incorrect.

Except for heapsort those are doing insertion sorts because n is so small. Heapsort is trickier to understand because of the way the heap is formed and values pulled off.

I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

...

I'll look into that. I bet searchsorted gets messed up by nan's. Do you think it worthwhile to worry about it?

No. But that's because it's my contention is that any sequence with NaNs in it is *not sorted* and thus isn't suitable input for searchsorted. -tim

A. M. Archibald

4:15 p.m.

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Not at all. Just temporarily make NaNs compare greater than any other floating-point value and use quicksort (say). You can even do this for python lists without much trouble. That's actually a viable suggestion for numpy's sorting, although it would be kind of ugly to implement: do a quick any(isnan(A)), and if not, use the fast stock sorting algorithms; if there is a NaN somewhere in the array, use a version of the sort that has a tweaked comparison function so the NaNs wind up at the end and are easy to trim off. But the current situation, silently returning arrays in which the non-NaNs are unsorted, is really bad. A. M. Archibald

Tim Hochberg

4:40 p.m.

A. M. Archibald wrote:

...

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Not at all. Just temporarily make NaNs compare greater than any other floating-point value and use quicksort (say). You can even do this for python lists without much trouble.

I misspoke. What I meant here was keeping the behavior that people think that we already have but don't: NaNs stay in place and everything is sorted around them. And even that's not true, since you could just record where the NaNs are, remove them, sort and put them back. What I was really getting at was, that I'm guessing, and it's just a guess, that (a) none of the fast sorting algorithms do anything sensible unless special cased and (b) one could come up with a naive n**2 sort that does do something sensible without special casing (where sensible means leave the NaNs alone).

...

That's actually a viable suggestion for numpy's sorting, although it would be kind of ugly to implement: do a quick any(isnan(A)), and if not, use the fast stock sorting algorithms; if there is a NaN somewhere in the array, use a version of the sort that has a tweaked comparison function so the NaNs wind up at the end and are easy to trim off.

But the current situation, silently returning arrays in which the non-NaNs are unsorted, is really bad.

If your going to do isnan anyway, why not just raise an exception. An array with NaNs in it can't be sorted by any common sense definition of sorting. Any treatment of NaNs is going to be arbitrary, so we might as well make the user specify what they want. "In the face of ambiguity, refuse the temptation to guess" and all that. My favorite solution would be to make sort respect the invalid mode of seterr/geterr. However at the moment it doesn't seem to (in beta4 at least) but neither does add or multiply so those probably need to be looked at again.... -tim

Travis Oliphant

4:48 p.m.

Tim Hochberg wrote:

...

A. M. Archibald wrote:

...
On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Not at all. Just temporarily make NaNs compare greater than any other floating-point value and use quicksort (say). You can even do this for python lists without much trouble.

I misspoke. What I meant here was keeping the behavior that people think that we already have but don't: NaNs stay in place and everything is sorted around them. And even that's not true, since you could just record where the NaNs are, remove them, sort and put them back. What I was really getting at was, that I'm guessing, and it's just a guess, that (a) none of the fast sorting algorithms do anything sensible unless special cased and (b) one could come up with a naive n**2 sort that does do something sensible without special casing (where sensible means leave the NaNs alone).

...
That's actually a viable suggestion for numpy's sorting, although it would be kind of ugly to implement: do a quick any(isnan(A)), and if not, use the fast stock sorting algorithms; if there is a NaN somewhere in the array, use a version of the sort that has a tweaked comparison function so the NaNs wind up at the end and are easy to trim off.

But the current situation, silently returning arrays in which the non-NaNs are unsorted, is really bad.

If your going to do isnan anyway, why not just raise an exception. An array with NaNs in it can't be sorted by any common sense definition of sorting. Any treatment of NaNs is going to be arbitrary, so we might as well make the user specify what they want. "In the face of ambiguity, refuse the temptation to guess" and all that.

My favorite solution would be to make sort respect the invalid mode of seterr/geterr. However at the moment it doesn't seem to (in beta4 at least) but neither does add or multiply so those probably need to be looked at again....

The geterr/seterr stuff changes how IEEE hardware flags are handled in ufuncs. Currently they are not even looked at elsewhere. Are you saying that add and multiply don't respect the invalid flag? If not, then this might be hardware related. Does the IEEE invalid hardware flag get raised on multiplication by nan or only on creation of nan? All the seterr/geterr stuff relies on the hardware flags. We don't do any other checking. -Travis

Tim Hochberg

5:12 p.m.

Travis Oliphant wrote:

...

Tim Hochberg wrote:

...
A. M. Archibald wrote:

...
On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Not at all. Just temporarily make NaNs compare greater than any other floating-point value and use quicksort (say). You can even do this for python lists without much trouble.

I misspoke. What I meant here was keeping the behavior that people think that we already have but don't: NaNs stay in place and everything is sorted around them. And even that's not true, since you could just record where the NaNs are, remove them, sort and put them back. What I was really getting at was, that I'm guessing, and it's just a guess, that (a) none of the fast sorting algorithms do anything sensible unless special cased and (b) one could come up with a naive n**2 sort that does do something sensible without special casing (where sensible means leave the NaNs alone).

...
That's actually a viable suggestion for numpy's sorting, although it would be kind of ugly to implement: do a quick any(isnan(A)), and if not, use the fast stock sorting algorithms; if there is a NaN somewhere in the array, use a version of the sort that has a tweaked comparison function so the NaNs wind up at the end and are easy to trim off.

But the current situation, silently returning arrays in which the non-NaNs are unsorted, is really bad.

If your going to do isnan anyway, why not just raise an exception. An array with NaNs in it can't be sorted by any common sense definition of sorting. Any treatment of NaNs is going to be arbitrary, so we might as well make the user specify what they want. "In the face of ambiguity, refuse the temptation to guess" and all that.

My favorite solution would be to make sort respect the invalid mode of seterr/geterr. However at the moment it doesn't seem to (in beta4 at least) but neither does add or multiply so those probably need to be looked at again....

The geterr/seterr stuff changes how IEEE hardware flags are handled in ufuncs. Currently they are not even looked at elsewhere. Are you saying that add and multiply don't respect the invalid flag? If not, then this might be hardware related. Does the IEEE invalid hardware flag get raised on multiplication by nan or only on creation of nan? It looks like I jumped to conclusions here. I was expecting that with invalid set to 'raise' that an array that (+-*operations on nans would raise an exception. It appears that is incorrect -- only operations that create nans from non-nans will trigger this as you suspected. Similarly, I expected that over='raise' would cause inf+something to raise an exception. Again, not true; this raises an exception only when a new inf is created from non infs. At least on my box.

Interesting. And a little surprising. An interesting tidbit (in an array) inf/inf will raise invalid, but nan/nan will not, it just returns nan. Fun, fun fun!

...

All the seterr/geterr stuff relies on the hardware flags. We don't do any other checking.

Yeah. And just by itself this isn't going to do anything for sort since comparing nans will not set any flags, it's just that the result will be problematic.If one wanted to use these flags for this one would have to use/abuse the result of geterr to trigger an isnan test at the beginning of sort and then warn, raise or call as appropriate. It would probably also not be unreasonable to punt and document sort as failing in the presence of nans. -tim

Charles R Harris

5:55 p.m.

On 9/19/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

...
Tim Hochberg wrote:

...
A. M. Archibald wrote:

...
On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Not at all. Just temporarily make NaNs compare greater than any other floating-point value and use quicksort (say). You can even do this for python lists without much trouble.

I misspoke. What I meant here was keeping the behavior that people

Travis Oliphant wrote: think

...
...
that we already have but don't: NaNs stay in place and everything is sorted around them. And even that's not true, since you could just record where the NaNs are, remove them, sort and put them back. What I was really getting at was, that I'm guessing, and it's just a guess, that (a) none of the fast sorting algorithms do anything sensible unless special cased and (b) one could come up with a naive n**2 sort that does do something sensible without special casing (where sensible means leave the NaNs alone).

...
That's actually a viable suggestion for numpy's sorting, although it would be kind of ugly to implement: do a quick any(isnan(A)), and if not, use the fast stock sorting algorithms; if there is a NaN somewhere in the array, use a version of the sort that has a tweaked comparison function so the NaNs wind up at the end and are easy to trim off.

But the current situation, silently returning arrays in which the non-NaNs are unsorted, is really bad.

If your going to do isnan anyway, why not just raise an exception. An array with NaNs in it can't be sorted by any common sense definition of sorting. Any treatment of NaNs is going to be arbitrary, so we might as well make the user specify what they want. "In the face of ambiguity, refuse the temptation to guess" and all that.

My favorite solution would be to make sort respect the invalid mode of seterr/geterr. However at the moment it doesn't seem to (in beta4 at least) but neither does add or multiply so those probably need to be looked at again....

The geterr/seterr stuff changes how IEEE hardware flags are handled in ufuncs. Currently they are not even looked at elsewhere. Are you saying that add and multiply don't respect the invalid flag? If not, then this might be hardware related. Does the IEEE invalid hardware flag get raised on multiplication by nan or only on creation of nan? It looks like I jumped to conclusions here. I was expecting that with invalid set to 'raise' that an array that (+-*operations on nans would raise an exception. It appears that is incorrect -- only operations that create nans from non-nans will trigger this as you suspected. Similarly, I expected that over='raise' would cause inf+something to raise an exception. Again, not true; this raises an exception only when a new inf is created from non infs. At least on my box.

Interesting. And a little surprising.

An interesting tidbit (in an array) inf/inf will raise invalid, but nan/nan will not, it just returns nan. Fun, fun fun!

...
All the seterr/geterr stuff relies on the hardware flags. We don't do any other checking.

Yeah. And just by itself this isn't going to do anything for sort since comparing nans will not set any flags, it's just that the result will be problematic.If one wanted to use these flags for this one would have to use/abuse the result of geterr to trigger an isnan test at the beginning of sort and then warn, raise or call as appropriate.

It would probably also not be unreasonable to punt and document sort as failing in the presence of nans.

For floats we could use something like: lessthan(a,b) := a < b || (a == nan && b != nan) Which would put all the nans at one end and might not add too much overhead. Chuck

A. M. Archibald

7:41 p.m.

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com> wrote:

...

For floats we could use something like:

lessthan(a,b) := a < b || (a == nan && b != nan)

Which would put all the nans at one end and might not add too much overhead.

You could put an any(isnan()) out front and run this slower version only if there are any NaNs (also, you can't use == for NaNs, you have to use C isNaN). But I'm starting to see the wisdom in simply throwing an exception, since sorting is not well-defined with NaNs. A. M. Archibald

Charles R Harris

7:55 p.m.

On 9/19/06, A. M. Archibald <peridot.faceted@gmail.com> wrote:

...

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com> wrote:

...
For floats we could use something like:

lessthan(a,b) := a < b || (a == nan && b != nan)

Which would put all the nans at one end and might not add too much

overhead.

You could put an any(isnan()) out front and run this slower version only if there are any NaNs (also, you can't use == for NaNs, you have to use C isNaN). But I'm starting to see the wisdom in simply throwing an exception, since sorting is not well-defined with NaNs.

Looks like mergesort can be modified to sort around the NaNs without too much trouble if there is a good isnan function available: just cause the pointers to skip over them. I see that most of the isnan stuff seems to be in the ufunc source and isn't terribly simple. Could be broken out into a separate include, I suppose. I still wonder if it is worth the trouble. As to raising an exception, I seem to recall reading somewhere that exception code tends to be expensive, I haven't done any benchmarks myself. Chuck

Tim Hochberg

8:45 p.m.

Charles R Harris wrote:

...

On 9/19/06, *A. M. Archibald* <peridot.faceted@gmail.com <mailto:peridot.faceted@gmail.com>> wrote:

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com <mailto:charlesr.harris@gmail.com>> wrote: > > > > For floats we could use something like: > > lessthan(a,b) := a < b || (a == nan && b != nan)

I believe this would have to be some sort of isnan macros since everything compares not equal to nan. I forget the correct spelling at the moment though,

...

> > Which would put all the nans at one end and might not add too much overhead.

You could put an any(isnan()) out front and run this slower version only if there are any NaNs (also, you can't use == for NaNs, you have to use C isNaN). But I'm starting to see the wisdom in simply throwing an exception, since sorting is not well-defined with NaNs.

Looks like mergesort can be modified to sort around the NaNs without too much trouble if there is a good isnan function available: just cause the pointers to skip over them. I see that most of the isnan stuff seems to be in the ufunc source and isn't terribly simple. Could be broken out into a separate include, I suppose.

I still wonder if it is worth the trouble. As to raising an exception, I seem to recall reading somewhere that exception code tends to be expensive, I haven't done any benchmarks myself.

I'm still somewhat mystified by the desire to move the nans to one end of the sorted object. I see two scenarios: 1. The user is not expecting to have nans in the data. In this case moving the nans to end is not helpful. The resulting array is still not sorted in the sense that a[i-1]<=a[i]<=a[i+1] does not hold and thus is likely to break code that relies on the array being sorted. The most prominent example of which is searchsorted. In this case you really want to raise an exception if possible since no good will come from letting the code continue to run. In this case the time in involved in throwing and catching an exception is irrelevant. 2. The user *is* expecting to have nans in the data. This is presumably the case that the sorting-nans-to-the-end idea is aimed at. So far at least the suggested use has been to sort and then strip the nans. I suggest that a better approach is to test for and strip the nans before the sort. For example: # a is an array that may have some nans # you can do this more pithily, but I'm aiming to minimize isnan calls # note that this *sometimes* makes a copy. nanmask = isnan(a) if sometrue(nanmask): a = a[not nanmask] a.sort() #..... I presume that isnan is somewhat more expensive than the basic '<' operator. In the proposed sort to end version we need N*log(N) isnan calls versus just N in the above case. The sort to end case probably won't look any cleaner than the above either since you still need to count the nans to determine how many to strip. Perhaps there's some use for the sort to end behaviour that I'm missing, but the raise an exception behaviour sure looks a lot more appealing to me. Here's a strawman proposal: Sort the array. Then examine numpy.geterr()['invalid']. If it is not 'ignore', then check examine sometrue(isnan(thearray)). If the latter is true then raise and error, issue a warning or call the error reporting functioni as appropriate. Note that we always sort the array to be consistent with the behaviour of the ufuncs that proceed even when they end up raising an exception. -tim

A. M. Archibald

9:09 p.m.

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

I'm still somewhat mystified by the desire to move the nans to one end of the sorted object. I see two scenarios:

It's mostly to have something to do with them other than throw an exception. Leaving them in place while the rest of the array is reshuffled requires a lot of work and isn't particularly better. I mostly presented it as an alternative to throwing an exception. Throwing a Python exception now seems like the most reasonable idea. A. M. Archibald

Charles R Harris

9:42 p.m.

On 9/19/06, A. M. Archibald <peridot.faceted@gmail.com> wrote:

...

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
I'm still somewhat mystified by the desire to move the nans to one end of the sorted object. I see two scenarios:

It's mostly to have something to do with them other than throw an exception. Leaving them in place while the rest of the array is reshuffled requires a lot of work and isn't particularly better. I mostly presented it as an alternative to throwing an exception.

Throwing a Python exception now seems like the most reasonable idea.

Well, mergesort can be adapted without a lot of work. Could be used to sort masked arrays too, not that I know why anyone would want that, but then again, I haven't used masked arrays. Agreed, throwing some sort of error still seems the simplest thing to do.

Rick White

20 Sep 20 Sep

7:09 a.m.

On Sep 19, 2006, at 9:45 PM, Tim Hochberg wrote:

...

Perhaps there's some use for the sort to end behaviour that I'm missing, but the raise an exception behaviour sure looks a lot more appealing to me.

FYI, in IDL the NaN values wind up at the end of the sorted array. That's true despite the fact that IDL does respect all the comparison properties of NaNs (i.e. Value>NaN, Value<NaN, and Value==NaN are all false for any value). So clearly the sort behavior was created deliberately. It's also the case that the median for arrays including NaN values is computed as the median of the defined values, ignoring the NaNs. My view is that if the user has NaN values in the array, sort should respect the float exception flags and should only raise an exception if that is what the user has requested.

...

Here's a strawman proposal:

Sort the array. Then examine numpy.geterr()['invalid']. If it is not 'ignore', then check examine sometrue(isnan(thearray)). If the latter is true then raise and error, issue a warning or call the error reporting functioni as appropriate. Note that we always sort the array to be consistent with the behaviour of the ufuncs that proceed even when they end up raising an exception.

Here's another proposal: Make a first pass through the array, replacing NaN values with Inf and counting the NaNs ("nancount"). Raise an exception at this point if NaNs are not supposed to be allowed. Otherwise sort the array, and then as the last step replace the trailing NaNcount values with NaN. It seems to me that this would give predictable results while respecting the exception flags at little extra cost. Rick

Charles R Harris

19 Sep 19 Sep

4:21 p.m.

On 9/19/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

Charles R Harris wrote:

...
On 9/19/06, *Tim Hochberg* <tim.hochberg@ieee.org <mailto:tim.hochberg@ieee.org>> wrote:

A. M. Archibald wrote: > On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org <mailto:tim.hochberg@ieee.org>> wrote: > >> Keith Goodman wrote: >> >>> In what order would you like argsort to sort the values -inf, nan, inf? >>> >>> >> Ideally, -inf should sort first, inf should sort last and nan should >> raise an exception if present. >> >> -tim >> > > Mmm. Somebody who's working with NaNs has more or less already decided > they don't want to be pestered with exceptions for invalid data. Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

> I'd > be happy if they wound up at either end, but I'm not sure it's

worth

...
> hacking up the sort algorithm when a simple isnan() can pull them out. > Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or

doing

...
one or the other depending on the error mode settings would be even better if it is practical.

> What's happening now is that NaN<a, NaN==a, and NaN>a are all false, > which rather confuses the sorting algorithm. But as long as it doesn't > actually *break* (that is, leave some of the non-NaNs incorrectly > sorted) I don't care. > Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to

be

...
no. Try running this a couple of times to see what I mean:

n = 10 for i in range(10): for kind in 'quicksort', 'heapsort', 'mergesort': a = rand(n) b = a.copy() a[n//2] = nan a.sort(kind=kind) b.sort(kind=kind) assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b)

The values don't correctly cross the inserted NaN and the sort is incorrect.

Except for heapsort those are doing insertion sorts because n is so small. Heapsort is trickier to understand because of the way the heap is formed and values pulled off.

I'm not sure where the breakpoint is, but I was seeing failures for all three sort types with N as high as 10000. I suspect that they're all broken in the presence of NaNs. I further suspect you'd need some punishingly slow n**2 algorithm to be robust in the presence of NaNs.

Quicksort and Mergesort are divide and conquer types. When the pieces get below a certain length they finish up with insertion sort as it is more efficient for small bits. In numpy the breakover points are 15 and 20 respectively. I believe microsoft did a project on this and ended up with a number somewhere around 7, but I didn't see that when doing the tuning for numarray way back when. Not that I did a *lot* of careful tuning work ;)

...

I'll look into that. I bet searchsorted gets messed up by nan's. Do

...
you think it worthwhile to worry about it?

No. But that's because it's my contention is that any sequence with NaNs in it is *not sorted* and thus isn't suitable input for searchsorted.

If this sort of thing can cause unexpected errors I wonder if it would be worth it to have a global debugging flag that essentially causes isnan to be called before any function applications. Chuck

A. M. Archibald

4:28 p.m.

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com> wrote:

...

If this sort of thing can cause unexpected errors I wonder if it would be worth it to have a global debugging flag that essentially causes isnan to be called before any function applications.

That sounds very like the IEEE floating-point flags, which would be extremely useful to have, and which are being wored on, IIRC. A. M. Archibald

Tim Hochberg

4:37 p.m.

A. M. Archibald wrote:

...

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com> wrote:

...
If this sort of thing can cause unexpected errors I wonder if it would be worth it to have a global debugging flag that essentially causes isnan to be called before any function applications.

That sounds very like the IEEE floating-point flags, which would be extremely useful to have, and which are being wored on, IIRC.

Are you referring to numpy.geterr / seterr? These exist, but they don't seem to be working right as of 1.0b4 (my installation is a little out of date). -tim

Charles R Harris

5 p.m.

On 9/19/06, A. M. Archibald <peridot.faceted@gmail.com> wrote:

...

On 19/09/06, Charles R Harris <charlesr.harris@gmail.com> wrote:

...
If this sort of thing can cause unexpected errors I wonder if it would be worth it to have a global debugging flag that essentially causes isnan to be called before any function applications.

That sounds very like the IEEE floating-point flags, which would be extremely useful to have, and which are being wored on, IIRC.

Thinking a bit, keeping the values in place isn't easy. Mergesort isn't fixable because values can be moved in front of the nan before it is ever looked at. Nor can it be easily set up to leave all the nans at one end because both a < nan and nan < a return false. Quicksort might be doable with some checks. I mean, what if the selected pivot is a nan? The median of three version used also needs thinking about. Hmm. But I think it is the insertion sort that is messing up the order in mergesort as now nothing will move past the nan even if it has to. That could be fixed, but the nan's would still move around. I think the best thing to do is punt unless the hardware can be set to do something. Chuck

Christopher Barker

20 Sep 20 Sep

5:18 p.m.

Charles R Harris wrote:

...

Thinking a bit, keeping the values in place isn't easy.

Why the heck would "in place" be desirable for sorted data anyway? I understand that it means that if there is a NaN in the nth position before sorting, there will be one in the nth position after sorting. However, I see absolutely no reason at all why that would be useful (or any more useful than putting them anywhere else) A couple years ago, there was a long debate on this list about whether numpy should pass -inf, NaN, and +inf through all the ufuncs without error. there were two schools of thought: 1) They indicate a problem, the programmer should know about hat problem as soon as it occurs, not at the end of the computation, many steps later, when they might get presented with nothing but NaNs. 2) The whole point of "vector" computation is that you can act on a whole bunch of numbers at once. If only subset of those numbers are invalid, why stop the process. Raising an error when a single number has a problem defeats the purpose of vector operations. It seems that numpy has settled on school of thought (2), at least by default. That being the case, it should apply to sorting also. If it does, then that means no exception will be raised, but it makes no difference where the heck the NaNs end up in the sorted array, as long as everything else is in order. NaN means exactly what it's called: it's not a number, so it doesn't matter what you do with them, as long as they are preserved and don't mess up other things. Let the coder decide what they want to so with them, and when they want to do it. Personally, I'd prefer that they all ended up at the beginning or end after sorting, but it really doesn't much matter. That being said, if it's impossible to do a efficient sort with NaNs mixed in, then we'll just have to live with it. It really would be best if an exception was raised if the non-NaN values are not going to be sorted correctly -- that really would surprise people!

...

It would probably also not be unreasonable to punt and document sort as failing in the presence of nans.

That would be one of the worst options, but may be the only one available. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

A. M. Archibald

19 Sep 19 Sep

4:09 p.m.

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...

A. M. Archibald wrote:

...
Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data. Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

Well, I said that because for an image porcessing project I was doing, the easiest thing to do with certain troublesome pixels was to fill in NaNs, and then at the end replace the NaNs with sensible values. It seems as if the point of NaNs is to allow you to keep working with those numbers that make sense while ingoring those that don't. If you wanted exceptions, why not get them as soon as the first NaN would have been generated?

...

...
I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out.

Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or doing one or the other depending on the error mode settings would be even better if it is practical.

I was just thinking in terms of easy removal.

...

Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to be no. Try running this a couple of times to see what I mean:

...

The values don't correctly cross the inserted NaN and the sort is incorrect.

You're quite right: when NaNs are present in the array, sorting and then removing them does not yield a sorted array. For example, mergesort just output [ 2. 4. 6. 9. nan 0. 1. 3. 5. 7. 8. ] The other two are no better (and arguably worse). Python's built-in sort() for lists has the same problem. This is definitely a bug, and the best way to fix it is not clear to me - perhaps sort() needs to always do any(isnan(A)) before starting to sort. I don't really like raising an exception, but sort() isn't really very meaningful with NaNs in the array. The only other option I can think of is to somehow remove them, sort without, and reintroduce them at the end, which is going to be a nightmare when sorting a single axis of a large array. Or, I suppose, sort() could simply fill the array with NaNs; I'm sure users will love that. A. M. Archibald

Xavier Gnata

20 Sep 20 Sep

8:15 a.m.

IMHO, the only correct way to handle this case is to raise an exception. It does not make sense to compare NaN and "real" numbers. It could be very confusing not to raise an exception. Xavier.

...

On 19/09/06, Tim Hochberg <tim.hochberg@ieee.org> wrote:

...
A. M. Archibald wrote:

...
Mmm. Somebody who's working with NaNs has more or less already decided they don't want to be pestered with exceptions for invalid data.

Do you really think so? In my experience NaNs are nearly always just an indication of a mistake somewhere that didn't get trapped for one reason or another.

Well, I said that because for an image porcessing project I was doing, the easiest thing to do with certain troublesome pixels was to fill in NaNs, and then at the end replace the NaNs with sensible values. It seems as if the point of NaNs is to allow you to keep working with those numbers that make sense while ingoring those that don't. If you wanted exceptions, why not get them as soon as the first NaN would have been generated?

...
...
I'd be happy if they wound up at either end, but I'm not sure it's worth hacking up the sort algorithm when a simple isnan() can pull them out.

Moving them to the end seems to be the worst choice to me. Leaving them alone is fine with me. Or raising an exception would be fine. Or doing one or the other depending on the error mode settings would be even better if it is practical.

I was just thinking in terms of easy removal.

...
Is that true? Are all of numpy's sorting algorithms robust against nontransitive objects laying around? The answer to that appears to be no. Try running this a couple of times to see what I mean:

...
The values don't correctly cross the inserted NaN and the sort is incorrect.

You're quite right: when NaNs are present in the array, sorting and then removing them does not yield a sorted array. For example, mergesort just output [ 2. 4. 6. 9. nan 0. 1. 3. 5. 7. 8. ]

The other two are no better (and arguably worse). Python's built-in sort() for lists has the same problem.

This is definitely a bug, and the best way to fix it is not clear to me - perhaps sort() needs to always do any(isnan(A)) before starting to sort. I don't really like raising an exception, but sort() isn't really very meaningful with NaNs in the array. The only other option I can think of is to somehow remove them, sort without, and reintroduce them at the end, which is going to be a nightmare when sorting a single axis of a large array. Or, I suppose, sort() could simply fill the array with NaNs; I'm sure users will love that.

A. M. Archibald

------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

-- ############################################ Xavier Gnata CRAL - Observatoire de Lyon 9, avenue Charles André 69561 Saint Genis Laval cedex Phone: +33 4 78 86 85 28 Fax: +33 4 78 86 83 86 E-mail: gnata@obs.univ-lyon1.fr ############################################

6624

Age (days ago)

6625

Last active (days ago)

List overview

Download

27 comments

9 participants

participants (9)

A. M. Archibald
Charles R Harris
Christopher Barker
Keith Goodman
Rick White
Sasha
Tim Hochberg
Travis Oliphant
Xavier Gnata

sorting -inf, nan, inf

Keith Goodman

Keith Goodman

Xavier Gnata

tags

participants (9)