On 3/5/20 9:10 AM, Steven D'Aprano wrote:
> On Thu, Mar 05, 2020 at 08:23:22AM -0500, Richard Damon wrote:
>> Yes, that is the idea of AlmostTotalOrder, to have algorithms that really require a total order (like sorting)
>
> Sorting doesn't require a total order. Sorting only requires a weak order where the only operator required is the "comes before" operator, or less than. That's precisely how sorting in Python is implemented.
>
> Here is an interesting discussion of a practical use-case of sorting data with a partial order:
>
> https://blog.thecybershadow.net/2018/11/18/d-compilation-is-too-slow-and-i-a...

Reading that, yes, there are applications of sorting that don't need total order, but as the article points out, many of the general purpose sorting algorithms do (like the one that Python uses in sort).
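For reference, a small sketch of what "only less than is required" means in practice. The Card class is hypothetical and defines nothing but __lt__, yet sorted() accepts it. The last two lines show why NaN is a problem for that contract: '<' answers False in both directions, so NaN looks "equivalent" to every other float even though those floats are not equivalent to each other.

```python
class Card:
    """Hypothetical type that defines only '<' (no __gt__, __le__, ...)."""

    def __init__(self, rank):
        self.rank = rank

    def __lt__(self, other):
        return self.rank < other.rank


cards = [Card(3), Card(1), Card(2)]
ranks = [c.rank for c in sorted(cards)]
print(ranks)  # [1, 2, 3]

# NaN answers False to "comes before" in both directions.
nan = float("nan")
print(nan < 1.0, 1.0 < nan)  # False False
```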
>> but we really need to use a type that has these exceptional values. Imagine that sort/median was defined to type check its parameter,
>
> No need to imagine it, sort already type-checks its arguments:
>
> py> sorted([1, 3, 5, "Hello", 2])
> TypeError: '<' not supported between instances of 'str' and 'int'
If you consider that proper type checking, then you must consider that the proper answer for the median of a list of numbers that contains a NaN is any of the numbers in the list. If sort had an easy/cheap way to confirm that the values passed to it met its assumptions, then it could make a reasonable response.
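One possible reading of "an easy/cheap way to confirm" is a single linear pass over the input before sorting. The sketch below is only an assumption about what such a check could look like, not how list.sort behaves; checked_sorted is a hypothetical helper.

```python
import math


def checked_sorted(values):
    """Hypothetical helper: refuse to sort input that contains a float NaN.

    The pre-pass is O(n), which is cheap next to the O(n log n) sort.
    """
    values = list(values)
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            raise ValueError("cannot sort: input contains NaN")
    return sorted(values)


print(checked_sorted([3.0, 1.0, 2.0]))  # [1.0, 2.0, 3.0]

try:
    checked_sorted([3.0, float("nan"), 1.0])
except ValueError as exc:
    print("rejected:", exc)
```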
>> and that meant that you couldn't take the median of a list of floats (because float has the NaN value that breaks TotalOrder).
>
> Dealing with NANs depends on what you want to do with the data. If you are sorting for presentation purposes, what you probably want is to sort with a custom key that pushes all the NANs to the front (or rear) of the list.
>
> If you are sorting for the purposes of calculating the median, it depends. There are at least three reasonable strategies for median:
>
> - ignore the NANs;
> - return a NAN;
> - raise an exception.
>
> Personally, I think that the first is by far the most practical: if you have NANs in your statistical data, that's probably because they've come from some other library or application that is using them to represent missing values, and if that's the case, the right thing to do is to ignore them.
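A minimal sketch of the ideas quoted above: a sort key that pushes NaNs to the rear, and the three median strategies. Both nan_last_key and median_with_policy, including the nan_policy parameter, are names invented for this sketch; only statistics.median and math.isnan come from the standard library.

```python
import math
import statistics


def nan_last_key(x):
    """Sort key placing every NaN after all ordinary floats."""
    return (1, 0.0) if math.isnan(x) else (0, x)


def median_with_policy(data, nan_policy="ignore"):
    """Hypothetical wrapper: nan_policy is 'ignore', 'return' or 'raise'."""
    if nan_policy not in ("ignore", "return", "raise"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    data = list(data)
    has_nan = any(math.isnan(x) for x in data)
    if has_nan and nan_policy == "raise":
        raise ValueError("data contains NaN")
    if has_nan and nan_policy == "return":
        return math.nan
    # 'ignore': drop the NaNs and take the median of what is left.
    return statistics.median(x for x in data if not math.isnan(x))


data = [3.0, float("nan"), 1.0, 2.0]
ordered = sorted(data, key=nan_last_key)
print(ordered[:3])                         # [1.0, 2.0, 3.0]
print(median_with_policy(data))            # 2.0
print(median_with_policy(data, "return"))  # nan
```

Swapping the 1 and 0 in nan_last_key would push the NaNs to the front instead.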
There was a discussion not that long ago about that very topic. All those options can be reasonable, but ignoring seems to me to be one of the worse options for a simple package (but reasonable for one where the whole package uses that convention). The danger of it is that if you get a NaN as a result of a computation generating your data, that error gets hidden by having the data just be ignored.

I would say that in Python, it would make a lot more sense to use None as the missing data code, and leave NaN for invalid data/computations. That way you keep things explicit. The use of NaN here goes back to the use of strictly statically typed languages for doing this, where NaN was a convenient special value to mark it (prior to the invention of NaN, you just used an impossible value for these).

-- 
Richard Damon
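As a rough sketch of the None-for-missing, NaN-for-invalid convention described above: None entries are skipped, while a NaN is treated as an error rather than silently dropped. The name median_of_present is hypothetical.

```python
import math
import statistics


def median_of_present(data):
    """Hypothetical helper: None means 'missing', NaN means 'invalid'."""
    present = [x for x in data if x is not None]
    if any(math.isnan(x) for x in present):
        raise ValueError("NaN in data: an upstream computation went wrong")
    return statistics.median(present)


print(median_of_present([4.0, None, 1.0, None, 3.0]))  # 3.0
```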