On Mon, Jan 7, 2019, 11:38 AM Steven D'Aprano <steve@pearwood.info wrote:

Its not a bug in median(), because median requires the data implement a total order. Although that isn't explicitly documented, it is common sense: if the data cannot be sorted into smallest-to-largest order, how can you decide which value is in the middle?

I can see reason that median per-se requires a total order. Yes, the implementation chosen (and many reasonable and obvious implementations) make that assumption. But here is a perfectly reasonable definition of median:

* A median is an element of a collection such that 1/2 of all elements of the collection are less than it.

Depending on how you interpret median, this element might also not be in the original collection, but be some newly generated value that has that property. E.g. statistics.median([1,2,3,4]) == 2.5.

Under a partial ordering, a median may not be unique. Even under a total ordering this is true if some subset of elements form an equivalence class. But under partial ordering, the non-uniqueness can get much weirder.

What is explicitly documented is that median requires numeric data, and

NANs aren't numbers. So the only bug here is the caller's failure to

filter out NANs. If you pass it garbage data, you get garbage results.

OK, then we should either raise an exception or propagate the NaN if that is the intended meaning of the function. And obviously document that such is the assumption. NaN's *are* explicitly in the floating-point domain, so it's fuzzy whether they are numeric or not, notwithstanding the name.

I'm very happy to push NaN-filtering to users (as NumPy does, although it provides alternate functions for many reductions that incorporate this... the basic ones always propagate NaNs though).

Nevertheless, it is a perfectly reasonable thing to want to use data

which may or may not contain NANs, and I want to enhance the statistics

module to make it easier for the caller to handle NANs in whichever way

they see fit. This is a new feature, not a bug fix.

I disagree about bug vs. feature. The old behavior is simply and unambiguously wrong, but was not previously noticed. Obviously, the bug does not affect most uses, which is why it was not noticed.

If you truly believe that, then you should also believe that both

list.sort() and the bisect module are buggy, for precisely the same

reason.

I cannot perceive any close connection between the correct behavior of statistics.mean() and that of list.sort() or bisect. I know the concrete implementation of the former uses the latter, but the answers for what is RIGHT feel completely independent to me.

I doubt Quickselect will be immune to the problem of NANs. It too relies

on comparisons, and while I don't know for sure that it requires a total

order, I'd be surprised if it doesn't. Quickselect is basically a

variant of Quicksort that only partially sorts the data.

Yes, I was thinking of trying to tweak Quickselect to handle NaNs during the process. I.e. probably terminate and propagate the NaN early, as soon as one is encountered. That might save much of the work if a NaN is encountered early and most comparisons and moves can be avoided. Of course, I'm sure there is a worst case where almost all the work is done before a NaN check is performed in some constructed example.