
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'll provide a suggested batch on the bug. It will simply be a wholly different implementation of median and friends.
I ask for a documentation patch and you start talking about a whole new implementation. Huh. A new implementation with precisely the same behaviour is a waste of time, so I presume you're planning to change the behaviour. How about if you start off by explaining what the new semantics are?
I think it would be counter-productive to document the bug (as something other than a bug).
Its not a bug in median(), because median requires the data implement a total order. Although that isn't explicitly documented, it is common sense: if the data cannot be sorted into smallest-to-largest order, how can you decide which value is in the middle? What is explicitly documented is that median requires numeric data, and NANs aren't numbers. So the only bug here is the caller's failure to filter out NANs. If you pass it garbage data, you get garbage results. Nevertheless, it is a perfectly reasonable thing to want to use data which may or may not contain NANs, and I want to enhance the statistics module to make it easier for the caller to handle NANs in whichever way they see fit. This is a new feature, not a bug fix.
Picking what is a completely arbitrary element in face of a non-total order can never be "correct" behavior, and is never worth preserving for compatibility.
If you truly believe that, then you should also believe that both list.sort() and the bisect module are buggy, for precisely the same reason. Perhaps you ought to raise a couple of bug reports, and see if you can get Tim and Raymond to agree that sorting and bisect should do something other than what they already do in the face of data that doesn't define a total order.
I think the use of statistics.median against partially ordered elements is simply rare enough that no one tripped against it, or at least no one reported it before.
I'm sure it is rare. Nevertheless, I still want to make it easier for people to deal with this case.
Notice that the code itself pretty much recognizes the bug in this comment:
# FIXME: investigate ways to calculate medians without sorting? Quickselect?
I doubt Quickselect will be immune to the problem of NANs. It too relies on comparisons, and while I don't know for sure that it requires a total order, I'd be surprised if it doesn't. Quickselect is basically a variant of Quicksort that only partially sorts the data.
So it seems like the original author knew the implementation was wrong.
That's not why I put that comment in. Sorting is O(N log N) on average, and Quickselect can be O(N) on average. In principle, Quickselect or a similar selection algorithm could be faster than sorting. [...]
It's not hard to manually check for NaNs and generate those in your own code.
That is correct, but by that logic, we don't need to support *any* form of NAN handling at all. It is easy (if inefficent) for the caller to pre-filter their data. I want to make it easier and more convenient and avoid having to iterate over the data twice if it isn't necessary. -- Steve