[New-bugs-announce] [issue21592] Make statistics.median run in linear time

Thomas Dybdahl Ahle report at bugs.python.org
Wed May 28 14:11:38 CEST 2014

New submission from Thomas Dybdahl Ahle:

The statistics module currently contains the following comment:

"FIXME: investigate ways to calculate medians without sorting? Quickselect?"

This is important, because users expect standard library functions to use state of the art implementations, and median by sorting has never been that.

There are many good linear time alternatives, the classical median-of-k algorithm was posted to the mailing list in a nice version by David Eppstein in 2002 [1]. The fastest method in practice is probably Quick Select in the Floyd-Rivest version [2] which is similar to quick sort.

These algorithms also have the feature of letting us select any k-th order statistic, not just the median. This seems conceptually a lot simpler than the current median/median_low/median_high split.

However, sticking with the current api, a new implementation would have to support calculating a median as the average of n//2 and (n+1)//2'th order statistics. This could be implemented either by calling the select function twice, or by implementing a multi-select algorithm, which is also a well studied problem [3].

I'll be happy to contribute code, or help out in any other way.

[1]: https://mail.python.org/pipermail/python-list/2002-May/132170.html
[2]: https://en.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm
[3]: https://domino.mpi-inf.mpg.de/intranet/ag1/ag1publ.nsf/0/59C74289A2291143C12571C30017DEA8/$file/Mehlhorn_a_2005_o.pdf

components: Library (Lib)
messages: 219265
nosy: Thomas.Dybdahl.Ahle
priority: normal
severity: normal
status: open
title: Make statistics.median run in linear time
type: performance
versions: Python 3.4, Python 3.5

Python tracker <report at bugs.python.org>

More information about the New-bugs-announce mailing list