Adding quantile to the statistics module

Hello everyone,

Sorry if this subject has already been covered in the mailing list, but I could not find it. My question is very simple: should a `quantile` function be added to the Python `statistics` module? I was very happy to learn of the existence of this module in Python 3, only to later be forced to install numpy to compute a 0.7 quantile, which is all the more frustrating since the module already provides an implementation of the median. I would therefore be willing to submit a PR adding a `quantile` function to the module. The function would have the following signature:

```
# data -> your sequence
# p -> the quantile, between 0 & 1
def quantile(data, p): ...

# example
>>> quantile([1, 2, 3, 4, 5], 0.5)
3
```
This would also mean implementing, quite simply, `quantile_low` and `quantile_high` functions as counterparts to the module's `median_low` and `median_high`. I am unclear, however, on how a hypothetical `quantile_grouped` would be implemented.
What do you think?
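
For concreteness, here is a minimal sketch of what such functions could look like, assuming the simple "linear interpolation between closest ranks" rule (the choice of rule turns out to be the crux of the discussion below); the `quantile_low`/`quantile_high` variants mirror `median_low`/`median_high` by always returning actual data points:

```
from math import ceil, floor

def quantile(data, p):
    """p-quantile of data (0 <= p <= 1), interpolating linearly
    between the two nearest data points when p falls between them."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    xs = sorted(data)
    if not xs:
        raise ValueError("no quantile for empty data")
    h = (len(xs) - 1) * p          # fractional index into the sorted data
    lo, hi = floor(h), ceil(h)
    if lo == hi:
        return xs[lo]
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def quantile_low(data, p):
    """Like quantile(), but always return an actual data point
    (the lower candidate), mirroring median_low()."""
    xs = sorted(data)
    return xs[floor((len(xs) - 1) * p)]

def quantile_high(data, p):
    """Like quantile(), but return the higher candidate data point,
    mirroring median_high()."""
    xs = sorted(data)
    return xs[ceil((len(xs) - 1) * p)]
```

With this sketch, `quantile([1, 2, 3, 4, 5], 0.5)` returns 3 as in the example above, and at p = 0.5 the low/high variants agree with `median_low`/`median_high`.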

On Thu, Mar 15, 2018 at 12:39 PM, PLIQUE Guillaume <guillaumeplique@gmail.com> wrote:
This seems like a reasonable idea to me -- but be warned that there are actually quite a few slightly-different definitions of "quantile" in use. R supports 9 different methods of calculating quantiles (exposed via an interesting API: their quantile function takes a type= argument, which is an integer between 1 and 9; the default is 7). And there's currently an open issue at numpy discussing whether numpy implements the right approaches: https://github.com/numpy/numpy/issues/10736

So this would require some research to decide on which definition(s) you wanted to support.

-n

--
Nathaniel J. Smith -- https://vorpus.org
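
To illustrate how much those definitions can disagree, here is a sketch of the six "continuous" position formulas among R's nine types (types 4-9; types 1-3 are discontinuous and need a slightly different rule, so they are omitted). The helper name `hf_quantile` is made up for this example:

```
def hf_quantile(data, p, kind=7):
    """Sample quantile using the continuous Hyndman & Fan position
    formulas (types 4-9).  h is a 1-based fractional position into
    the sorted data; the answer interpolates between neighbours."""
    xs = sorted(data)
    n = len(xs)
    h = {4: n * p,
         5: n * p + 0.5,
         6: (n + 1) * p,
         7: (n - 1) * p + 1,
         8: (n + 1/3) * p + 1/3,
         9: (n + 1/4) * p + 3/8}[kind]
    h = min(max(h, 1), n)              # clamp to the ends of the data
    i = int(h)                         # floor, since h >= 1 here
    if i == n:
        return xs[-1]
    return xs[i - 1] + (h - i) * (xs[i] - xs[i - 1])
```

On `list(range(1, 11))` at p = 0.25, types 4 through 9 give 2.5, 3.0, 2.75, 3.25, ~2.917 and 2.9375 respectively: six defensible definitions, six different answers.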

Steven D'Aprano writes:
My take is "don't let the perfect be the enemy of the good." Quantiles are used a lot for "government work". Pick a definition that's good enough for that. Pick one that is commonly used and has nice invariance properties if any are applicable (e.g., quartile1(list) == quartile3(reversed(list)), although I'm not even sure that is appropriate!), document it carefully, and give (some) examples of the edge cases that affect comparability of the statistics module's computation to alternative formulae.

I'd like to see your list written up. To my mind, it would be of enough general interest that you could probably publish the results of your research in a political science or maybe psychology journal, or a statistics journal oriented to practitioners and/or educators[1]. I'm not suggesting you should do the work involved in actually submitting unless there's benefit to you in it, I'm just saying I think it's that interesting.

Regards,
Steve

Footnotes:
[1] I'd find it useful in talking to my students about trusting computers, for one thing.
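
Presumably the invariance gestured at here is about reflecting the *values* (reversing a list does not change its order statistics). A quick sanity check of that property, reusing the `quantile()` sketch from the opening message, purely for illustration:

```
# Reflection symmetry: the p-quantile of the data should mirror the
# (1 - p)-quantile of the negated data.
data = [2, 7, 1, 8, 2, 8]
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert quantile(data, p) == -quantile([-x for x in data], 1 - p)
```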

That's really interesting. I did not know there were so many ways to consider quantiles. Maybe we should indeed wait for numpy to take a decision on the matter and go with their default choice, so we remain consistent with the ecosystem?

2018-03-16 5:36 GMT+01:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:

PLIQUE Guillaume writes:
> Maybe we should indeed wait for numpy to take a decision on the
> matter and go with their default choice so we remain consistent
> with the ecosystem?
The example of R with 9 variants baked into one function suggests that numpy is unlikely to come up with a single "good" choice. If R's default is to Steven's taste, I would say go with that for cross-language consistency, and hope that numpy makes the same decision. In fact, I would argue that numpy might very well make a decision for a default that has nice mathematical properties, while the stdlib module might very well prefer consistency with R's default, since defaults will be used in the same kind of "good enough for government work" contexts in both languages.

The main thing is that they're all going to give similar results, and in most applications the data will be fuzzy (e.g., a sample, or subjective), so as long as the same version is accurately documented and used consistently across analyses that should be comparable, results will be sufficiently accurate, perfectly reproducible, and comparable.

For my purposes, there's no reason to wait. It's up to Steven, and I trust his taste.

Steve

On Fri, Mar 16, 2018 at 11:19 PM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
NumPy already has a default and supports a number of variants. I'd have to go digging to figure out which languages/tools use which methods and how those match to theoretical properties, but IIRC numpy, R, and matlab all have different defaults.

The 9 types that R supports come from a well-known review article (Hyndman & Fan, 1996). Their docs note that Hyndman & Fan's recommendation is different from the default, because the default was chosen to match a previous package (S) before they read Hyndman & Fan. It's all a bit messy.

None of this is to say that Python shouldn't have some way to compute quantiles, but unfortunately you're not going to find TOOWTDI.

-n

--
Nathaniel J. Smith -- https://vorpus.org
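
For reference, numpy's percentile API at the time of this thread exposes a handful of interpolation variants rather than the full family of nine types:

```
import numpy as np

data = np.arange(1, 11)
# The five interpolation modes numpy offers circa 2018
# (the default is "linear").
for mode in ("linear", "lower", "higher", "midpoint", "nearest"):
    print(mode, np.percentile(data, 25, interpolation=mode))
# linear 3.25, lower 3, higher 4, midpoint 3.5, nearest 3
```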

Since Python is not held to backwards compatibility with S, and for most datasets (and users) it doesn't matter much, why not go with the default recommended by Hyndman & Fan?

On Fri, Mar 16, 2018 at 11:48 PM, Nathaniel Smith <njs@pobox.com> wrote:
--
--Guido van Rossum (python.org/~guido)

[Guido]
> Since Python is not held to backwards compatibility with S, and for
> most datasets (and users) it doesn't matter much, why not go with the
> default recommended by Hyndman & Fan?
Here's Hyndman in 2016[1]:

"""
The main point of our paper was that statistical software should standardize the definition of a sample quantile for consistency. We listed 9 different methods that we found in various software packages, and argued for one of them (type 8). In that sense, the paper was a complete failure. No major software uses type 8 by default, and the diversity of definitions continues 20 years later.

In fact, the paper may have had the opposite effect to what was intended. We drew attention to the many approaches to computing sample quantiles and several software products added them all as options. Our own quantile function for R allows all 9 to be computed, and has type 7 as default (for backwards consistency – the price we had to pay to get R core to agree to include our function).
"""

Familiar & hilarious ;-)

[1] https://robjhyndman.com/hyndsight/sample-quantiles-20-years-later/

Hahaha, that Hyndman story will never get old.

FWIW, based on much informal polling, the most common intuition on the topic stems from elementary education: the median of an even-sized set is the mean of the two central values. So, linear-weighted averaging at discontinuities seems to be least surprising.

Whichever type is chosen, quantiles are often computed in sets. For instance, min/max/median, quartiles (+ interquartile range), and percentiles. Quantiles were one of the main reasons statsutils uses a class[1] to wrap datasets. Otherwise, there's a lot of work in re-sorting. All the galloping in the world isn't going to beat sorting once. :) Other calculations benefit from this cached approach, too. Variance is faster to calculate after calculating stddev, for instance, but if memory serves, quantiles are the most expensive for mid-sized datasets that don't call for pandas/numpy.

[1]: http://boltons.readthedocs.io/en/latest/statsutils.html#boltons.statsutils.S...

On Sat, Mar 17, 2018 at 9:28 AM, Tim Peters <tim.peters@gmail.com> wrote:
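
A toy sketch of that cached approach (not boltons' actual implementation, and the class name is made up): sort once on construction, after which every quantile query is just index arithmetic on the cached copy:

```
class DataStats:
    """Sort once, answer many quantile queries against the cached copy."""

    def __init__(self, data):
        self._sorted = sorted(data)

    def quantile(self, p):
        xs = self._sorted
        h = (len(xs) - 1) * p          # linear interpolation, as above
        i = int(h)
        return xs[i] if h == i else xs[i] + (h - i) * (xs[i + 1] - xs[i])

    def five_number_summary(self):
        # min, Q1, median, Q3, max -- five quantiles, one sort
        return tuple(self.quantile(p) for p in (0, 0.25, 0.5, 0.75, 1))
```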

[Guido]
> Since Python is not held to backwards compatibility with S, and for
> most datasets (and users) it doesn't matter much, why not go with the
> default recommended by Hyndman & Fan?
BTW, I should clarify that I agree! H&F didn't invent "method 8", or any of the other methods their paper named; they just evaluated 9 methods with a keen eye. Their case for method 8 being "the best" seems pretty clear: it satisfies at least as many desirable formal properties as the other serious candidates, and is apparently optimal in some technical senses (related to avoiding bias) among methods that can't assume anything about the underlying distribution.

For the median, it looks like method 8 reduces to the usual "return the mean of the two middle values" for an even number of data points, which is the only "intuition" anyone brings to this ;-)

So I'd make it the default, and add others later as options if there's enough screaming. It's the Right(est) Thing to Do.
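
That reduction is easy to verify: at p = 0.5 the type-8 position (n + 1/3)p + 1/3 simplifies to (n + 1)/2, which for even n sits exactly halfway between the two middle values. A self-contained check against statistics.median (the helper name is made up for this example):

```
from statistics import median

def type8_median(data):
    # H&F type 8 at p = 0.5: h = (n + 1/3)/2 + 1/3 = (n + 1)/2,
    # a 1-based position exactly between the two middle values when
    # n is even, and exactly on the middle value when n is odd.
    xs = sorted(data)
    h = (len(xs) + 1) / 2
    i = int(h)
    return xs[i - 1] if h == i else (xs[i - 1] + xs[i]) / 2

for n in (2, 3, 4, 5, 100):
    assert type8_median(range(n)) == median(range(n))
```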

On Fri, Mar 16, 2018 at 01:36:31PM +0900, Stephen J. Turnbull wrote:
> I'd like to see your list written up.
On checking my notes, there is considerable overlap in the numbers above (some calculation methods are equivalent to others), but overall I find a total of 16 distinct methods in use. Some of those are only suitable for generating quartiles. This should not be considered an exhaustive list, I may have missed some. Additions and corrections will be welcomed :-)

My major sources are Hyndman & Fan:

https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf

and Langford:

https://ww2.amstat.org/publications/jse/v14n3/langford.html

Langford concentrates on methods of calculating quartiles, while Hyndman & Fan consider more general quantile methods. Obviously if you have a general quantile method, you can use it to calculate quartiles.

I have compiled a summary in the following table. Reading across a row, the entries are the (usually numeric) labels or parameters used to specify a calculation method. Entries in the same column are the same calculation method regardless of the label. For example, what Hyndman & Fan call method 1, Langford calls method 15, and the SAS software uses a parameter of 3. The Excel QUARTILE function is equivalent to what H&F call method 7 and what Langford calls 12. You will need to use a monospaced font for the columns to line up.

H&F             1   2   3   4   5   6   7   8   9
Langford       15   4  14  13  10  11  12           1   2   5   6   9
Excel                                    Q
Excel 2010+                         QE  QI
JMP                              X
Maple           1   2   3   4   5   6   7   8
Mathematica                 AQ  MQ
Minitab                          X
R               1   2   3   4   5   6   7   8   9
S                                        X
SAS             3   5   2   1       4
SPSS                             X
TI calc                                         X

Notes:

X   Only calculation method used by the software.
Q   Excel QUARTILE function (pre 2010)
QE  Excel QUARTILE.EXC function
QI  Excel QUARTILE and QUARTILE.INC functions
AQ  Mathematica AsymmetricQuartiles function
MQ  Mathematica Quartiles function

Langford's 3 and 7 (not shown) are the same as his 1; his 8 (not shown) is the same as his 2.

Hyndman & Fan recommend method 8 as the best method for general quantiles. Langford (who has certainly read H&F) recommends his method 4, which is H&F's method 2, as the standard quartile. That is the same as the default used by SAS.

For what it's worth, the method taught in Australian high schools for calculating quartiles and interquartile range is Langford's method 2. That's the method that Texas Instruments calculators use.

I haven't personally confirmed all of the software equivalences, in particular I'm a bit dubious about the Maple methods. If anyone has access to Maple and doesn't mind running a few sample calculations for me, please contact me off-list.

--
Steve
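
If the descriptions above are read correctly, that schoolroom/TI quartile rule is the "median of each half, excluding the overall median when n is odd" recipe. A sketch, with a hypothetical helper name, just to pin down the behaviour:

```
from statistics import median

def quartiles_exclusive(data):
    """Quartiles by the rule described above: split the sorted data
    into lower and upper halves, excluding the middle value when n
    is odd, then take the median of each half."""
    xs = sorted(data)
    half = len(xs) // 2
    return median(xs[:half]), median(xs), median(xs[len(xs) - half:])

print(quartiles_exclusive([1, 2, 3, 4, 5]))   # (1.5, 3, 4.5)
```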

participants (7)

- Guido van Rossum
- Mahmoud Hashemi
- Nathaniel Smith
- PLIQUE Guillaume
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Peters