thoughts on the new 3.4 statistics module

First of all: thank you, Steven and everyone else involved, for taking on the task of starting to implement this long-missed (at least by me) feature! I really hope the module will be a success and grow over time. I have two thoughts at the moment about the implementation that I think may be worth discussing, if that hasn't happened yet (I have to admit I did not go through all previous posts on this topic, only read the PEP).

First: I am not entirely convinced by when the module raises errors. In some places it's undoubtedly justified to raise StatisticsError (like when an empty sequence is passed to mean()). On the other hand, should there really be an error when, for example, no unique value for the mode can be found? Effectively, that would force users to guard every (!) call to the function with try/except. In my opinion, a better choice would be to return float('nan'), or better yet a module-specific object (call it Undefined or something) that one can check for. This behavior could, in general, be implemented for cases where the input can actually be handled and a result calculated (like a list of values in the mode example), but the result is considered "undefined" by the algorithm.

Second: I am not entirely happy with the three different flavors of the median function. I *do* know that this has been discussed before, but I'm not sure whether *all* alternatives have been considered (the PEP only talks about the median.low, median.high syntax, which, in fact, I wouldn't like that much either). My suggestion would be to have a resolve parameter, by which the behavior of a single median function can be modified. My main argument is that, as the module grows in the future, there will be many more situations in which different ways of calculating a statistic are all perfectly acceptable and you would want to leave the choice to the user. (The mode function can already be considered an example: maybe the user would want the list of "modes" returned when no unambiguous value can be calculated; actually, the current code seems to be prepared for a later implementation of this feature, because it does generate the list, it just does not return it.)

Now if, in all such situations, the solution is to add extra functions, the module will soon end up completely cluttered with them. If, on the other hand, every function that will foreseeably have to handle ambiguous situations had a resolve parameter, the module structure would be much clearer. In the median example, you would call median(data) for the default behavior, arguably the interpolation, but median(data, resolve='low') or median(data, resolve='high') for the alternative calculations. Statistically educated users could then guess, relatively easily, which functions have the resolve parameter, and a quick look at a function's help would tell them which arguments are accepted.

Finally, let me just point out that these really are just first thoughts, and I do understand that these are design decisions about which different people will have different opinions. But I think now is still a good time to discuss them; with an established and (hopefully :) ) much larger module you will not be able to change things that easily anymore.

Hoping for a lively discussion,
Wolfgang
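P.S.: To make the suggestion a bit more concrete, here is a rough sketch of what I have in mind. The names, the resolve values and the Undefined object are just illustrations, of course, not working code from the module:

    # Illustrative sketch only -- none of these names exist in statistics.py.
    from collections import Counter
    from statistics import StatisticsError

    class _UndefinedType:
        """Marker for results the algorithm considers undefined."""
        def __repr__(self):
            return 'Undefined'

    Undefined = _UndefinedType()

    def median(data, resolve='interpolate'):
        data = sorted(data)
        n = len(data)
        if n == 0:
            raise StatisticsError('no median for empty data')  # justified here
        i = n // 2
        if n % 2 == 1:
            return data[i]
        if resolve == 'low':
            return data[i - 1]              # lower of the two middle values
        if resolve == 'high':
            return data[i]                  # higher of the two middle values
        return (data[i - 1] + data[i]) / 2  # default: interpolate

    def mode(data, resolve='unique'):
        counts = Counter(data).most_common()
        if not counts:
            raise StatisticsError('no mode for empty data')
        modes = [v for v, c in counts if c == counts[0][1]]
        if resolve == 'list':
            return modes                    # all equally common values
        return modes[0] if len(modes) == 1 else Undefined  # no exception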

Hi!

On Sat, Dec 21, 2013 at 02:29:14PM -0800, Wolfgang <wolfgang.maier@biologie.uni-freiburg.de> wrote:
> Effectively, that would force users to guard every (!) call to the
> function with try/except.

Not necessary. The user of the library can combine a few calls in a function/method and catch one exception for the entire calculation. Or catch it even higher up the stack.
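For example, one try/except can cover an entire calculation (summarise() below is made up just to illustrate the point; the mode() behaviour is that of the new 3.4 module):

    from statistics import mean, median, mode, StatisticsError

    def summarise(data):
        # One handler guards several statistics calls at once, instead of
        # checking a special return value after every single call.
        try:
            return mean(data), median(data), mode(data)
        except StatisticsError:
            return None

    summarise([1, 2, 2, 3])   # (mean, median, mode) of the data
    summarise([1, 2, 3, 4])   # None: mode() finds no unique mode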
> In my opinion, a better choice would be to return float('nan'), or
> better yet a module-specific object (call it Undefined or something)
> that one can check for.

With such special values the user must check every return value. What is the advantage over catching exceptions?

Oleg.
-- 
Oleg Broytman            http://phdru.name/            phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

Hi Wolfgang, and thanks for the feedback! My responses below.

On Sat, Dec 21, 2013 at 02:29:14PM -0800, Wolfgang wrote:
> On the other hand, should there really be an error when, for example,
> no unique value for the mode can be found?
There was no agreement on the best way to handle data with multiple modes, so we went with the simplest version that could work. It's easier to add functionality to the standard library than to take it away: better to delay putting something in for a release or two than to put it in and then be stuck with the consequences of a poor decision for years.

An earlier version of statistics.py included a mode function that let you specify the maximum number of modes. That function may eventually be added to the module, or made available on PyPI. The version included in the standard library implements the basic, school-book version of mode: it returns the one unique mode, as calculated by counting distinct values, or it fails, and the most Pythonic way to implement failure is with an exception.
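Sketched out in pure Python, that school-book behaviour looks roughly like this (an illustration of the semantics described above, not the actual implementation):

    from collections import Counter
    from statistics import StatisticsError

    def schoolbook_mode(data):
        # Count distinct values and keep every value tied for most common.
        counts = Counter(data).most_common()
        if not counts:
            raise StatisticsError('no mode for empty data')
        modes = [value for value, n in counts if n == counts[0][1]]
        if len(modes) != 1:
            raise StatisticsError(
                'no unique mode; found %d equally common values' % len(modes))
        return modes[0]

    schoolbook_mode(['red', 'blue', 'blue'])   # -> 'blue'
    schoolbook_mode(['red', 'blue'])           # raises StatisticsError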
> Effectively, that would force users to guard every (!) call to the
> function with try/except.

No different from any other function. If you think a function might fail, then you guard it with try...except.
> In my opinion, a better choice would be to return float('nan'), or
> better yet a module-specific object (call it Undefined or something)
> that one can check for.

You can easily get that behaviour with a simple wrapper function:

    def my_mode(values):
        try:
            return mode(values)
        except StatisticsError:
            return float('nan')

But I'm not convinced that this is appropriate for nominal data. Would you expect that the mode of ['red', 'blue', 'green'] should be a floating point NAN? I know I wouldn't.
> My suggestion would be to have a resolve parameter, by which the
> behavior of a single median function can be modified.

For median, I don't believe this is appropriate. As a general rule, if a function has a parameter which is usually called with a constant known when you write the source code:

    median(data, resolve='middle')  # resolve is known at edit-time

especially if that parameter takes only two or three values, then the function probably should be split into two or three separate functions. I don't think that there are any common use-cases for selecting the type of median at *runtime*:

    kind = get_median_kind()
    median(data, resolve=kind)

but if you can think of any, I'd like to hear them.

However, your general suggestion isn't entirely inappropriate. In my research, I learned that there are at least fifteen different definitions of quartiles in common use, although some are mathematically equivalent. See here:

http://www.amstat.org/publications/jse/v14n3/langford.html

I find six distinct definitions for quartiles, and ten for quantiles/fractiles. R supports nine different quantile versions, Haskell six, and SAS also supports multiple versions (I don't remember how many). Mathematica provides a four-argument parameterized version of Quantile.

With six distinct versions of quartile, and ten of quantile, it's too many to provide separate functions for each: too much duplication, too much clutter. Most people won't care which quantile they get, so there ought to be a sensible default. For those who care about matching some particular version (say, that used by Excel, or that used by Texas Instruments calculators), there ought to be a parameter that allows you to select which version is used. R calls this parameter "type". I don't remember what SAS and Haskell call it, but the term I prefer is "scheme".

I don't know if statistics.py will ever gain a function for calculating quantiles other than the median. I will probably put quantiles and quartiles on PyPI first, and if I do, I will follow your suggestion to provide a parameter to select the version used (although I'll probably call it "scheme" rather than "resolve").

-- 
Steven
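To give a sense of what such a "scheme" parameter could look like, here is a hypothetical quantile function. The two schemes loosely follow R's type 1 and type 7 definitions; nothing below is actual statistics.py code:

    import math

    def quantile(data, p, scheme=7):
        # Hypothetical API sketch; scheme numbers borrowed from R's "type".
        if not 0 <= p <= 1:
            raise ValueError('p must be between 0 and 1')
        x = sorted(data)
        n = len(x)
        if n == 0:
            raise ValueError('no quantile for empty data')
        if scheme == 1:                 # inverse of the empirical CDF
            return x[max(math.ceil(n * p) - 1, 0)]
        if scheme == 7:                 # linear interpolation, R's default
            h = (n - 1) * p
            i = math.floor(h)
            j = min(i + 1, n - 1)
            return x[i] + (h - i) * (x[j] - x[i])
        raise ValueError('unknown scheme: %r' % (scheme,))

    quantile([1, 2, 3, 4], 0.5)             # 2.5, matching median()
    quantile([1, 2, 3, 4], 0.5, scheme=1)   # 2, matching median_low()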
