[Python-ideas] thoughts on the new 3.4 statistics module
Steven D'Aprano
steve at pearwood.info
Wed Dec 25 01:47:25 CET 2013
Hi Wolfgang, and thanks for the feedback! My responses below.
On Sat, Dec 21, 2013 at 02:29:14PM -0800, Wolfgang wrote:
> First: I am not entirely convinced by when the module raises Errors. In
> some places it's undoubtedly justified to raise StatisticsError (like when
> empty sequences are passed to mean()).
> On the other hand, should there really be an error, when for example no
> unique value for the mode can be found?
There was no agreement on the best way to handle data with multiple
modes, so we went with the simplest version that could work. It's easier
to add functionality to the standard library than to take it away:
better to delay putting something in for a release or two, than to put
it in and then be stuck with the consequences of a poor decision for
years.
An earlier version of statistics.py included a mode function that let
you specify the maximum number of modes. That function may eventually be
added to the module, or made available on PyPI. The version included in
the standard library implements the basic, school-book version of mode:
it returns the one unique mode, as calculated by counting distinct
values, or it fails, and the most Pythonic way to implement failure is
with an exception.
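To illustrate, with the 3.4 version of the module:

    from statistics import mode, StatisticsError

    mode([1, 1, 2, 3])     # -> 1, the unique mode
    mode([1, 1, 2, 2, 3])  # raises StatisticsError: no unique mode

(The exact wording of the exception message aside, the point is that a
tie fails loudly rather than the function guessing on your behalf.)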
> Effectively, that would force users to guard every (!) call to the function
> with try/except.
No different from any other function. If you think a function might
fail, then you guard it with try...except.
> In my opinion, a better choice would be to return
> float('nan') or even better a module-specific object (call it Undefined or
> something) that one can check for. This behavior could, in general, be
> implemented for cases where input can actually be handled and a result be
> calculated (like a list of values in the mode example), but this result is
> considered "undefined" by the algorithm.
You can easily get that behaviour with a simple wrapper function:
    from statistics import mode, StatisticsError

    def my_mode(values):
        try:
            return mode(values)
        except StatisticsError:
            return float('nan')
But I'm not convinced that this is appropriate for nominal data. Would
you expect that the mode of ['red', 'blue', 'green'] should be a
floating point NAN? I know I wouldn't.
> Second: I am not entirely happy with the three different flavors of the
> median function. I *do* know that this has been discussed before, but I'm
> not sure whether *all* alternatives have been considered (the PEP only
> talks about the median.low, median.high syntax, which, in fact, I wouldn't
> like that much either). My suggestion would be to have a resolve parameter,
> by which the behavior of a single median function can be modified.
For median, I don't believe this is appropriate. As a general rule, if a
function has a parameter which is usually called with a constant known
when you write the source code:
    median(data, resolve='middle')  # resolve is known at edit-time
especially if that parameter takes only two or three values, then the
function probably should be split into two or three separate functions.
I don't think that there are any common use-cases for selecting the type
of median at *runtime*:
    kind = get_median_kind()
    median(data, resolve=kind)
but if you can think of any, I'd like to hear them.
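(Even if such a use-case did turn up, a thin dispatch wrapper over the
separate functions would cover it. A purely illustrative sketch, using
the median, median_low and median_high functions the module already
provides; the name median_by_kind is made up for the example:

    from statistics import median, median_low, median_high

    _MEDIANS = {'middle': median, 'low': median_low, 'high': median_high}

    def median_by_kind(data, kind='middle'):
        # Pick the median flavour from a value known only at runtime.
        return _MEDIANS[kind](data)

That keeps the runtime selection out of the library's API.)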
However, your general suggestion isn't entirely inappropriate. In my
research, I learned that there are at least fifteen different
definitions of quartiles in common use, although some are mathematically
equivalent. See here:
http://www.amstat.org/publications/jse/v14n3/langford.html
I find six distinct definitions for quartiles, and ten for quantiles/
fractiles. R supports nine different quantile versions, Haskell six, and
SAS also supports multiple versions. (I don't remember how many.)
Mathematica provides a four-argument parameterized version of Quantile.
With six distinct versions of quartile, and ten of quantile, it's too
many to provide separate functions for each: too much duplication, too
much clutter. Most people won't care which quantile they get, so there
ought to be a sensible default. For those who care about matching some
particular version (say, that used by Excel, or that used by Texas
Instruments calculators), there ought to be a parameter that allows you
to select which version is used. R calls this parameter "type". I don't
remember what SAS and Haskell call it, but the term I prefer is
"scheme".
I don't know if statistics.py will ever gain a function for calculating
quantiles other than the median. I will probably put quantiles and
quartiles on PyPI first, and if I do, I will follow your suggestion to
provide a parameter to select the version used (although I'll probably
call it "scheme" rather than "resolve").
--
Steven