[Python-Dev] PEP 450 adding statistics module

Fri Aug 16 11:51:56 CEST 2013

On 15/08/13 14:08, Steven D'Aprano wrote:
> On 15/08/13 21:42, Mark Dickinson wrote:
>> The PEP and code look generally good to me.
>>
>> I think the API for median and its variants deserves some wider discussion:
>> the reference implementation has a callable 'median', and variant callables
>> 'median.low', 'median.high', 'median.grouped'.  The pattern of attaching
>> the variant callables as attributes on the main callable is unusual, and
>> isn't something I've seen elsewhere in the standard library.  I'd like to
>> see some explanation in the PEP for why it's done this way.  (There was
>> already some discussion of this on the issue, but that was more centered
>> around the implementation than the API.)
>>
>> I'd propose two alternatives for this:  either have separate functions
>> 'median', 'median_low', 'median_high', etc., or have a single function
>> 'median' with a "method" argument that takes a string specifying
>> computation using a particular method.  I don't see a really good reason to
>> deviate from standard patterns here, and fear that users would find the
>> current API surprising.
>
> Alexander Belopolsky has convinced me (off-list) that my current implementation is better changed to a more conservative one of a callable singleton instance with methods implementing the alternative
> computations. I'll have something like:
>
>
> def _singleton(cls):
>      return cls()
>
>
> @_singleton
> class median:
>      def __call__(self, data):
>          ...
>      def low(self, data):
>          ...
>      ...
>
Horrible.

>
> In my earlier stats module, I had a single median function that took a argument to choose between alternatives. I called it "scheme":
>
> median(data, scheme="low")

What is wrong with this?
It's a perfect API; simple and self-explanatory.
median is a function in the mathematical sense and it should be a function in Python.

>
> R uses parameter called "type" to choose between alternate calculations, not for median as we are discussing, but for quantiles:
>
> quantile(x, probs ... type = 7, ...).
>
> SAS also uses a similar system, but with different numeric codes. I rejected both "type" and "method" as the parameter name since it would cause confusion with the usual meanings of those words. I
> eventually decided against this system for two reasons:

There are other words to choose from ;) "scheme" seems OK to me.

>
> - Each scheme ended up needing to be a separate function, for ease of both implementation and testing. So I had four private median functions, which I put inside a class to act as namespace and avoid
> polluting the main namespace. Then I needed a "master function" to select which of the methods should be called, with all the additional testing and documentation that entailed.
>
> - The API doesn't really feel very Pythonic to me. For example, we write:
>
> mystring.rjust(width)
> dict.items()
These are methods on objects, the result of these calls depends on the value of 'self' argument, not merely its class. No so with a median singleton.

We also have len(seq) and copy.copy(obj)
No classes required.

>
> rather than mystring.justify(width, "right") or dict.iterate("items"). So I think individual methods is a better API, and one which is more familiar to most Python users. The only innovation (if
> that's what it is) is to have median a callable object.
>
>
> As far as having four separate functions, median, median_low, etc., it just doesn't feel right to me. It puts four slight variations of the same function into the main namespace, instead of keeping
> them together in a namespace. Names like median_low merely simulates a namespace with pseudo-methods separated with underscores instead of dots, only without the advantages of a real namespace.
>
> (I treat variance and std dev differently, and make the sample and population forms separate top-level functions rather than methods, simply because they are so well-known from scientific calculators
> that it is unthinkable to me to do differently. Whenever I use numpy, I am surprised all over again that it has only a single variance function.)
>
>
>