[Python-ideas] statistics module in Python3.4
Steven D'Aprano
steve at pearwood.info
Fri Jan 31 02:07:25 CET 2014
On Mon, Jan 27, 2014 at 09:41:02AM -0800, Wolfgang wrote:
> Dear all,
> I am still testing the new statistics module and I found two cases where the
> behavior of the module seems suboptimal to me.
> My most important concern is the module's internal _sum function and its
> implications, the other one about passing Counter objects to module
> functions.
As the author of the module, I'm also concerned with the internal _sum
function. That's why it's now a private function -- I originally
intended for it to be a public function (see PEP 450).
> As for the first subject:
> Specifically, I am not happy with the way the function handles different
> types. Currently _coerce_types gets called for every element in the
> function's input sequence and type conversion follows quite complicated
> rules, and - what is worse - makes the outcome of _sum() and thereby mean()
> dependent on the order of items in the input sequence, e.g.:
[...]
> (this is because when _sum iterates over the input, Fraction wins over
> int, then float wins over Fraction and over everything else that follows in
> the first example, but in the second case Fraction wins over int, but then
> Fraction vs Decimal is undefined and throws an error).
>
> Confusing, isn't it?
I don't think so. The idea is that _sum() ought to reflect the standard,
dare I say intuitive, behaviour of repeated application of the __add__
and __radd__ methods, as used by the plus operator. For example, int +
<any numeric type> coerces to the other numeric type. What else would
you expect?
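
To make that concrete (my own illustration of plain operator coercion,
not code from the module):

py> from fractions import Fraction
py> 1 + Fraction(1, 2)      # int + Fraction coerces to Fraction
Fraction(3, 2)
py> Fraction(1, 2) + 0.25   # Fraction + float coerces to float
0.75

whereas Fraction + Decimal raises TypeError, which is where the order
dependence you describe comes from.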
In mathematics the number 0.4 is the same whether you write it as 0.4,
2/5, 0.4+0j, [0; 2, 2] or any other notation you care to invent. (That
last one is a continued fraction.) In Python, the number 0.4 is
represented by a value and a type, and managing the coercion rules for
the different types can be fiddly and annoying. But they shouldn't be
*confusing* -- we have a numeric tower, and if I've written the code
correctly, the coercion rules ought to follow the tower as closely as
possible.
> So here's the code of the _sum function:
[...]
You should expect that to change, if for no other reason than
performance. At the moment, _sum is about two orders of magnitude
slower than the built-in sum. I think I can get it to about one order of
magnitude slower.
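
(If you want to check the gap on your own machine, something rough like
this will do; _sum is private, so treat it as a throwaway experiment and
expect the numbers to vary:)

import timeit

setup = "import statistics; data = [x * 0.1 for x in range(10000)]"
builtin = timeit.timeit("sum(data)", setup=setup, number=100)
private = timeit.timeit("statistics._sum(data)", setup=setup, number=100)
print("slowdown: %.0fx" % (private / builtin))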
> I think a much cleaner (and probably faster) implementation would be to
> gather first all the types in the input sequence, then decide what to
> return in an input order independent way. My tentative implementation:
[...]
Thanks for this. I will add that to my collection of alternate versions
of _sum.
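
For the archives, the general shape of that approach (my own rough
sketch, not the elided code above, and glossing over details such as
preserving Decimal results) looks something like:

from fractions import Fraction

def _sum_sketch(data):
    # Inspect the complete set of input types first, so the result
    # cannot depend on the order of the items.
    data = list(data)
    types = set(map(type, data))
    if len(types) == 1:
        # homogeneous input: the built-in sum already returns that type
        return sum(data)
    # mixed input: accumulate exactly as Fractions, convert once at the end
    total = sum(Fraction(x) for x in data)
    return float(total)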
> this leaves the re-implementation of _coerce_types. Personally, I'd prefer
> something as simple as possible, maybe even:
>
> def _coerce_types(types):
>     if len(types) == 1:
>         return next(iter(types))
>     return float
I don't want to coerce everything to float unnecessarily. Floats are, in
some ways, the worst choice for numeric values, at least from the
perspective of accuracy and correctness. Floats violate several of the
fundamental rules of mathematics, e.g. addition is not associative:
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
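
The accuracy point is just as easy to demonstrate (again plain Python,
not the module's code, but exact accumulation with a single rounding at
the end is exactly what _sum is doing internally):

py> sum([0.1] * 10) == 1.0      # repeated float additions accumulate error
False
py> from fractions import Fraction
py> float(sum(Fraction(0.1) for _ in range(10))) == 1.0   # exact sum, one rounding
True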
One of my aims is to avoid raising TypeError unnecessarily. The
statistics module is aimed at casual users who may not understand, or
care about, the subtleties of numeric coercions; they just want to take
the average of two values regardless of what sort of number they are.
But having said that, I realise that mixed-type arithmetic is difficult,
and I've avoided documenting the fact that the module will work on mixed
types.
[...]
> Now the second issue:
> It is maybe more a matter of taste and concerns the effects of passing a
> Counter() object to various functions in the module.
Interesting. If you think there is a use-case for passing Counters to
the statistics functions (weighted data?) then perhaps they can be
explicitly supported in 3.5. It's way too late for 3.4 to introduce new
functionality.
[...]
> From a quick look at the code you can see that mode actually converts your
> input to a Counter behind the scenes anyway, so it has no problem.
> mean and median, on the other hand, are simply iterating over their input,
> so if that input happens to be a mapping, they'll use just the keys.
Well yes :-)
I'm open to the suggestion that Counters should be treated specially.
Would you be so kind as to raise an issue in the bug tracker?
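
For anyone following along at home, the behaviour in question looks
roughly like this (my illustration, not from Wolfgang's message):

py> from collections import Counter
py> from statistics import mean, mode
py> data = [1, 1, 2]
py> mean(data)
1.3333333333333333
py> mean(Counter(data))   # iterates over the keys {1, 2} only
1.5
py> mode(Counter(data))   # mode copes, as noted above
1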
Thanks for the feedback,
--
Steven