[Python-ideas] statistics module in Python3.4
Steven D'Aprano
steve at pearwood.info
Fri Jan 31 02:07:25 CET 2014
On Mon, Jan 27, 2014 at 09:41:02AM -0800, Wolfgang wrote:
> Dear all,
> I am still testing the new statistics module and I found two cases where the
> behavior of the module seems suboptimal to me.
> My most important concern is the module's internal _sum function and its
> implications, the other one about passing Counter objects to module
> functions.
As the author of the module, I'm also concerned with the internal _sum
function. That's why it's now a private function -- I originally
intended for it to be a public function (see PEP 450).
> As for the first subject:
> Specifically, I am not happy with the way the function handles different
> types. Currently _coerce_types gets called for every element in the
> function's input sequence and type conversion follows quite complicated
> rules, and - what is worse - makes the outcome of _sum() and thereby mean()
> dependent on the order of items in the input sequence, e.g.:
[...]
> (this is because when _sum iterates over the input, Fraction wins over
> int, then float wins over Fraction and over everything else that follows in
> the first example, but in the second case Fraction wins over int, but then
> Fraction vs Decimal is undefined and throws an error).
>
> Confusing, isn't it?
I don't think so. The idea is that _sum() ought to reflect the standard,
dare I say intuitive, behaviour of repeated application of the __add__
and __radd__ methods, as used by the plus operator. For example, int +
<any numeric type> coerces to the other numeric type. What else would
you expect?
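
To make that concrete (my own illustration of plain operator coercion,
not code from the module):

py> from fractions import Fraction
py> 1 + Fraction(1, 2)      # int + Fraction coerces to Fraction
Fraction(3, 2)
py> Fraction(1, 2) + 0.25   # Fraction + float coerces to float
0.75

whereas Fraction + Decimal raises TypeError, which is where the order
dependence you describe comes from.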
In mathematics the number 0.4 is the same whether you write it as 0.4,
2/5, 0.4+0j, [0; 2, 2] or any other notation you care to invent. (That
last one is a continued fraction.) In Python, the number 0.4 is
represented by a value and a type, and managing the coercion rules for
the different types can be fiddly and annoying. But they shouldn't be
*confusing* -- we have a numeric tower, and if I've written the code
correctly, the coercion rules ought to follow the tower as closely as
possible.
> So here's the code of the _sum function:
[...]
You should expect that to change, if for no other reason than
performance. At the moment, _sum is about two orders of magnitude
slower than the built-in sum. I think I can get it to about one order of
magnitude slower.
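
(If you want to check the gap on your own machine, something rough like
this will do; _sum is private, so treat it as a throwaway experiment and
expect the numbers to vary:)

import timeit

setup = "import statistics; data = [x * 0.1 for x in range(10000)]"
builtin = timeit.timeit("sum(data)", setup=setup, number=100)
private = timeit.timeit("statistics._sum(data)", setup=setup, number=100)
print("slowdown: %.0fx" % (private / builtin))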
> I think a much cleaner (and probably faster) implementation would be to
> gather first all the types in the input sequence, then decide what to
> return in an input order independent way. My tentative implementation:
[...]
Thanks for this. I will add that to my collection of alternate versions
of _sum.
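
For the archives, the general shape of that approach (my own rough
sketch, not the elided code above, and glossing over details such as
preserving Decimal results) looks something like:

from fractions import Fraction

def _sum_sketch(data):
    # Inspect the complete set of input types first, so the result
    # cannot depend on the order of the items.
    data = list(data)
    types = set(map(type, data))
    if len(types) == 1:
        # homogeneous input: the built-in sum already returns that type
        return sum(data)
    # mixed input: accumulate exactly as Fractions, convert once at the end
    total = sum(Fraction(x) for x in data)
    return float(total)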
> this leaves the re-implementation of _coerce_types. Personally, I'd prefer
> something as simple as possible, maybe even:
>
> def _coerce_types(types):
>     if len(types) == 1:
>         return next(iter(types))
>     return float
I don't want to coerce everything to float unnecessarily. Floats are, in
some ways, the worst choice for numeric values, at least from the
perspective of accuracy and correctness. Floats violate several of the
fundamental rules of mathematics, e.g. addition is not associative:
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
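
The accuracy point is just as easy to demonstrate (again plain Python,
not the module's code, but exact accumulation with a single rounding at
the end is exactly what _sum is doing internally):

py> sum([0.1] * 10) == 1.0      # repeated float additions accumulate error
False
py> from fractions import Fraction
py> float(sum(Fraction(0.1) for _ in range(10))) == 1.0   # exact sum, one rounding
True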
One of my aims is to avoid raising TypeError unnecessarily. The
statistics module is aimed at casual users who may not understand, or
care about, the subtleties of numeric coercions; they just want to take
the average of two values regardless of what sort of number they are.
But having said that, I realise that mixed-type arithmetic is difficult,
and I've avoided documenting the fact that the module will work on mixed
types.
[...]
> Now the second issue:
> It is maybe more a matter of taste and concerns the effects of passing a
> Counter() object to various functions in the module.
Interesting. If you think there is a use-case for passing Counters to
the statistics functions (weighted data?) then perhaps they can be
explicitly supported in 3.5. It's way too late for 3.4 to introduce new
functionality.
[...]
> From a quick look at the code you can see that mode actually converts your
> input to a Counter behind the scenes anyway, so it has no problem.
> mean and median, on the other hand, are simply iterating over their input,
> so if that input happens to be a mapping, they'll use just the keys.
Well yes :-)
I'm open to the suggestion that Counters should be treated specially.
Would you be so kind as to raise an issue in the bug tracker?
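
For anyone following along at home, the behaviour in question looks
roughly like this (my illustration, not from Wolfgang's message):

py> from collections import Counter
py> from statistics import mean, mode
py> data = [1, 1, 2]
py> mean(data)
1.3333333333333333
py> mean(Counter(data))   # iterates over the keys {1, 2} only
1.5
py> mode(Counter(data))   # mode copes, as noted above
1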
Thanks for the feedback,
--
Steven