[Python-ideas] statistics module in Python3.4

Oscar Benjamin oscar.j.benjamin at gmail.com
Sat Feb 1 14:32:31 CET 2014


On 31 January 2014 03:47, Andrew Barnert <abarnert at yahoo.com> wrote:
> On Jan 30, 2014, at 17:32, Chris Angelico <rosuav at gmail.com> wrote:
>
>> On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>>> One of my aims is to avoid raising TypeError unnecessarily. The
>>> statistics module is aimed at casual users who may not understand, or
>>> care about, the subtleties of numeric coercions, they just want to take
>>> the average of two values regardless of what sort of number they are.
>>> But having said that, I realise that mixed-type arithmetic is difficult,
>>> and I've avoided documenting the fact that the module will work on mixed
>>> types.
>>
>> Based on the current docs and common sense, I would expect that
>> Fraction and Decimal should normally be there exclusively, and that
>> the only type coercions would be int->float->complex (because it makes
>> natural sense to write a list of "floats" as [1.4, 2, 3.7], but it
>> doesn't make sense to write a list of Fractions as [Fraction(1,2),
>> 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with
>> the other three types can be answered with "Well, you should be using
>> the same type everywhere". (Though it might be useful to allow
>> int->anything coercion, since that one's easy and safe.)
>
> Except that large enough int values lose information, and even larger ones raise an exception:
>
>     >>> float(pow(3, 50)) == pow(3, 50)
>     False
>     >>> float(1<<2000)
>     OverflowError: int too large to convert to float
>
> And that first one is the reason why statistics needs a custom sum in the first place.
>
> When there are only 2 types involved in the sequence, you get the answer you wanted. The only problem raised by the examples in this thread is that with 3 or more types that aren't all mutually coercible but do have a path through them, you can sometimes get imprecise answers and other times get exceptions, and you might come to rely on one or the other.
>
> So, rather than throwing out Stephen's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?

You're making this sound a lot more complicated than it is. The
problem is simple: Decimal doesn't integrate with the numeric tower.
This is explicit in the PEP that brought in the numeric tower:
http://www.python.org/dev/peps/pep-3141/#the-decimal-type

See also this thread (that I started during extensive off-list
discussions about the statistics.sum function with Steven):
https://mail.python.org/pipermail//python-ideas/2013-August/023034.html

Decimal makes the following concessions for mixing numeric types:
1) It will promote integers in arithmetic.
2) It will compare correctly against all numeric types (as long as
FloatOperation isn't trapped).
3) It will coerce int and float in its constructor.
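
For example, a quick illustration of those three points (just stdlib
behaviour, nothing from the statistics module):

>>> from decimal import Decimal
>>> from fractions import Fraction
>>> Decimal('1.5') + 2                 # ints are promoted in arithmetic
Decimal('3.5')
>>> Decimal('0.5') == Fraction(1, 2)   # comparisons work across types
True
>>> Decimal(3), Decimal(0.5)           # the constructor accepts int and float
(Decimal('3'), Decimal('0.5'))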

The recently added FloatOperation trap suggests that there's more
interest in prohibiting the mixing of Decimals with other numeric
types than in facilitating it. I can imagine ending up in that camp
myself: speaking as someone who finds uses for both the fractions
module and the decimal module, I feel qualified to say that there is
no good use case for mixing these types. Similarly, there's no good
use case for mixing floats with Fractions or Decimals, although mixing
float/Fraction does work. If you choose to use Decimals then it is
precisely because you do need to care about the numeric types you use
and the sort of accuracy they provide. If you find yourself mixing
Decimals with other numeric types then it's more likely a mistake/bug
than a convenience.
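
To show that asymmetry concretely:

>>> from decimal import Decimal
>>> from fractions import Fraction
>>> Fraction(1, 3) + 0.5               # float/Fraction works (result is a float)
0.8333333333333333
>>> Fraction(1, 3) + Decimal('0.5')    # Decimal/Fraction does not
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'Fraction' and 'decimal.Decimal'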

In any case the current implementation of statistics._sum (AIUI, I
don't have it to hand for testing) will do the right thing for any mix
of types in the numeric tower. It will also do the right thing for
Decimals: it will compute the exact result and then round once
according to the current decimal context. It's also possible to mix
int and Decimal but there's no sensible way to handle mixing Decimal
with anything else.
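
The idea is roughly this (only a sketch with a made-up name, not the
actual statistics._sum code, which tracks the types involved much more
carefully):

    from fractions import Fraction

    def exact_sum(data, T=float):
        # Sum exactly as Fractions; Fraction() accepts int, float and
        # Decimal without losing precision.
        total = sum(Fraction(x) for x in data)
        # Convert back to the target type T with a single division, so
        # the result is rounded (at most) once.
        return T(total.numerator) / total.denominator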

If there is to be a documented limitation on mixing types then it
should be explicitly about Decimal: the statistics module works very
well with Decimal but doesn't really support mixing Decimal with other
types. This is a limitation of Python rather than of the statistics
module itself. That being said, I think that guaranteeing an error is
better than the current order-dependent behaviour (and I agree that it
should be considered a bug).

If there is to be a more drastic rearrangement of the _sum function
then it should actually be to solve the problem that the current
implementation of mean, variance etc. uses Fractions for all the heavy
lifting but then rounds in the wrong place: when returning from
_sum(), rather than in the mean or variance function itself.

The clever algorithm in the variance function (unless it has changed
since I last looked) is entirely unnecessary when all of the intensive
computation is performed with exact arithmetic. In the absence of
rounding error you could compute a perfectly good variance using the
computational formula for variance in a single pass.
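
For instance, something along these lines (again only a sketch with a
made-up name, not the statistics implementation, which handles the
return type and edge cases more carefully):

    from fractions import Fraction

    def exact_variance(data, T=float):
        # Single pass: accumulate sum(x) and sum(x**2) exactly.
        n = 0
        s = ss = Fraction(0)
        for x in data:
            fx = Fraction(x)
            s += fx
            ss += fx * fx
            n += 1
        # Computational formula for the sample variance; it is safe here
        # because the arithmetic above is exact, so there is no
        # catastrophic cancellation.
        var = (ss - s * s / n) / (n - 1)
        # Round (at most) once when converting back to T.
        return T(var.numerator) / var.denominator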
Similarly, although the _sum() function is correctly rounded, the
mean() function calls _sum() and then rounds again, so the return
value from mean() is rounded twice. _sum() computes an exact value as
a fraction and then coerces it with

    return T(total_numerator) / total_denominator

so that the division causes it to be correctly rounded. However, the
mean function effectively ends up doing

    return (T(total_numerator) / total_denominator) / num_items

which uses 2 divisions and hence rounds twice. It's trivial to
rearrange that so that you round once

    return T(total_numerator) / (total_denominator * num_items)

except that to do this the _sum function should be changed to return
the exact result as a Fraction (and perhaps the type T). Similar
changes would need to be made to the sum of squares function (_ss()
IIRC). The double rounding in mean() isn't a big deal but the
corresponding effect for the variance functions is significant. It was
after realising this that the sum function was renamed _sum and made
nominally private.
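
To see the double rounding in action, here's a contrived example with
a deliberately tiny context precision (imagine an exact sum of 2/3
over 2 items, so the exact mean is 1/3):

>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 2
>>> (Decimal(2) / 3) / 2      # round in _sum(), then again in mean()
Decimal('0.34')
>>> Decimal(2) / (3 * 2)      # rearranged: one correctly rounded division
Decimal('0.33')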

To be clear, statistics.variance(list_of_decimals) is very accurate.
However, it uses more passes than necessary, and it can be inaccurate
when you have Decimals whose precision exceeds that of the current
decimal context, e.g.:

>>> import decimal
>>> d = decimal.Decimal('300000000000000000000000000000000000000000')
>>> d
Decimal('300000000000000000000000000000000000000000')
>>> d+1   # Any arithmetic operation loses precision
Decimal('3.000000000000000000000000000E+41')
>>> +d  # Use context precision
Decimal('3.000000000000000000000000000E+41')

If you're using Fractions for all of the computation then you can
avoid this, since no precision is lost when calling Fraction(Decimal):

>>> import fractions
>>> fractions.Fraction(d)+1
Fraction(300000000000000000000000000000000000000001, 1)


Oscar

