[Python-ideas] statistics module in Python3.4

Chris Angelico rosuav at gmail.com
Fri Jan 31 05:36:16 CET 2014


On Fri, Jan 31, 2014 at 2:47 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> Based on the current docs and common sense, I would expect that
>> Fraction and Decimal should normally be there exclusively, and that
>> the only type coercions would be int->float->complex (because it makes
>> natural sense to write a list of "floats" as [1.4, 2, 3.7], but it
>> doesn't make sense to write a list of Fractions as [Fraction(1,2),
>> 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with
>> the other three types can be answered with "Well, you should be using
>> the same type everywhere". (Though it might be useful to allow
>> int->anything coercion, since that one's easy and safe.)
>
> Except that large enough int values lose information, and even larger ones raise an exception:
>
>     >>> float(pow(3, 50)) == pow(3, 50)
>     False
>     >>> float(1<<2000)
>     OverflowError: int too large to convert to float
>
> And that first one is the reason why statistics needs a custom sum in the first place.

I don't think it'd be possible to forbid int -> float coercion - the
Python community (and Steven himself) would raise an outcry. But
int->float is at least as safe as it's fundamentally possible to be.
Adding ".0" to the end of a literal (thus making it a float literal)
is, AFAIK, absolutely identical to wrapping it in "float(" and ")".
That's NOT true of float -> Fraction or float -> Decimal - going via
float will cost precision, but going via int ought to be safe.

>>> float(pow(3,50)) == pow(3.0,50)
True

The difference between int and any other type is going to be pretty
much the same whether you convert first or convert last. The only
distinction that I can think of is floating-point rounding errors,
which are already dealt with:

>>> statistics._sum([pow(2.0,53),1.0,1.0,1.0])
9007199254740996.0
>>> sum([pow(2.0,53),1.0,1.0,1.0])
9007199254740992.0

Since it handles this correctly with all floats, it'll handle it just
fine with some ints and some floats:

>>> sum([pow(2,53),1,1,1.0])
9007199254740996.0
>>> statistics._sum([pow(2,53),1,1,1.0])
9007199254740996.0
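
As I understand it (this is a rough sketch, not the real implementation - statistics._sum also tracks the result type and treats Decimal specially), the trick is to accumulate exactly as Fractions and round only once, at the end:

```python
from fractions import Fraction

def exact_sum(data):
    # Accumulate in exact rational arithmetic, then convert once at the
    # end, so intermediate additions never drop the low-order bits.
    return float(sum(Fraction(x) for x in data))

print(exact_sum([2.0**53, 1.0, 1.0, 1.0]))  # 9007199254740996.0
print(sum([2.0**53, 1.0, 1.0, 1.0]))        # 9007199254740992.0
```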

In this case, the builtin sum() happens to be correct, because it adds
the first ones as ints, and then converts to float at the end. Of
course, "correct" isn't quite correct - the true value based on real
number arithmetic is ...95, as can be seen in Python if they're all
ints. But I'm defining "correct" as "the same result that would be
obtained by calculating in real numbers and then converting to the
data type of the end result". And by that definition, builtin sum() is
correct as long as the float is right at the end, and
statistics._sum() is correct regardless of the order.

>>> statistics._sum([1.0,pow(2,53),1,1])
9007199254740996.0
>>> sum([1.0,pow(2,53),1,1])
9007199254740992.0
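
For what it's worth, for all-float data the stdlib already offers a correctly rounded sum in math.fsum, and it agrees with statistics._sum here regardless of ordering:

```python
import math

# math.fsum tracks partial sums exactly, so the result is the correctly
# rounded float sum whatever order the values arrive in:
print(math.fsum([1.0, 2.0**53, 1.0, 1.0]))  # 9007199254740996.0
print(math.fsum([2.0**53, 1.0, 1.0, 1.0]))  # 9007199254740996.0
```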

So in that sense, it's "safe" to cast every int to float when the result
is going to be float - except when an individual value is itself too big
to convert, yet the final result (thanks to negative values cancelling
it out) would have been representable. I'm not sure how that's handled
internally, but this particular case works:

>>> statistics._sum([1.0,1<<2000,0-(1<<2000)])
1.0
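
Which makes sense if the summation is exact: converting each item to float up front would raise, since 1<<2000 overflows a double, but summing exactly first sidesteps the overflow entirely (a sketch of that idea):

```python
from fractions import Fraction

# Eager per-item conversion fails on the huge intermediate value:
try:
    float(1 << 2000)
except OverflowError as e:
    print(e)  # int too large to convert to float

# Exact accumulation first, single conversion at the end, succeeds:
total = sum(Fraction(x) for x in [1.0, 1 << 2000, -(1 << 2000)])
print(float(total))  # 1.0
```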

The biggest problem, then, is cross-casting between float, Fraction,
and Decimal. And anyone who's mixing those is asking for trouble
already.
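
For example (illustrative only): Fraction happily coerces to float, but Fraction and Decimal don't interoperate at all:

```python
from decimal import Decimal
from fractions import Fraction

# Fraction + float silently coerces to float (precision-losing in general):
print(Fraction(1, 2) + 0.5)  # 1.0

# Fraction + Decimal is simply unsupported:
try:
    Fraction(1, 2) + Decimal("0.5")
except TypeError:
    print("Fraction + Decimal raises TypeError")
```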

ChrisA
