Re: [Python-ideas] statistics module in Python3.4
Oscar Benjamin
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type
You're perfectly right about this as far as built-in number types and the standard library types Fraction and Decimal are concerned.
That being said I think that guaranteeing an error is better than the current order-dependent behaviour (and agree that that should be considered a bug).
For custom types, the type returned by _sum can also be order-dependent because of this part of _coerce_types:

    def _coerce_types(T1, T2):
        [..]
        if issubclass(T2, float): return T2
        if issubclass(T1, float): return T1
        # Subclasses of the same base class give priority to the second.
        if T1.__base__ is T2.__base__: return T2

I chose the more drastic example with Fraction and Decimal for my initial post because there the difference is between a result and an error, but the above may better illustrate why I said that the returned type of _sum is hard to predict.
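To make the order dependence concrete, here is a minimal sketch that reuses only the branches quoted above. The elided part of the function is left out and replaced by a TypeError fallback, which is an assumption, and the four subclasses are hypothetical names invented for the illustration. It also assumes that the elided code does not already decide these cases.

```python
from fractions import Fraction


def _coerce_types(T1, T2):
    # Only the branches quoted in this message; the elided part is omitted.
    if issubclass(T2, float):
        return T2
    if issubclass(T1, float):
        return T1
    # Subclasses of the same base class give priority to the second.
    if T1.__base__ is T2.__base__:
        return T2
    # Assumption for this sketch: anything else is rejected.
    raise TypeError("cannot coerce %s and %s" % (T1.__name__, T2.__name__))


# Hypothetical custom types, used only for illustration.
class FloatA(float):
    pass


class FloatB(float):
    pass


class FracA(Fraction):
    pass


class FracB(Fraction):
    pass


# The result depends on which type is seen second.
float_forward = _coerce_types(FloatA, FloatB)
float_reverse = _coerce_types(FloatB, FloatA)
frac_forward = _coerce_types(FracA, FracB)
frac_reverse = _coerce_types(FracB, FracA)
```

In both pairs the second argument wins, so the type that comes out of a pairwise reduction over the data depends on the order of the input.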
If there is to be a more drastic rearrangement of the _sum function then it should actually be to solve the problem that the current implementation of mean, variance etc. uses Fractions for all the heavy lifting but then rounds in the wrong place (when returning from _sum()) rather than in the mean, variance function itself.
This is an excellent remark and I agree absolutely with your point here. It's one of the aspects of the statistics module that I pondered over for weeks. Essentially, the fact that all current functions that rely on _sum round imprecisely anyway was my motivation for suggesting the simple:

    def _coerce_types(types):
        if len(types) == 1:
            return next(iter(types))
        return float

It certainly makes sense to return the type found in the input if there is only one, but with ambiguity, why make the effort of guessing when it does not help precision anyway? However, I realized that I probably rushed this because the implementation of functions that call _sum may change later to rely on an exact return value.
The clever algorithm in the variance function (unless it changed since I last looked) is entirely unnecessary when all of the intensive computation is performed with exact arithmetic. In the absence of rounding error you could compute a perfectly good variance using the computational formula for variance in a single pass. Similarly although the _sum() function is correctly rounded, the mean() function calls _sum() and then rounds again so that the return value from mean() is rounded twice. _sum() computes an exact value as a fraction and then coerces it with
return T(total_numerator) / total_denominator
so that the division causes it to be correctly rounded. However the mean function effectively ends up doing
return (T(total_numerator) / total_denominator) / num_items
which uses 2 divisions and hence rounds twice. It's trivial to rearrange that so that you round once
return T(total_numerator) / (total_denominator * num_items)
except that to do this the _sum function should be changed to return the exact result as a Fraction (and perhaps the type T). Similar changes would need to be made to the sum of squares function (_ss() IIRC). The double rounding in mean() isn't a big deal but the corresponding effect for the variance functions is significant. It was after realising this that the sum function was renamed _sum and made nominally private.
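The effect of rounding twice can be reproduced with Decimal at a deliberately tiny precision. The following is only a sketch of the two return expressions above, not the module's code: the function names are hypothetical, and the precision of 2 digits is chosen purely to make the difference visible.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def mean_round_twice(data, T=Decimal):
    # Exact sum as a Fraction, then two divisions: rounds twice.
    total = sum(Fraction(x) for x in data)
    rounded_sum = T(total.numerator) / total.denominator
    return rounded_sum / len(data)


def mean_round_once(data, T=Decimal):
    # Exact sum as a Fraction, then a single division: rounds once.
    total = sum(Fraction(x) for x in data)
    return T(total.numerator) / (total.denominator * len(data))


data = [Decimal("0.73"), Decimal("0.73")]
with localcontext() as ctx:
    ctx.prec = 2  # artificially low precision, for illustration only
    twice = mean_round_twice(data)  # 1.46 rounds to 1.5, then 1.5 / 2
    once = mean_round_once(data)    # 73 / 100 in one step
```

Here the exact sum 1.46 is first rounded to 1.5, so the twice-rounded mean comes out as 0.75, while the single division gives 0.73, which is the true mean of the data.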
I have been thinking about this solution as well, but I think you really have to return a tuple of the sum as a Fraction and the type (not perhaps) since it would be really weird if the public functions in statistics always return a Fraction even if the input sequence consisted of only one standard type like int, float or Decimal. The obvious criticism then is that such a _sum is not really a sum function anymore like the existing ones. Then again, since this is a module private function it may be ok to do this? Best, Wolfgang
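A minimal sketch of what such a tuple-returning helper could look like is given below. The name _exact_sum is hypothetical, and the sketch simply refuses mixed-type input because how to coerce mixed types is exactly the open question in this thread.

```python
from decimal import Decimal
from fractions import Fraction


def _exact_sum(data):
    # Hypothetical variant of _sum: return (T, exact total as a Fraction).
    types = set()
    total = Fraction(0)
    for x in data:
        types.add(type(x))
        total += Fraction(x)
    if not types:
        raise ValueError("no data")
    if len(types) != 1:
        # Assumption: this sketch only handles a single input type.
        raise TypeError("mixed input types are not handled in this sketch")
    return types.pop(), total


def mean(data):
    data = list(data)
    T, total = _exact_sum(data)
    # The only rounding step happens here, in the public function.
    return T(total.numerator) / (total.denominator * len(data))


int_mean = mean([1, 2, 3, 4])
dec_mean = mean([Decimal("0.73"), Decimal("0.73")])
frac_mean = mean([Fraction(1, 3), Fraction(2, 3)])
```

With this arrangement the public function still returns a Decimal for Decimal input and a Fraction for Fraction input, while int input yields a float from the true division.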
participants (1): Wolfgang Maier