On 31 January 2014 03:47, Andrew Barnert email@example.com wrote:
On Jan 30, 2014, at 17:32, Chris Angelico firstname.lastname@example.org wrote:
On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano email@example.com wrote:
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions; they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
Except that large enough int values lose information, and even larger ones raise an exception:
>>> float(pow(3, 50)) == pow(3, 50)
False
>>> float(1<<2000)
OverflowError: int too large to convert to float
And that first one is the reason why statistics needs a custom sum in the first place.
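A minimal sketch of that point, using made-up data and plain Fractions rather than the statistics module's actual summation code: the built-in sum coerces the large int to float and silently drops a small term, while summing Fractions stays exact.

```python
from fractions import Fraction

# Hypothetical data: a large int mixed with a small float.
data = [3**50, 1.0, -3**50]

# The built-in sum converts 3**50 to float once the 1.0 is added,
# so the 1.0 is lost to rounding and the two large terms then cancel.
naive_total = sum(data)

# Fraction(int) and Fraction(float) are both lossless conversions,
# so this total is exact.
exact_total = sum(Fraction(x) for x in data)

print(naive_total)  # 0.0
print(exact_total)  # 1
```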
When there are only 2 types involved in the sequence, you get the answer you wanted. The only problem raised by the examples in this thread is that with 3 or more types that aren't all mutually coercible but do have a path through them, you can sometimes get imprecise answers and other times get exceptions, and you might come to rely on one or the other.
So, rather than throwing out Steven's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type
See also this thread (that I started during extensive off-list discussions about the statistics.sum function with Steven): https://mail.python.org/pipermail//python-ideas/2013-August/023034.html
Decimal makes the following concessions for mixing numeric types:
1) It will promote integers in arithmetic.
2) It will compare correctly against all numeric types (as long as FloatOperation isn't trapped).
3) It will coerce int and float in its constructor.
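The three concessions, and what falls outside them, can be checked with a short script. This is an illustrative sketch; the variable names and the helper raises_type_error are hypothetical and only the decimal and fractions modules are used.

```python
import decimal
from decimal import Decimal
from fractions import Fraction

# 1) Integers are promoted in arithmetic.
int_sum = Decimal("1.5") + 2            # Decimal('3.5')

# 2) Comparisons against floats are exact.
half_equal = Decimal("0.5") == 0.5      # True: 0.5 is exactly representable
tenth_equal = Decimal("0.1") == 0.1     # False: the float 0.1 is not exactly 1/10

# 3) The constructor accepts int and float and converts them exactly.
from_float = Decimal(0.5)               # Decimal('0.5')


def raises_type_error(other):
    """Return True if Decimal('1.5') + other raises TypeError."""
    try:
        Decimal("1.5") + other
    except TypeError:
        return True
    return False


# Arithmetic with float or Fraction is outside the concessions.
float_refused = raises_type_error(0.5)
fraction_refused = raises_type_error(Fraction(1, 2))

# With FloatOperation trapped, the constructor rejects floats as well.
with decimal.localcontext() as ctx:
    ctx.traps[decimal.FloatOperation] = True
    try:
        Decimal(0.5)
        constructor_trapped = False
    except decimal.FloatOperation:
        constructor_trapped = True
```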
The recently added FloatOperation trap suggests that there's more interest in prohibiting the mixing of Decimals with other numeric types than in facilitating it. I can imagine ending up in that camp myself: speaking as someone who finds uses for both the fractions module and the decimal module, I feel qualified to say that there is no good use case for mixing these types. Similarly there's no good use case for mixing floats with Fractions or Decimals, although mixing float/Fraction does work. If you choose to use Decimals then it is precisely because you do need to care about the numeric types you use and the sort of accuracy they provide. If you find yourself mixing Decimals with other numeric types then it's more likely a mistake/bug than a convenience.
In any case the current implementation of statistics._sum (AIUI, I don't have it to hand for testing) will do the right thing for any mix of types in the numeric tower. It will also do the right thing for Decimals: it will compute the exact result and then round once according to the current decimal context. It's also possible to mix int and Decimal but there's no sensible way to handle mixing Decimal with anything else.
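As a rough illustration of "compute the exact result and then round once", here is a sketch that is not the real statistics._sum: the function name exact_decimal_sum and the sample values are assumptions. It compares against ordinary Decimal addition, which rounds after every step.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def exact_decimal_sum(values):
    """Sum ints and Decimals exactly, then round once to the active context."""
    total = sum(Fraction(v) for v in values)  # lossless for int and Decimal
    # Decimal(int) is exact; only the division rounds to the context precision.
    return Decimal(total.numerator) / Decimal(total.denominator)


with localcontext() as ctx:
    ctx.prec = 2  # tiny precision so the effect is visible
    values = [1, Decimal("0.04"), Decimal("0.04")]

    # Built-in sum rounds after each addition: 1.04 -> 1.0, then 1.04 -> 1.0.
    stepwise = sum(values)

    # Exact total is 1.08, rounded once to two digits.
    rounded_once = exact_decimal_sum(values)

print(stepwise)      # 1.0
print(rounded_once)  # 1.1
```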
If there is to be a documented limitation on mixing types then it should be explicitly about Decimal: The statistics module works very well with Decimal but doesn't really support mixing Decimal with other types. This is a limitation of Python rather than the statistics module itself. That being said I think that guaranteeing an error is better than the current order-dependent behaviour (and agree that that should be considered a bug).
If there is to be a more drastic rearrangement of the _sum function then it should actually be to solve the problem that the current implementation of mean, variance etc. uses Fractions for all the heavy lifting but then rounds in the wrong place (when returning from _sum()) rather than in the mean, variance function itself.
The clever algorithm in the variance function (unless it changed since I last looked) is entirely unnecessary when all of the intensive computation is performed with exact arithmetic. In the absence of rounding error you could compute a perfectly good variance using the computational formula for variance in a single pass. Similarly although the _sum() function is correctly rounded, the mean() function calls _sum() and then rounds again so that the return value from mean() is rounded twice. _sum() computes an exact value as a fraction and then coerces it with
return T(total_numerator) / total_denominator
so that the division causes it to be correctly rounded. However the mean function effectively ends up doing
return (T(total_numerator) / total_denominator) / num_items
which uses 2 divisions and hence rounds twice. It's trivial to rearrange that so that you round once
return T(total_numerator) / (total_denominator * num_items)
except that to do this the _sum function should be changed to return the exact result as a Fraction (and perhaps the type T). Similar changes would need to be made to the sum of squares function (_ss() IIRC). The double rounding in mean() isn't a big deal but the corresponding effect for the variance functions is significant. It was after realising this that the sum function was renamed _sum and made nominally private.
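The difference between rounding twice and rounding once can be seen in a small sketch. The helper names below are hypothetical and the code only mirrors the three return expressions above; it is not the module's implementation. A two-digit context is used so the effect shows up with simple numbers.

```python
import decimal
from decimal import Decimal, localcontext
from fractions import Fraction


def exact_total(data):
    """Exact sum of the data as a Fraction."""
    return sum(Fraction(x) for x in data)


def mean_rounded_twice(data):
    total = exact_total(data)
    rounded_sum = Decimal(total.numerator) / total.denominator  # first rounding
    return rounded_sum / len(data)                               # second rounding


def mean_rounded_once(data):
    total = exact_total(data)
    # Integer arithmetic in the denominator is exact; one division, one rounding.
    return Decimal(total.numerator) / (total.denominator * len(data))


with localcontext() as ctx:
    ctx.prec = 2
    ctx.rounding = decimal.ROUND_HALF_EVEN
    data = [Decimal("1.45"), Decimal("1.46")]  # exact mean is 1.455
    twice = mean_rounded_twice(data)  # 2.91 -> 2.9, then 1.45 -> 1.4
    once = mean_rounded_once(data)    # 1.455 -> 1.5

print(twice)  # 1.4
print(once)   # 1.5
```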
To be clear, statistics.variance(list_of_decimals) is very accurate. However it uses more passes than is necessary and it can be inaccurate in the situation that you have Decimals whose precision exceeds that of the current decimal context e.g.:
import decimal
d = decimal.Decimal('300000000000000000000000000000000000000000')
d
d+1 # Any arithmetic operation loses precision
+d # Use context precision
If you're using Fractions for all of your computation then you can change this since no precision is lost when calling Fraction(Decimal):
import fractions
fractions.Fraction(d)+1
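Putting these points together, a single-pass variance using the computational formula in exact arithmetic might look like the sketch below. The function name one_pass_variance and the data are assumptions for illustration, and the final conversion back to the input type (with a single rounding) is left out.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def one_pass_variance(data):
    """Sample variance via the computational formula, in exact arithmetic."""
    n = 0
    total = Fraction(0)
    total_sq = Fraction(0)
    for x in data:
        f = Fraction(x)  # lossless for int, float and Decimal
        n += 1
        total += f
        total_sq += f * f
    return (total_sq - total * total / n) / (n - 1)


# Decimals with 41 significant digits, more than a 28-digit context can hold.
base = 10**40
data = [Decimal(base + k) for k in (0, 1, 2)]

with localcontext() as ctx:
    ctx.prec = 28
    # Decimal arithmetic rounds to the context, so adding 1 changes nothing.
    absorbed = (data[0] + 1 == data[0])

# The Fraction-based computation keeps every digit.
exact_variance = one_pass_variance(data)

print(absorbed)        # True
print(exact_variance)  # 1
```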