statistics module in Python3.4

Dear all, I am still testing the new statistics module and I found two cases where the behavior of the module seems suboptimal to me. My most important concern is the module's internal _sum function and its implications; the other concerns passing Counter objects to the module's functions. As for the first subject: specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence, and type conversion follows quite complicated rules which - what is worse - make the outcome of _sum(), and thereby of mean(), dependent on the order of items in the input sequence, e.g.:
>>> mean((1, Fraction(2,3), 1.0, Decimal(2.3), 2.0, Decimal(5)))
1.9944444444444445
>>> mean((1, Fraction(2,3), Decimal(2.3), 1.0, 2.0, Decimal(5)))
Traceback (most recent call last):
  ...
TypeError
(This is because when _sum iterates over the input, type Fraction wins over int, then float wins over Fraction and over everything else that follows in the first example; in the second case Fraction wins over int again, but then Fraction vs. Decimal is undefined and throws an error.) Confusing, isn't it? So here's the code of the _sum function:

def _sum(data, start=0):
    """_sum(data [, start]) -> value

    Return a high-precision sum of the given numeric data. If
    optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    11.0

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    1000.0

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    Fraction(63, 20)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    Decimal('0.6963')
    """
    n, d = _exact_ratio(start)
    T = type(start)
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    coerce_types = _coerce_types
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and track the "current" type.
    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)

Internally, the function uses exact ratios for its calculations (which I think is very nice) and only goes through all the pain of coercing types to return T(total.numerator)/total.denominator, where T is the final type resulting from the chain of conversions. I think a much cleaner (and probably faster) implementation would be to first gather all the types in the input sequence, then decide what to return in an input-order-independent way. My tentative implementation:

def _sum2(data, start=None):
    if start is not None:
        t = set((type(start),))
        n, d = _exact_ratio(start)
    else:
        t = set()
        n = 0
        d = 1
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and build up a set of all types.
    for x in data:
        t.add(type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    T = _coerce_types(t)  # decide which type to use based on set of all types
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)

This leaves the re-implementation of _coerce_types. Personally, I'd prefer something as simple as possible, maybe even:

def _coerce_types(types):
    if len(types) == 1:
        return next(iter(types))
    return float

, but that's just a suggestion. In this case then:
>>> _sum2((1, Fraction(2,3), 1.0, Decimal(2.3), 2.0, Decimal(5))) / 6
1.9944444444444445
>>> _sum2((1, Fraction(2,3), Decimal(2.3), 1.0, 2.0, Decimal(5))) / 6
1.9944444444444445
Let's check the examples from the _sum docstring just to be sure:

>>> _sum2([3, 2.25, 4.5, -0.5, 1.0], 0.75)
11.0
>>> _sum2([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
1000.0
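Alternatively, if one wanted _coerce_types to stay order-independent but still follow the numeric tower, a sketch (just a suggestion of mine, not tested against the module) could be:

from decimal import Decimal

def _coerce_types_tower(types):
    # Sketch only: order-independent coercion that still follows the
    # numeric tower where possible. int may coerce to anything;
    # Fraction and float coerce upward; Decimal only mixes with int.
    types = set(types) - {int}
    if not types:
        return int
    if len(types) == 1:
        return types.pop()
    if Decimal in types:
        raise TypeError("cannot mix Decimal with other non-int types")
    return float  # e.g. {Fraction, float} -> float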
Now the second issue: it is maybe more a matter of taste, and concerns the effects of passing a Counter() object to various functions in the module. I know this is undocumented and it's probably the user's fault if they try that, but still: at first glance, mean, median, and mode all seem to handle a Counter (which is, after all, a natural representation of a frequency table).
But the truth is that only mode really works as you might think; with the other two we were just lucky.
I think there are two simple ways to avoid this pitfall: 1) add an explicit warning to the docs explaining this behavior, or 2) make mean and median do the same magic with Counters as mode does, i.e. make them check for Counter as the input type and deal with it as if it were a frequency table. I'd favor the second behavior because it requires little extra code, but may be very useful in many situations. I'm not even sure whether all mappings should perhaps be treated that way? Ok, that's it for now I guess. Opinions anyone? Best, Wolfgang
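To make option 2) concrete, here's a rough sketch (hypothetical code, not from the module; a real patch would keep routing the addition through _sum() for precision):

from collections import Counter

def mean(data):
    # Sketch of option 2): treat a Counter as a frequency table,
    # weighting each distinct value by its count.
    if isinstance(data, Counter):
        n = sum(data.values())
        if n < 1:
            raise ValueError("mean requires at least one data point")
        return sum(value * count for value, count in data.items()) / n
    data = list(data)
    if len(data) < 1:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

With this, mean(Counter([1, 1, 2, 3])) would give 1.75, the same as mean([1, 1, 2, 3]).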

On 27/01/2014 17:41, Wolfgang wrote:
So this doesn't get lost, I'd be inclined to raise two issues on the bug tracker. It's also much easier for people to follow the issues there and, better still, to see what the actual outcome is. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On 01/30/2014 03:27 PM, Mark Lawrence wrote:
Checking first is usually good policy, but now that you've had positive feedback, raising some issues on the bug tracker [1] is definitely a good idea. -- ~Ethan~ [1] http://bugs.python.org/issue?@template=item

On Mon, Jan 27, 2014 at 09:41:02AM -0800, Wolfgang wrote:
As the author of the module, I'm also concerned with the internal _sum function. That's why it's now a private function -- I originally intended for it to be a public function (see PEP 450).
I don't think so. The idea is that _sum() ought to reflect the standard, dare I say intuitive, behaviour of repeated application of the __add__ and __radd__ methods, as used by the plus operator. For example, int + <any numeric type> coerces to the other numeric type. What else would you expect? In mathematics the number 0.4 is the same whether you write it as 0.4, 2/5, 0.4+0j, [0; 2, 2] or any other notation you care to invent. (That last one is a continued fraction.) In Python, the number 0.4 is represented by a value and a type, and managing the coercion rules for the different types can be fiddly and annoying. But they shouldn't be *confusing* -- we have a numeric tower, and if I've written the code correctly, the coercion rules ought to follow the tower as closely as possible.
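For instance, pairwise addition climbs the tower (a quick illustration, not from the original mail):

>>> from fractions import Fraction
>>> type(1 + Fraction(1, 2))          # int + Fraction -> Fraction
<class 'fractions.Fraction'>
>>> type(Fraction(1, 2) + 1.0)        # Fraction + float -> float
<class 'float'>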
So here's the code of the _sum function: [...]
You should expect that to change, if for no other reason than performance. At the moment, _sum is about two orders of magnitude slower than the built-in sum. I think I can get it to about one order of magnitude slower.
Thanks for this. I will add that to my collection of alternate versions of _sum.
I don't want to coerce everything to float unnecessarily. Floats are, in some ways, the worst choice for numeric values, at least from the perspective of accuracy and correctness. Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:

py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False

One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions; they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types. [...]
Interesting. If you think there is a use-case for passing Counters to the statistics functions (weighted data?) then perhaps they can be explicitly supported in 3.5. It's way too late for 3.4 to introduce new functionality. [...]
Well yes :-) I'm open to the suggestion that Counters should be treated specially. Would you be so kind as to raise an issue in the bug tracker? Thanks for the feedback, -- Steven

On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.) ChrisA

On Jan 30, 2014, at 17:32, Chris Angelico <rosuav@gmail.com> wrote:
Except that large enough int values lose information, and even larger ones raise an exception:

>>> float(pow(3, 50)) == pow(3, 50)
False
>>> float(1<<2000)
OverflowError: int too large to convert to float

And that first one is the reason why statistics needs a custom sum in the first place. When there are only 2 types involved in the sequence, you get the answer you wanted. The only problem raised by the examples in this thread is that with 3 or more types that aren't all mutually coercible but do have a path through them, you can sometimes get imprecise answers and other times get exceptions, and you might come to rely on one or the other. So, rather than throwing out Steven's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
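To make that first point concrete, here is a small illustration (my example, not from the thread) of why exact ratios sidestep the information loss:

from fractions import Fraction

data = [1e50, 1.0, -1e50] * 1000
print(sum(data))  # 0.0: each 1.0 is absorbed by 1e50 along the way
# Keeping each float as an exact integer ratio lets the huge terms
# cancel exactly, which is essentially what statistics._sum does:
total = sum(Fraction(*x.as_integer_ratio()) for x in data)
print(float(total))  # 1000.0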

On Fri, Jan 31, 2014 at 2:47 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
I don't think it'd be possible to forbid int -> float coercion - the Python community (and Steven himself) would raise an outcry. But int->float is at least as safe as it's fundamentally possible to be. Adding ".0" to the end of a literal (thus making it a float literal) is, AFAIK, absolutely identical to wrapping it in "float(" and ")". That's NOT true of float -> Fraction or float -> Decimal - going via float will cost precision, but going via int ought to be safe.
>>> float(pow(3,50)) == pow(3.0,50)
True
The difference between int and any other type is going to be pretty much the same whether you convert first or convert last. The only distinction that I can think of is floating-point rounding errors, which are already dealt with (cf. the 1e50 example above).
Since it handles this correctly with all floats, it'll handle it just fine with some ints and some floats.
In this case, the builtin sum() happens to be correct, because it adds the first ones as ints, and then converts to float at the end. Of course, "correct" isn't quite correct - the true value based on real number arithmetic is ...95, as can be seen in Python if they're all ints. But I'm defining "correct" as "the same result that would be obtained by calculating in real numbers and then converting to the data type of the end result". And by that definition, builtin sum() is correct as long as the float is right at the end, and statistics._sum() is correct regardless of the order.
So in that sense, it's "safe" to cast all ints to float if the result is going to be float, unless an individual value is itself too big to convert even though the final result (thanks to negative values) would have been representable. I'm not sure how that's currently handled, but this particular case works:
>>> statistics._sum([1.0, 1<<2000, 0-(1<<2000)])
1.0
The biggest problem, then, is cross-casting between float, Fraction, and Decimal. And anyone who's mixing those is asking for trouble already. ChrisA
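For reference, this is the cross-cast that fails (plain Python behaviour, nothing specific to statistics):

>>> from fractions import Fraction
>>> from decimal import Decimal
>>> Fraction(1, 3) + Decimal("0.5")
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'Fraction' and 'decimal.Decimal'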

On 31 January 2014 03:47, Andrew Barnert <abarnert@yahoo.com> wrote:
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type

See also this thread (that I started during extensive off-list discussions about the statistics.sum function with Steven): https://mail.python.org/pipermail//python-ideas/2013-August/023034.html

Decimal makes the following concessions for mixing numeric types: 1) it will promote integers in arithmetic; 2) it will compare correctly against all numeric types (as long as FloatOperation isn't trapped); 3) it will coerce int and float in its constructor.

The recently added FloatOperation trap suggests that there's more interest in prohibiting the mixing of Decimals with other numeric types than in facilitating it. I can imagine joining that camp myself: speaking as someone who finds uses for both the fractions module and the decimal module, I feel qualified to say that there is no good use case for mixing these types. Similarly, there's no good use case for mixing floats with Fractions or Decimals, although mixing float/Fraction does work. If you choose to use Decimals then it is precisely because you do need to care about the numeric types you use and the sort of accuracy they provide. If you find yourself mixing Decimals with other numeric types then it's more likely a mistake/bug than a convenience.

In any case, the current implementation of statistics._sum (AIUI, I don't have it to hand for testing) will do the right thing for any mix of types in the numeric tower. It will also do the right thing for Decimals: it will compute the exact result and then round once according to the current decimal context. It's also possible to mix int and Decimal, but there's no sensible way to handle mixing Decimal with anything else. If there is to be a documented limitation on mixing types then it should be explicitly about Decimal: the statistics module works very well with Decimal but doesn't really support mixing Decimal with other types. This is a limitation of Python rather than of the statistics module itself. That being said, I think that guaranteeing an error is better than the current order-dependent behaviour (and I agree that that should be considered a bug).

If there is to be a more drastic rearrangement of the _sum function, then it should actually be to solve the problem that the current implementations of mean, variance etc. use Fractions for all the heavy lifting but then round in the wrong place (when returning from _sum()) rather than in the mean or variance function itself. The clever algorithm in the variance function (unless it has changed since I last looked) is entirely unnecessary when all of the intensive computation is performed with exact arithmetic. In the absence of rounding error you could compute a perfectly good variance using the computational formula for variance in a single pass.

Similarly, although the _sum() function is correctly rounded, the mean() function calls _sum() and then rounds again, so that the return value from mean() is rounded twice. _sum() computes an exact value as a fraction and then coerces it with

    return T(total_numerator) / total_denominator

so that the division causes it to be correctly rounded. However, the mean function effectively ends up doing

    return (T(total_numerator) / total_denominator) / num_items

which uses 2 divisions and hence rounds twice. It's trivial to rearrange that so that you round once:

    return T(total_numerator) / (total_denominator * num_items)

except that to do this the _sum function should be changed to return the exact result as a Fraction (and perhaps the type T). Similar changes would need to be made to the sum of squares function (_ss() IIRC). The double rounding in mean() isn't a big deal, but the corresponding effect for the variance functions is significant. It was after realising this that the sum function was renamed _sum and made nominally private.

To be clear, statistics.variance(list_of_decimals) is very accurate. However, it uses more passes than is necessary, and it can be inaccurate in the situation where you have Decimals whose precision exceeds that of the current decimal context. If you're using Fractions for all of your computation then you can change this, since no precision is lost when calling Fraction(Decimal).
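Here is a small self-contained demonstration of that double rounding (my construction, with an artificially tiny context precision; the module's real code paths differ):

from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 3  # deliberately tiny precision to expose the effect
data = [Decimal("1.005")] * 3
total = sum(Fraction(d) for d in data)  # exact: Fraction(603, 200)
n = len(data)

# Two divisions -> two roundings, as in mean() calling _sum():
twice = (Decimal(total.numerator) / total.denominator) / n
# One division -> a single, correct rounding:
once = Decimal(total.numerator) / (total.denominator * n)
print(twice, once)  # 1.01 versus 1.00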
Oscar

On 1 February 2014 23:32, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
http://bugs.python.org/issue20481 now covers the concern that we should avoid making any guarantees that the current type coercion behaviour of the statistics module will be preserved indefinitely (it includes a link back to the archived copy of Oscar's post on mail.python.org). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan <ncoghlan@...> writes:
Thanks a lot, Nick, for all your efforts in filing the bugs. I just added a possible patch for http://bugs.python.org/issue20481 to the bug tracker. Best, Wolfgang

Chris Angelico <rosuav@...> writes:
Well, that's simple to stick to as long as you are dealing with explicitly typed input data sets, but what about things like:

a = transform_a_series_of_data_somehow(data)
b = transform_this_series_differently(data)
statistics.mean(a + b)  # assuming a and b are lists of transformed values

Potentially different types are far more difficult to spot here, and the fact that the result of the above might not be the same as, e.g.,

statistics.mean(b + a)

is not making things easier to debug.
(Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
It should be mentioned here that complex numbers are not currently dealt with by statistics._sum.
Best, Wolfgang

Steven D'Aprano writes:
Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:
AFAIK it is.
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
This is a failure of associativity, not commutativity. Associativity is in many ways a more fundamental property.

On Fri, Jan 31, 2014 at 02:56:39PM +0900, Stephen J. Turnbull wrote:
Oops, you are correct. I got them mixed up. http://en.wikipedia.org/wiki/Associativity However, commutativity of addition can be violated by Python numeric types, although not by floats alone. E.g. the example I gave earlier of two int subclasses. -- Steven
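Steven's earlier int-subclass example isn't quoted in this thread, but a hypothetical reconstruction shows how the *type* of a sum can depend on operand order even when the value doesn't:

class Inches(int):
    def __add__(self, other):
        return Inches(int(self) + int(other))
    __radd__ = __add__

class Feet(int):
    def __add__(self, other):
        return Feet(int(self) + int(other))
    __radd__ = __add__

a, b = Inches(12), Feet(1)
print(a + b == b + a)            # True: the values agree...
print(type(a + b), type(b + a))  # ...but Inches vs. Feet: order decides the type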

On Jan 30, 2014, at 21:56, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Yeah, the only way commutativity can fail with IEEE floats is if you treat nan as a number and have at least two nans, at least one of them quiet. But associativity failing isn't really fundamental. This example fails as a consequence of the axiom of (additive) identity not holding. (There is a unique "zero", but it's not true that, for all y, x+y=y implies x is that zero.) The overflow example fails because of closure not holding (unless you count inf and nan as numbers, in which case it again fails because zero fails even more badly). If you just meant that you lose commutativity before associativity in compositions over fields, then yeah, I guess in that sense associativity is more fundamental.
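The identity-axiom point is easy to check directly (my one-liner, not from the original mail):

>>> 1e19 + 0.1 == 1e19   # 0.1 acts as an additive identity here, yet 0.1 != 0
True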

participants (11)
- Andrew Barnert
- Chris Angelico
- Ethan Furman
- Mark Lawrence
- Nick Coghlan
- Oscar Benjamin
- Stephen J. Turnbull
- Steven D'Aprano
- Wolfgang
- Wolfgang Maier
- Yury Selivanov