Dear all,

I am still testing the new statistics module and I found two cases were the behavior of the module seems suboptimal to me.

My most important concern is the module's internal _sum function and its implications, the other one about passing Counter objects to module functions.

As for the first subject:

Specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence and type conversion follows quite complicated rules, and - what is worst - make the outcome of _sum() and thereby mean() dependent on the order of items in the input sequence, e.g.:

>>> mean((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))

1.9944444444444445

>>> mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))

Traceback (most recent call last):

File "<pyshell#7>", line 1, in <module>

mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))

File "C:\Python33\statistics.py", line 369, in mean

return _sum(data)/n

File "C:\Python33\statistics.py", line 157, in _sum

T = _coerce_types(T, type(x))

File "C:\Python33\statistics.py", line 327, in _coerce_types

raise TypeError('cannot coerce types %r and %r' % (T1, T2))

TypeError: cannot coerce types <class 'fractions.Fraction'> and <class 'decimal.Decimal'>

(this is because when _sum iterates over the input type Fraction wins over int, then float wins over Fraction and over everything else that follows in the first example, but in the second case Fraction wins over int, but then Fraction vs Decimal is undefined and throws an error).

Confusing, isn't it? So here's the code of the _sum function:

def _sum(data, start=0):

"""_sum(data [, start]) -> value

Return a high-precision sum of the given numeric data. If optional

argument ``start`` is given, it is added to the total. If ``data`` is

empty, ``start`` (defaulting to 0) is returned.

Examples

--------

>>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)

11.0

Some sources of round-off error will be avoided:

>>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.

1000.0

Fractions and Decimals are also supported:

>>> from fractions import Fraction as F

>>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])

Fraction(63, 20)

>>> from decimal import Decimal as D

>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]

>>> _sum(data)

Decimal('0.6963')

"""

n, d = _exact_ratio(start)

T = type(start)

partials = {d: n} # map {denominator: sum of numerators}

# Micro-optimizations.

coerce_types = _coerce_types

exact_ratio = _exact_ratio

partials_get = partials.get

# Add numerators for each denominator, and track the "current" type.

for x in data:

T = _coerce_types(T, type(x))

n, d = exact_ratio(x)

partials[d] = partials_get(d, 0) + n

if None in partials:

assert issubclass(T, (float, Decimal))

assert not math.isfinite(partials[None])

return T(partials[None])

total = Fraction()

for d, n in sorted(partials.items()):

total += Fraction(n, d)

if issubclass(T, int):

assert total.denominator == 1

return T(total.numerator)

if issubclass(T, Decimal):

return T(total.numerator)/total.denominator

return T(total)

Internally, the function uses exact ratios for its calculations (which I think is very nice) and only goes through all the pain of coercing types to return

T(total.numerator)/total.denominator

where T is the final type resulting from the chain of conversions.

I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way. My tentative implementation:

def _sum2(data, start=None):

if start is not None:

t = set((type(start),))

n, d = _exact_ratio(start)

else:

t = set()

n = 0

d = 1

partials = {d: n} # map {denominator: sum of numerators}

# Micro-optimizations.

exact_ratio = _exact_ratio

partials_get = partials.get

# Add numerators for each denominator, and build up a set of all types.

for x in data:

t.add(type(x))

n, d = exact_ratio(x)

partials[d] = partials_get(d, 0) + n

T = _coerce_types(t) # decide which type to use based on set of all types

if None in partials:

assert issubclass(T, (float, Decimal))

assert not math.isfinite(partials[None])

return T(partials[None])

total = Fraction()

for d, n in sorted(partials.items()):

total += Fraction(n, d)

if issubclass(T, int):

assert total.denominator == 1

return T(total.numerator)

if issubclass(T, Decimal):

return T(total.numerator)/total.denominator

return T(total)

this leaves the re-implementation of _coerce_types. Personally, I'd prefer something as simple as possible, maybe even:

def _coerce_types (types):

if len(types) == 1:

return next(iter(types))

return float

, but that's just a suggestion.

In this case then:

>>> _sum2((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))/6

1.9944444444444445

>>> _sum2((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))/6

1.9944444444444445

lets check the examples from the _sum docstring just to be sure:

>>> _sum2([3, 2.25, 4.5, -0.5, 1.0], 0.75)

11.0

>>> _sum2([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.

1000.0

>>> from fractions import Fraction as F

>>> _sum2([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])

Fraction(63, 20)

>>> from decimal import Decimal as D

>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]

>>> _sum2(data)

Decimal('0.6963')

Now the second issue:

It is maybe more a matter of taste and concerns the effects of passing a Counter() object to various functions in the module.

I know this is undocumented and it's probably the user's fault if he tries that, but still:

>>> from collections import Counter

>>> c=Counter((1,1,1,1,2,2,2,2,2,3,3,3,3))

>>> c

Counter({1: 4, 2: 5, 3: 4})

>>> mode(c)

2

Cool, mode knows how to work with Counters (interpreting them as frequency tables)

>>> median(c)

2

Looks good

>>> mean(c)

2.0

Very well

But the truth is that only mode really works as you may think and we were just lucky with the other two:

>>> c=Counter((1,1,2))

>>> mean(c)

1.5

oops

>>> median(c)

1.5

hmm

From a quick look at the code you can see that mode actually converts your input to a Counter behind the scenes anyway, so it has no problem.

mean and median, on the other hand, are simply iterating over their input, so if that input happens to be a mapping, they'll use just the keys.

I think there are two simple ways to avoid this pitfall:

1) add an explicit warning to the docs explaining this behavior or

2) make mean and median do the same magic with Counters as mode does, i.e. make them check for Counter as the input type and deal with it as if it were a frequency table. I'd favor this behavior because it looks like little extra code, but may be very useful in many situations. I'm not quite sure whether maybe even all mappings should be treated that way?

Ok, that's it for now I guess. Opinions anyone?

Best,

Wolfgang

I am still testing the new statistics module and I found two cases were the behavior of the module seems suboptimal to me.

My most important concern is the module's internal _sum function and its implications, the other one about passing Counter objects to module functions.

As for the first subject:

Specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence and type conversion follows quite complicated rules, and - what is worst - make the outcome of _sum() and thereby mean() dependent on the order of items in the input sequence, e.g.:

>>> mean((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))

1.9944444444444445

>>> mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))

Traceback (most recent call last):

File "<pyshell#7>", line 1, in <module>

mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))

File "C:\Python33\statistics.py", line 369, in mean

return _sum(data)/n

File "C:\Python33\statistics.py", line 157, in _sum

T = _coerce_types(T, type(x))

File "C:\Python33\statistics.py", line 327, in _coerce_types

raise TypeError('cannot coerce types %r and %r' % (T1, T2))

TypeError: cannot coerce types <class 'fractions.Fraction'> and <class 'decimal.Decimal'>

(this is because when _sum iterates over the input type Fraction wins over int, then float wins over Fraction and over everything else that follows in the first example, but in the second case Fraction wins over int, but then Fraction vs Decimal is undefined and throws an error).

Confusing, isn't it? So here's the code of the _sum function:

def _sum(data, start=0):

"""_sum(data [, start]) -> value

Return a high-precision sum of the given numeric data. If optional

argument ``start`` is given, it is added to the total. If ``data`` is

empty, ``start`` (defaulting to 0) is returned.

Examples

--------

>>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)

11.0

Some sources of round-off error will be avoided:

>>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.

1000.0

Fractions and Decimals are also supported:

>>> from fractions import Fraction as F

>>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])

Fraction(63, 20)

>>> from decimal import Decimal as D

>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]

>>> _sum(data)

Decimal('0.6963')

"""

n, d = _exact_ratio(start)

T = type(start)

partials = {d: n} # map {denominator: sum of numerators}

# Micro-optimizations.

coerce_types = _coerce_types

exact_ratio = _exact_ratio

partials_get = partials.get

# Add numerators for each denominator, and track the "current" type.

for x in data:

T = _coerce_types(T, type(x))

n, d = exact_ratio(x)

partials[d] = partials_get(d, 0) + n

if None in partials:

assert issubclass(T, (float, Decimal))

assert not math.isfinite(partials[None])

return T(partials[None])

total = Fraction()

for d, n in sorted(partials.items()):

total += Fraction(n, d)

if issubclass(T, int):

assert total.denominator == 1

return T(total.numerator)

if issubclass(T, Decimal):

return T(total.numerator)/total.denominator

return T(total)

Internally, the function uses exact ratios for its calculations (which I think is very nice) and only goes through all the pain of coercing types to return

T(total.numerator)/total.denominator

where T is the final type resulting from the chain of conversions.

I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way. My tentative implementation:

def _sum2(data, start=None):

if start is not None:

t = set((type(start),))

n, d = _exact_ratio(start)

else:

t = set()

n = 0

d = 1

partials = {d: n} # map {denominator: sum of numerators}

# Micro-optimizations.

exact_ratio = _exact_ratio

partials_get = partials.get

# Add numerators for each denominator, and build up a set of all types.

for x in data:

t.add(type(x))

n, d = exact_ratio(x)

partials[d] = partials_get(d, 0) + n

T = _coerce_types(t) # decide which type to use based on set of all types

if None in partials:

assert issubclass(T, (float, Decimal))

assert not math.isfinite(partials[None])

return T(partials[None])

total = Fraction()

for d, n in sorted(partials.items()):

total += Fraction(n, d)

if issubclass(T, int):

assert total.denominator == 1

return T(total.numerator)

if issubclass(T, Decimal):

return T(total.numerator)/total.denominator

return T(total)

this leaves the re-implementation of _coerce_types. Personally, I'd prefer something as simple as possible, maybe even:

def _coerce_types (types):

if len(types) == 1:

return next(iter(types))

return float

, but that's just a suggestion.

In this case then:

>>> _sum2((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))/6

1.9944444444444445

>>> _sum2((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))/6

1.9944444444444445

lets check the examples from the _sum docstring just to be sure:

>>> _sum2([3, 2.25, 4.5, -0.5, 1.0], 0.75)

11.0

>>> _sum2([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.

1000.0

>>> from fractions import Fraction as F

>>> _sum2([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])

Fraction(63, 20)

>>> from decimal import Decimal as D

>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]

>>> _sum2(data)

Decimal('0.6963')

Now the second issue:

It is maybe more a matter of taste and concerns the effects of passing a Counter() object to various functions in the module.

I know this is undocumented and it's probably the user's fault if he tries that, but still:

>>> from collections import Counter

>>> c=Counter((1,1,1,1,2,2,2,2,2,3,3,3,3))

>>> c

Counter({1: 4, 2: 5, 3: 4})

>>> mode(c)

2

Cool, mode knows how to work with Counters (interpreting them as frequency tables)

>>> median(c)

2

Looks good

>>> mean(c)

2.0

Very well

But the truth is that only mode really works as you may think and we were just lucky with the other two:

>>> c=Counter((1,1,2))

>>> mean(c)

1.5

oops

>>> median(c)

1.5

hmm

From a quick look at the code you can see that mode actually converts your input to a Counter behind the scenes anyway, so it has no problem.

mean and median, on the other hand, are simply iterating over their input, so if that input happens to be a mapping, they'll use just the keys.

I think there are two simple ways to avoid this pitfall:

1) add an explicit warning to the docs explaining this behavior or

2) make mean and median do the same magic with Counters as mode does, i.e. make them check for Counter as the input type and deal with it as if it were a frequency table. I'd favor this behavior because it looks like little extra code, but may be very useful in many situations. I'm not quite sure whether maybe even all mappings should be treated that way?

Ok, that's it for now I guess. Opinions anyone?

Best,

Wolfgang