[Python-ideas] statistics module in Python3.4

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Fri Jan 31 09:57:26 CET 2014


Steven D'Aprano <steve at ...> writes:

> 
> On Thu, Jan 30, 2014 at 11:03:38AM -0800, Larry Hastings wrote:
> > On Mon, Jan 27, 2014 at 9:41 AM, Wolfgang 
> > <wolfgang.maier at ... 
> > <mailto:wolfgang.maier at ...>> wrote:
> > >I think a much cleaner (and probably faster) implementation would be 
> > >to gather first all the types in the input sequence, then decide what 
> > >to return in an input order independent way.
> > 
> > I'm willing to consider this a "bug fix".  And since it's a new function 
> > in 3.4, we don't have an installed base.  So I'm willing to consider 
> > fixing this for 3.4.
> 
> I'm hesitant to require two passes over the data in _sum. Some 
> higher-order statistics like variance are currently implemented using 
> two passes, but ultimately I'd like to support single-pass algorithms 
> that can operate on large but finite iterators.
> 
> But I will consider it as an option.
> 
> I'm also hesitant to make the promise that _sum will be 
> order-independent. Addition in Python isn't:
> 
> py> class A(int):
> ...     def __add__(self, other):
> ...             return type(self)(super().__add__(other))
> ...     def __repr__(self):
> ...             return "%s(%d)" % (type(self).__name__, self)
> ...
> py> class B(A):
> ...     pass
> ...
> py> A(1) + B(1)
> A(2)
> py> B(1) + A(1)
> B(2)

Hi Steven,
first of all let me say that I am quite amazed by the extent of the
discussion that is going on now. All I really meant is that there are two
special cases (mixed types in _sum and Counters as input to some functions)
that I find worth reconsidering in an otherwise really useful module.

Regarding your comments above and in other posts:
I never proposed two passes over the data. My implementation (repeated below,
since many people seem to have missed it in my first, rather long post)
gathers the input types in a set **while** calculating the sum, in a single
for loop. It then calls _coerce_types exactly once, passing it this set:

def _sum2(data, start=None):
    # note: relies on the statistics module's private helpers (_exact_ratio,
    # _coerce_types) and its imports (math, Fraction, Decimal)
    if start is not None:
        t = set((type(start),))
        n, d = _exact_ratio(start)
    else:
        t = set()
        n = 0
        d = 1
    partials = {d: n}  # map {denominator: sum of numerators}

    # Micro-optimizations.
    exact_ratio = _exact_ratio
    partials_get = partials.get

    # Add numerators for each denominator, and build up a set of all types.
    for x in data:
        t.add(type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    T = _coerce_types(t) # decide which type to use based on set of all types
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total) 
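For what it's worth, the one-pass idea can also be shown in isolation. Below
is a minimal standalone sketch (the names single_pass_sum and coerce_types
are mine, and it ignores the module's special handling of infinities, NANs
and Decimal); it only illustrates collecting the types while summing and
coercing once at the end:

```python
from fractions import Fraction

def coerce_types(types):
    # simplest possible rule: a single type is kept, mixed types give float
    if len(types) == 1:
        return next(iter(types))
    return float

def single_pass_sum(data):
    types = set()
    total = Fraction(0)
    for x in data:              # a single pass over the data
        types.add(type(x))      # collect types *while* summing
        total += Fraction(x)
    T = coerce_types(types)     # decide the return type once, from a set
    return T(total)

# the result type no longer depends on input order
assert type(single_pass_sum([1, 2.5])) is float
assert type(single_pass_sum([2.5, 1])) is float
assert single_pass_sum([1, 2, 3]) == 6
```

Because the accumulator is an exact Fraction, the sketch also keeps the
exactness property that _sum's partials dictionary is there to provide.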

As for my tentative implementation of _coerce_types, it was really meant as
an example. Specifically, I said:

> Personally, I'd prefer something as simple as possible, maybe even:
>
> def _coerce_types (types):
>    if len(types) == 1:
>        return next(iter(types))
>    return float
>
> , but that's just a suggestion.

It is totally up to you to come up with something more along the lines of
your original, but I still think that making the behavior order-independent
comes at no performance cost (if anything, a gain) and will make _sum's
return type more predictable. When I said the current behavior was confusing,
I didn't mean it was illogical. The current rules are in fact very precisely
worked out; I just think they are too complicated to reason through every
time.

You are right, of course, that addition in Python is also order-dependent
with respect to the returned type, but in my opinion that is not the point
here. You emphasize that _sum is a private function of the module, but mean
is a public one, and the behavior of mean is dictated by that of _sum. When I
call mean, I can of course guess that it is most likely implemented by adding
all the values and dividing by their number, but by encapsulation principles
I shouldn't have to think about the implementation to understand the
function's return value. In other words, it doesn't help that _sum mirrors
the behavior of __add__; all that should matter is that the behavior of
mean() is simple to explain and understand.

Again, this is just an opinion of somebody interested in having this
particular module well-designed from the beginning before things are set in
stone.

Best wishes,
Wolfgang


