
Steven D'Aprano <steve@...> writes:
On Thu, Jan 30, 2014 at 11:03:38AM -0800, Larry Hastings wrote:
On Mon, Jan 27, 2014 at 9:41 AM, Wolfgang <wolfgang.maier@... <mailto:wolfgang.maier@...>> wrote:
I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way.
I'm willing to consider this a "bug fix". And since it's a new function in 3.4, we don't have an installed base. So I'm willing to consider fixing this for 3.4.
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I've like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't:
py> class A(int): ... def __add__(self, other): ... return type(self)(super().__add__(other)) ... def __repr__(self): ... return "%s(%d)" % (type(self).__name__, self) ... py> class B(A): ... pass ... py> A(1) + B(1) A(2) py> B(1) + A(1) B(2)
Hi Steven, first of all let me say that I am quite amazed by the extent of the discussion that is going on now. All I really meant is that there are two special cases (mixed types in _sum and Counters as input to some functions) that I find worth reconsidering in an otherwise really useful module. Regarding your comments above and in other posts: I never proposed two passes over the data. My implementation (below again because many people seem to have missed it in my first rather long post) gathers the input types in a set **while** calculating the sum in a single for loop. It then calls _coerce_types passing this set only once: def _sum2(data, start=None): if start is not None: t = set((type(start),)) n, d = _exact_ratio(start) else: t = set() n = 0 d = 1 partials = {d: n} # map {denominator: sum of numerators} # Micro-optimizations. exact_ratio = _exact_ratio partials_get = partials.get # Add numerators for each denominator, and build up a set of all types. for x in data: t.add(type(x)) n, d = exact_ratio(x) partials[d] = partials_get(d, 0) + n T = _coerce_types(t) # decide which type to use based on set of all types if None in partials: assert issubclass(T, (float, Decimal)) assert not math.isfinite(partials[None]) return T(partials[None]) total = Fraction() for d, n in sorted(partials.items()): total += Fraction(n, d) if issubclass(T, int): assert total.denominator == 1 return T(total.numerator) if issubclass(T, Decimal): return T(total.numerator)/total.denominator return T(total) As for my tentative implementation of _coerce_types, it was really meant as an example. Specifically, I said:
Personally, I'd prefer something as simple as possible, maybe even:
def _coerce_types (types): if len(types) == 1: return next(iter(types)) return float
, but that's just a suggestion.
It is totally up to you to come up with something more along the lines of your original, but I still think that making the behavior order-independent comes at no performance-cost (if not a gain) and will make _sum's return type more predictable. When I said the current behavior was confusing, I didn't mean "not logical" or anything. The current rules are in fact very precisely worked out, I just think they are too complicated to think them through every time. You are right of course with your remark that addition in Python is also order-dependent regarding the returned type, but in my opinion this is not the point here. You are emphasizing that _sum is a private function of the module, but mean is a public one and the behavior of mean is dictated by that of _sum. Now when I call the mean function, then, of course, I know that this will very most likely be implemented as adding all values then dividing by their number, but in terms of encapsulation principles I shouldn't be forced to think about this to understand the return value of the function. In other words, it doesn't help here that _sum reflects the behavior of __add__, all you should care about is that the behavior of mean() is simple to explain and understand. Again, this is just an opinion of somebody interested in having this particular module well-designed from the beginning before things are set in stone. Best wishes, Wolfgang