[Python-ideas] sum(...) limitation

Wed Aug 13 18:37:30 CEST 2014

I'm going to cut straight to the chase here because this thread, and its 
related ones, on Python-Dev are giving me a headache and overloading my 
inbox. So I'm going to make a probably futile :-( attempt to cut off 
yet another huge thread before it starts by explaining why I think sum() 
ought to stay exactly as it is.

Built-in sum() is already quite complex. It has a fast path and a slow 
path, and that's just for numbers. While it's tempting to imagine sum() 
being even more clever and able to handle more cases, that increases the 
complexity and makes it more likely to end up slower rather than faster, 
or buggy, or both. Better to let the caller choose a specialist function 
(like numpy.ndarray.sum, or math.fsum, or str.join) that handles the 
caller's specialist needs, than to try to make sum() master of 
everything. The more special cases sum() has, the more the pressure to 
add even more.

In the statistics module, I have a private _sum() function which tries 
really hard to deal with high-accuracy sums of mixed arbitrary numeric 
types without compromising too badly on speed, and it's much harder than 
it seems. Trying to handle non-numeric types too increases the 
complexity significantly. If you're smarter than me (I expect that many 
of you probably are) and believe that you can write a version of sum() 
which:

(1) is fast
(2) has better than O(N**2) performance
(3) is correct
(4) is accurate
(5) handles INFs and NANs (for those types which have them)
(6) handles mixed types (for those types which allow mixing)
(7) honours subclasses with custom __add__ and __radd__ methods
(8) and keeps the expected semantics that sum() is like repeated 
    addition (or concatenation)

and does so for *both* numeric and non-numeric cases (like strings, 
bytes, tuples, lists), then PLEASE write some code and publish it. I for 
one would love to see it or use it for the statistics module. But until 
you have tried writing such a thing, whether in C or Python, I think 
you're probably underestimating how hard it is and how fragile the 
result will be.

So, a plea: please stop trying to overload poor ol' built-in sum. 
sum() is *not* the One Obvious Way to add arbitrary objects in every 
domain. sum() is intended for simple cases of adding numbers; it is not 
intended as a specialist summation function for everything under the sun 
that can be added or concatenated.

A bit of history, as I remember it: sum() exists because for half of 
Python's lifetime, people were regularly defining this:

    def sum(numbers):
        return reduce(lambda a, b: a+b, numbers)

so they could easily add up a bunch of numbers:

    num_pages = sum([ch.pages() for ch in self.chapters])

sort of thing. Since this was a common need, it was decided to add it to 
the built-ins. But sum() was never intended to replace str.join or
list.extend, let alone even more exotic cases.

Built-in sum is aimed at sequences of numbers, not strings, lists, 
tuples, or Widgets for that matter. Perhaps giving it a start parameter 
was a mistake, but it is there and backwards compatibility says it isn't 
going to be removed. But that doesn't mean that the use of sum() on 
arbitrary types ought to be *encouraged*, even if it is *allowed*.
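
A quick sketch of what "allowed but not encouraged" looks like in 
practice, assuming CPython 3. The variable names are made up for 
illustration and are not from the thread:

```python
from itertools import chain

nested = [[1, 2], [3], [4, 5]]

# Allowed: with a list as the start value, sum() concatenates the lists,
# but it builds a brand-new list at every step.
flat_sum = sum(nested, [])

# Usually preferred: a tool designed for joining iterables.
flat_chain = list(chain.from_iterable(nested))

# Strings are explicitly refused: sum() raises TypeError and the message
# points the caller at str.join instead.
str_error = None
try:
    sum(["a", "b", "c"], "")
except TypeError as exc:
    str_error = str(exc)

joined = "".join(["a", "b", "c"])
```

Both flattening approaches give the same list; only the sum() version 
copies the accumulated result on every addition.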

Conceptually, sum() is intended to behave like:

    for value in sequence:
        start = start + value

That means calling custom __add__ or __radd__ methods if they exist. It 
also means that sum() cannot delegate to (say) str.join() without 
changing the semantics. Given:

    class Special(str):
        def __radd__(self, other):
            print("I'm special!")
            return other + str(self)

    s = Special('y')

the sum 'x' + s is *not* the same as ''.join(['x', s]). A similar 
argument applies to list.extend, and the source code in bltinmodule.c 
already makes that point.
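
For anyone who wants to check the difference themselves, here is a 
minimal sketch. It is a variant of the Special class above that records 
calls in a list (the `calls` list and the other variable names are my 
additions, purely for illustration) instead of printing:

```python
calls = []  # records every time the custom __radd__ runs


class Special(str):
    def __radd__(self, other):
        calls.append(other)
        return other + str(self)


s = Special("y")

added = "x" + s               # goes through Special.__radd__
calls_after_add = len(calls)

joined = "".join(["x", s])    # str.join never calls __radd__
calls_after_join = len(calls)
```

The two results compare equal as strings, but only the addition ran the 
custom method, which is exactly the semantic difference that stops sum() 
from quietly delegating to str.join().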

Replying to a couple of points from Stephen:

On Wed, Aug 13, 2014 at 03:21:42PM +0900, Stephen J. Turnbull wrote:

>  > sum() can be used for any type that has an __add__ defined.
> 
> I'd like to see that be mutable types with __iadd__.

Surely you don't mean that. That would mean that sum([1, 2, 3]) would no 
longer work, since ints are not mutable types with __iadd__.
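
A small check of that claim, as a sketch with illustrative variable 
names:

```python
# int is immutable and defines no in-place add; list does.
int_has_iadd = hasattr(int, "__iadd__")
list_has_iadd = hasattr(list, "__iadd__")

# "+=" still works on ints, but only because Python falls back to
# __add__ and rebinds the name to a new object.
total = 0
for n in [1, 2, 3]:
    total += n

# For a list, "+=" really does mutate the existing object in place.
items = []
alias = items
items += [1, 2, 3]
same_object = alias is items
```

So a sum() that required __iadd__ would reject the very case it was 
built for.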

[...]
> Summing tuples works (with appropriate start=tuple()).  Haven't
> benchmarked, but I bet that's O(N^2).

Correct: increasing the number of tuples being added by a factor of 10 
requires roughly a factor of 100 more time (about 60x, then about 140x, 
in the timings below):

py> t = tuple([(i,) for i in range(1000)])
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 0.003805 seconds
py> t *= 10
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 0.230225 seconds
py> t *= 10
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 32.206471 seconds
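
Stopwatch is not in the standard library. The sketch below is a 
hypothetical stand-in built on time.perf_counter (an assumption about 
how such a timer could be written, not the one used for the numbers 
above), together with a linear-time alternative using 
itertools.chain.from_iterable for comparison:

```python
import time
from itertools import chain


class Stopwatch:
    """Hypothetical timer: a context manager that reports elapsed time."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.elapsed = time.perf_counter() - self.start
        print("time taken: %f seconds" % self.elapsed)
        return False  # never swallow exceptions


t = tuple([(i,) for i in range(1000)])

# Repeated tuple addition: each step copies everything accumulated so far.
with Stopwatch() as quadratic:
    via_sum = sum(t, ())

# chain.from_iterable walks each element once, then builds one tuple.
with Stopwatch() as linear:
    via_chain = tuple(chain.from_iterable(t))
```

Both give the same tuple. The actual timings depend on the machine and 
the Python version, so none are quoted here.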

> My argument is that in practical use sum() is a bad idea, period,
> until you book up on the types and applications where it *does* work.
> N.B. It doesn't even work properly for numbers (inaccurate for floats).

sum() works fine for its intended uses, especially:

- summing ints exactly
- "low precision" sums of floats

I put "low precision" in scare quotes because, for many purposes, that 
precision is plenty high enough. For the use-case of adding together a 
dozen or a hundred positive floats of similar magnitude, sum() is fine. 
It's only advanced and high-precision uses where it falls short.

To put it another way: if you want to add the mass of Jupiter to the 
mass of a flea, you probably want math.fsum(). If you want to add the 
weight of an apple to the weight of a banana, both measured on a typical 
kitchen scale, sum() will do the job perfectly adequately.
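
To make the flea-and-Jupiter point concrete, here is a sketch that 
compares plain left-to-right addition with math.fsum. The naive_sum 
helper is hypothetical and simply spells out the "repeated addition" 
model. The built-in sum() is left out on purpose: how accurately it adds 
floats depends on the Python version (newer CPython releases, 3.12 
onwards as far as I know, use a more accurate algorithm).

```python
from math import fsum


def naive_sum(values, start=0.0):
    """Hypothetical helper: plain repeated addition, left to right."""
    total = start
    for value in values:
        total = total + value
    return total


# Flea plus Jupiter: the small term vanishes in the naive running total.
mixed = [1e100, 1.0, -1e100]
naive_mixed = naive_sum(mixed)   # 0.0
exact_mixed = fsum(mixed)        # 1.0

# Ten tenths: rounding error accumulates a little at each step.
tenths = [0.1] * 10
naive_tenths = naive_sum(tenths)  # 0.9999999999999999
exact_tenths = fsum(tenths)       # 1.0

# Apple plus banana in kilograms: the error is far below what a kitchen
# scale can resolve.
kitchen_error = abs(naive_sum([0.182, 0.118]) - 0.3)
```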

>  > while we are at it, having the default sum() for floats be fsum()
>  > would be nice
> 
> How do you propose to implement that, given math.fsum is perfectly
> happy to sum integers?

And you really don't want sum(integers) to convert to float by default:

py> from math import fsum
py> fsum([10**30, 1234])
1e+30
py> sum([10**30, 1234])
1000000000000000000000000001234

Insisting that there ought to be one and only one way to sum up is, in 
my opinion, foolhardy, no matter how attractive it might seem. I believe 
that the right way is what Python already has: a simple sum() for simple 
cases, a few specialist sums (including ''.join) for the most common or 
important specialist cases, and leave the rest to third-party libraries 
or code you write yourself. That way the caller can then decide exactly 
what trade-offs between time, memory, convenience, accuracy and 
behaviour they wish, instead of invariably being surprised or 
disappointed by whatever trade-offs the one Über-sum() made.

-- 
Steven