[Python-ideas] sum(...) limitation
Steven D'Aprano
steve at pearwood.info
Wed Aug 13 18:37:30 CEST 2014
I'm going to cut straight to the chase here because this thread, and its
related ones, on Python-Dev are giving me a headache and overloading my
inbox. So I'm going to make a probably futile :-( attempt to cut off
yet another huge thread before it starts by explaining why I think sum()
ought to stay exactly as it is.
Built-in sum() is already quite complex. It has a fast path and a slow
path, and that's just for numbers. While it's tempting to imagine sum()
being even more clever and able to handle more cases, that increases the
complexity and makes it more likely to end up slower rather than faster,
or buggy, or both. Better to let the caller choose a specialist function
(like numpy.array.sum, or math.fsum, or str.join) that handles the
caller's specialist needs, than to try to make sum() master of
everything. The more special cases sum() has, the more the pressure to
add even more.
In the statistics module, I have a private _sum() function which tried
really hard to deal with high-accuracy sums of mixed arbitrary numeric
types without compromising too badly on speed, and it's much harder than
it seems. Trying to handle non-numeric types too increases the
complexity significantly. If you're smarter than me (I expect that many
of you probably are) and believe that you can write a version of sum()
which:
(1) is fast
(2) has better than O(N**2) performance
(3) is correct
(4) is accurate
(5) handles INFs and NANs (for those types which have them)
(6) handles mixed types (for those types which allow mixing)
(7) honours subclasses with custom __add__ and __radd__ methods
(8) and keeps the expected semantics that sum() is like repeated
addition (or concatenation)
and does so for *both* numeric and non-numeric cases (like strings,
bytes, tuples, lists), then PLEASE write some code and publish it. I for
one would love to see it or use it for the statistics module. But until
you have tried writing such a thing, whether in C or Python, I think
you're probably underestimating how hard it is and how fragile the
result will be.
So, a plea: please stop trying to overloaded poor ol' built-in sum.
sum() is *not* the One Obvious Way to add arbitrary objects in every
domain. sum() is intended for simple cases of adding numbers, it is not
intended as a specialist summation function for everything under the sun
that can be added or concatenated.
A bit of history, as I remember it: sum() exists because for half of
Python's lifetime, people were regularly defining this:
def sum(numbers):
return reduce(lambda a, b: a+b, numbers)
so they could easy add up a bunch of numbers:
num_pages = sum([ch.pages() for ch in self.chapters])
sort of thing. Since this was a common need, it was decided to add it to
the built-ins. But sum() was never intended to replace str.join or
list.extend, let alone even more exotic cases.
Built-in sum is aimed at sequences of numbers, not strings, lists,
tuples, or Widgets for that matter. Perhaps giving it a start parameter
was a mistake, but it is there and backwards compatibility says it isn't
going to be removed. But that doesn't mean that the use of sum() on
arbitrary types ought to be *encouraged*, even if it is *allowed*.
Conceptually, sum() is intended to behave like:
for value in sequence:
start = start + value
That means calling custom __add__ or __radd__ methods if they exist. It
also means that sum() cannot delegate to (say) str.join() without
changing the semantics. Given:
class Special(str):
def __radd__(self, other):
print("I'm special!")
return other + str(self)
s = Special('y')
the sum 'x' + s is *not* the same as ''.join(['x', s]). A similar
argument applies to list.extend, and the source code in bltinmodule.c
already makes that point.
Replying to a couple of points from Stephen:
On Wed, Aug 13, 2014 at 03:21:42PM +0900, Stephen J. Turnbull wrote:
> > sum() can be used for any type that has an __add__ defined.
>
> I'd like to see that be mutable types with __iadd__.
Surely you don't mean that. That would mean that sum([1, 2, 3]) would no
longer work, since ints are not mutable types with __iadd__.
[...]
> Summing tuples works (with appropriate start=tuple()). Haven't
> benchmarked, but I bet that's O(N^2).
Correct: increasing the number of tuples being added by a factor of 10
requires almost a factor of 100 more time:
py> t = tuple([(i,) for i in range(1000)])
py> with Stopwatch():
... _ = sum(t, ())
...
time taken: 0.003805 seconds
py> t *= 10
py> with Stopwatch():
... _ = sum(t, ())
...
time taken: 0.230225 seconds
py> t *= 10
py> with Stopwatch():
... _ = sum(t, ())
...
time taken: 32.206471 seconds
> My argument is that in practical use sum() is a bad idea, period,
> until you book up on the types and applications where it *does* work.
> N.B. It doesn't even work properly for numbers (inaccurate for floats).
sum() works fine for its intended uses, especially:
- summing ints exactly
- "low precision" sums of floats
I put "low precision" in scare quotes because, for many purposes, that
precision is plenty high enough. For the use-case of adding together a
dozen or a hundred positive floats of similar magnitude, sum() is fine.
It's only advanced and high-precision uses where it falls short.
To put it another way: if you want to add the mass of Jupiter to the
mass of a flea, you probably want math.fsum(). If you want to add the
weight of an apple to the weight of a banana, both measured on a typical
kitchen scale, sum() will do the job perfectly adequately.
> > while we are at it, having the default sum() for floats be fsum()
> > would be nice
>
> How do you propose to implement that, given math.fsum is perfectly
> happy to sum integers?
And you really don't want sum(integers) to convert to float by default:
py> from math import fsum
py> fsum([10**30, 1234])
1e+30
py> sum([10**30, 1234])
1000000000000000000000000001234
Insisting that there ought to be one and only one way to sum up is, in
my opinion, foolhardy, no matter how attractive it might seem. I believe
that the right way is what Python already has: a simple sum() for simple
cases, a few specialist sums (including ''.join) for the most common or
important specialist cases, and leave the rest to third-party libraries
or code you write yourself. That way the caller can then decide exactly
what trade-offs between time, memory, convenience, accuracy and
behaviour they wish, instead of invariably being surprised or
disappointed by whatever trade-offs the one Über-sum() made.
--
Steven
More information about the Python-ideas
mailing list