[Python-Dev] sum(...) limitation
steve at pearwood.info
Sat Aug 9 07:08:45 CEST 2014
On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
> On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> > I don't use sum at all, or at least very rarely, and it still irritates me.
> You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but
> in Python it is 0 + a + b + c. If we had a "join" operator for strings
> that is different from + - then sure, I would not try to use sum to join
> strings, but we don't.
I've long believed that + is the wrong operator for concatenating
strings, and that & makes a much better operator. We wouldn't be having
these interminable arguments about using sum() to concatenate strings
(and lists, and tuples) if the & operator was used for concatenation and
+ was only used for numeric addition.
> I have always thought that sum(x) is just a
> shorthand for reduce(operator.add, x), but again it is not so in Python.
The signature of reduce is:
    reduce(function, sequence[, initial]) -> value

so sum() is (at least conceptually) a shorthand for reduce:

    def sum(values, initial=0):
        return reduce(operator.add, values, initial)

but that's an implementation detail, not a language promise, and sum()
is free to differ from that simple version. Indeed, even the public
interface is different, since sum() prohibits using a string as the
initial value and only promises to work with numbers. The fact that it
happens to work with lists and tuples is somewhat of an accident of
implementation.
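A small sketch of the behaviour described above, comparing sum() with the reduce() version. The sample values are made up for illustration, and the exact wording of the error message may differ between Python versions:

```python
from functools import reduce
import operator

values = [1, 2, 3]

# With the default start value, sum() computes 0 + 1 + 2 + 3.
assert sum(values) == reduce(operator.add, values, 0) == 6

# An explicit start value takes the place of the 0.
assert sum(values, 10) == reduce(operator.add, values, 10) == 16

# Lists happen to work when the start value is a list...
assert sum([[1], [2], [3]], []) == [1, 2, 3]

# ...but a string start value is rejected outright.
try:
    sum(["a", "b"], "")
except TypeError as err:
    rejected = True
    print("TypeError:", err)
else:
    rejected = False
```
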
> While "sum should only be used for numbers," it turns out it is not a
> good choice for floats - use math.fsum.
Correct. And if you (generic you, not you personally) do not understand
why simple-minded addition of floats is troublesome, then you're going
to have a world of trouble. Anyone who is disturbed by the question of
"should I use sum or math.fsum?" probably shouldn't be writing serious
floating point code at all. Floating point computations are hard, and
there is simply no escaping this fact.
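To make the difference concrete, here is a minimal sketch comparing plain left-to-right addition with math.fsum. The helper naive_sum is a hypothetical name used only for this illustration; the loop is spelled out because some newer CPython releases make the built-in sum() more accurate on floats, which would hide the effect:

```python
import math

def naive_sum(values):
    """Add floats left to right with no error compensation."""
    total = 0.0
    for x in values:
        total += x
    return total

tenths = [0.1] * 10
print(naive_sum(tenths))   # rounding error accumulates
print(math.fsum(tenths))   # fsum tracks the lost low-order bits

cancel = [1e100, 1.0, -1e100]
print(naive_sum(cancel))   # the 1.0 is absorbed by 1e100 and lost
print(math.fsum(cancel))
```
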
> While "strings are blocked because
> sum is slow," numpy arrays with millions of elements are not.
That's not a good example. Strings are potentially O(N**2), which means
not just "slow" but *agonisingly* slow, as in taking a week -- no
exaggeration -- to concat a million strings. If it takes a microsecond to
concat two strings, then 1e6**2 such concatenations could take over
eleven days. Slowness of such magnitude might as well be "the process
has locked up".
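The arithmetic can be checked with a rough sketch. It assumes the worst case, where every concatenation builds a brand new string (CPython can sometimes avoid this for in-place concatenation), and the per-operation costs are assumed figures, not measurements. The helper chars_copied is a hypothetical name:

```python
def chars_copied(n, piece_len=1):
    """Characters copied when n strings of piece_len characters are
    joined by repeated '+', assuming each step builds a new string."""
    # Step k produces a result of k * piece_len characters.
    return piece_len * n * (n + 1) // 2

n = 10**6
print(chars_copied(n))  # roughly n**2 / 2 character copies

# Time for n**2 elementary operations at two assumed costs per operation.
operations = n ** 2
for label, seconds_per_op in [("1 microsecond", 1e-6), ("1 nanosecond", 1e-9)]:
    total_seconds = operations * seconds_per_op
    print(label, "per op:", total_seconds / 86400, "days")
```

At one microsecond per operation the total is about eleven and a half days; at one nanosecond it is under seventeen minutes.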
In comparison, summing a numpy array with a million entries is not
really slow in that sense. The time taken is proportional to the number
of entries, and differs from summing a list only by a constant factor.
Besides, in the case of strings it is quite simple to decide "is the
initial value a string?", whereas with lists or numpy arrays it's quite
hard to decide "is the list or array so huge that the user will consider
this too slow?". What counts as "too slow" depends on the machine it is
running on, what other processes are running, and the user's mood, and
leads to the silly result that summing an array of N items succeeds but
N+1 items doesn't. So in the case of strings, it is easy to make a
blanket prohibition, but in the case of lists or arrays, there is no
reasonable place to draw the line.
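As an illustration of how cheap the string check is, here is a hypothetical sum-like function. The name checked_sum and the error message are assumptions for this sketch; it is not CPython's actual implementation:

```python
def checked_sum(values, start=0):
    """Hypothetical sum-like function, not CPython's implementation."""
    # Cheap, unambiguous test: is the start value a string?
    if isinstance(start, str):
        raise TypeError("use ''.join(seq) to concatenate strings")
    total = start
    for value in values:
        total = total + value
    return total

print(checked_sum([1, 2, 3]))
print(checked_sum([[1], [2]], []))
```

There is no comparably simple test for "this list or array is too big", since any size threshold would be arbitrary.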
> And try to
> explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.
I think that's because sum() has to box up each and every element in the
array into an object, which is wasteful, while abs() can delegate to a
specialist array.__abs__ method. Although that's not something beginners
should be expected to understand, no serious Python programmer should be
confused by this. As programmers, we should expect to have some
understanding of our tools, how they work, their limitations, and when
to use a different tool. That's why numpy has its own version of sum
which is designed to work specifically on numpy arrays. Use a specialist
tool for a specialist job:
py> with Stopwatch():
...     sum(carray)  # carray is a numpy array of 75000000 floats.
time taken: 52.659770 seconds
py> with Stopwatch():
...     numpy.sum(carray)
time taken: 0.161263 seconds
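The delegation point can be sketched with a toy container. ToyArray and its counters are hypothetical names invented for this illustration and say nothing about how numpy is actually implemented; the sketch only shows that the built-in sum() pulls elements out one at a time, while abs() and a specialist method are each a single call into the container:

```python
class ToyArray:
    """Hypothetical stand-in for an array type; not numpy's implementation."""

    def __init__(self, data):
        self.data = list(data)
        self.items_handed_out = 0   # elements passed to the caller one by one
        self.bulk_calls = 0         # whole-array operations

    def __iter__(self):
        for x in self.data:
            self.items_handed_out += 1
            yield x

    def __abs__(self):
        # abs(arr) lands here once; the loop stays inside the container.
        self.bulk_calls += 1
        return ToyArray(abs(x) for x in self.data)

    def sum(self):
        # Specialist method: one call, no per-element hand-off to the caller.
        self.bulk_calls += 1
        total = 0
        for x in self.data:
            total += x
        return total

arr = ToyArray([-1, 2, -3])
builtin_total = sum(arr)      # iterates element by element
magnitudes = abs(arr)         # a single delegated call
method_total = arr.sum()      # a single call to the specialist method
print(arr.items_handed_out, arr.bulk_calls)
```
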
> Why have builtin sum at all if its use comes with so many caveats?
Because sum() is a perfectly reasonable general purpose tool for adding
up small amounts of numbers where high floating point precision is not
required. It has been included as a built-in because Python comes with
"batteries included", and a basic function for adding up a few numbers
is an obvious, simple battery. But serious programmers should be
comfortable with the idea that you use the right tool for the right job.
If you visit a hardware store, you will find that even something as
simple as the hammer exists in many specialist varieties. There are tack
hammers, claw hammers, framing hammers, lump hammers, rubber and wooden
mallets, "brass" non-sparking hammers, carpet hammers, brick hammers,
ball-peen and cross-peen hammers, and even more specialist versions like
geologist's hammers. Bashing an object with something hard is remarkably
complicated, and there are literally dozens of types and sizes of "the
hammer". Why should it be a surprise that there are a handful of
different ways to sum items?
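For reference, a brief sketch of that handful, using only the standard library. The choice of itertools.chain for lists is one common option picked for this example, not something prescribed earlier in the thread:

```python
import math
from itertools import chain

int_total = sum([1, 2, 3])                        # small totals of numbers
float_total = math.fsum([0.1] * 10)               # floats where precision matters
joined = "".join(["a", "b", "c"])                 # strings
flat = list(chain.from_iterable([[1], [2, 3]]))   # lists

print(int_total, float_total, joined, flat)
```
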