On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman email@example.com wrote:
I don't use sum at all, or at least very rarely, and it still irritates me.
You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but in Python it is 0 + a + b + c. If we had a "join" operator for strings that is different from +, then sure, I would not try to use sum to join strings, but we don't.
I've long believed that + is the wrong operator for concatenating strings, and that & makes a much better operator. We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition.
I have always thought that sum(x) is just a shorthand for reduce(operator.add, x), but again it is not so in Python.
The signature of reduce is:
    reduce(...)
        reduce(function, sequence[, initial]) -> value
so sum() is (at least conceptually) a shorthand for reduce:
    def sum(values, initial=0):
        return reduce(operator.add, values, initial)
but that's an implementation detail, not a language promise, and sum() is free to differ from that simple version. Indeed, even the public interface is different, since sum() prohibits using a string as the initial value and only promises to work with numbers. The fact that it happens to work with lists and tuples is somewhat of an accident of implementation.
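To make the difference concrete, here is a small sketch comparing the two. It assumes Python 3, where reduce lives in functools, and the helper name reduce_add is made up for illustration:

```python
import operator
from functools import reduce

def reduce_add(values, initial=0):
    """The "conceptual" sum: fold operator.add over values, starting from initial."""
    return reduce(operator.add, values, initial)

# For numbers the two agree.
numbers = [1, 2, 3]
assert sum(numbers) == reduce_add(numbers) == 6

# Lists happen to work with both, given a list as the initial value.
lists = [[1], [2, 3], [4]]
assert sum(lists, []) == reduce_add(lists, []) == [1, 2, 3, 4]

# Strings are where the public interfaces differ.
words = ["a", "b", "c"]
assert reduce_add(words, "") == "abc"   # plain reduce does not object

try:
    sum(words, "")
except TypeError:
    string_start_rejected = True        # sum() refuses a string initial value
else:
    string_start_rejected = False

print(string_start_rejected)
```

So the two agree on numbers and (by accident) on lists, and part company on strings.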
While "sum should only be used for numbers," it turns out it is not a good choice for floats - use math.fsum.
Correct. And if you (generic you, not you personally) do not understand why simple-minded addition of floats is troublesome, then you're going to have a world of trouble. Anyone who is disturbed by the question of "should I use sum or math.fsum?" probably shouldn't be writing serious floating point code at all. Floating point computations are hard, and there is simply no escaping this fact.
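A short illustration of what goes wrong, as a sketch only. The helper naive_sum is a made-up name for plain left-to-right addition; it is written out by hand because, as far as I know, newer CPython releases (3.12 onwards) make the builtin sum() compensate for rounding error when adding floats, which would hide the effect:

```python
import math

def naive_sum(values):
    """Simple-minded left-to-right addition, with no error compensation."""
    total = 0.0
    for x in values:
        total += x
    return total

# Rounding error accumulates a little with every addition.
tenths = [0.1] * 10
print(naive_sum(tenths))    # slightly less than 1.0
print(math.fsum(tenths))    # 1.0

# Cancellation is worse: the small term is lost completely.
big_and_small = [1e100, 1.0, -1e100]
print(naive_sum(big_and_small))   # 0.0
print(math.fsum(big_and_small))   # 1.0
```

math.fsum tracks the lost low-order bits as it goes, which is why it recovers the exact answer in both cases.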
While "strings are blocked because sum is slow," numpy arrays with millions of elements are not.
That's not a good example. Strings are potentially O(N**2), which means not just "slow" but *agonisingly* slow, as in taking a week -- no exaggeration -- to concat a million strings. If it takes a microsecond to concat two strings, then 1e6**2 such concatenations could take over eleven days. Slowness of such magnitude might as well be "the process has locked up".
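The quadratic growth can be seen by counting characters copied rather than by timing anything. This is a rough model, not a measurement: it assumes every + builds a brand-new string (ignoring any in-place tricks an interpreter may play), and that join copies each character once. The function names are hypothetical, and the one-microsecond cost per step is an assumed figure for the back-of-envelope estimate:

```python
def chars_copied_by_repeated_add(n, size=1):
    """Characters copied when n strings of `size` chars are added one at a time."""
    copied = 0
    length = 0
    for _ in range(n):
        length += size      # the new result holds everything so far plus one piece
        copied += length    # and building it copies every one of those characters
    return copied           # equals size * n * (n + 1) / 2

def chars_copied_by_join(n, size=1):
    """Characters copied when the result is built in a single pass."""
    return n * size

for n in (10, 1000, 100000):
    print(n, chars_copied_by_repeated_add(n), chars_copied_by_join(n))

# Back-of-envelope timing: 1e6**2 steps at an assumed one microsecond each.
seconds = 1e6 ** 2 * 1e-6
days = seconds / 86400
print(round(days, 1))
```

Multiplying the input by ten multiplies the join column by ten and the repeated-add column by roughly a hundred.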
In comparison, summing a numpy array with a million entries is not really slow in that sense. The time taken is proportional to the number of entries, and differs from summing a list only by a constant factor.
Besides, in the case of strings it is quite simple to decide "is the initial value a string?", whereas with lists or numpy arrays it's quite hard to decide "is the list or array so huge that the user will consider this too slow?". What counts as "too slow" depends on the machine it is running on, what other processes are running, and the user's mood, and leads to the silly result that summing an array of N items succeeds but N+1 items doesn't. So in the case of strings, it is easy to make a blanket prohibition, but in the case of lists or arrays, there is no reasonable place to draw the line.
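To show how cheap the string test is, here is a hypothetical pure-Python version. checked_sum is an invented name and this is not how the builtin is actually written, just a sketch of where the line gets drawn:

```python
def checked_sum(values, start=0):
    """Hypothetical pure-Python sum() with the blanket string prohibition."""
    if isinstance(start, str):
        # Cheap to test, and independent of machine, load, or input size.
        raise TypeError("can't sum strings; use ''.join(seq) instead")
    total = start
    for value in values:
        total = total + value
    return total

print(checked_sum([1, 2, 3]))           # numbers: fine
print(checked_sum([[1], [2, 3]], []))   # lists: allowed, however long they are
```

There is no comparably simple test for "this list is too big", which is the point.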
And try to explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.
I think that's because sum() has to box up each and every element in the array into an object, which is wasteful, while abs() can delegate to a specialist array.__abs__ method. Although that's not something beginners should be expected to understand, no serious Python programmer should be confused by this. As programmers, we should expect to have some understanding of our tools, how they work, their limitations, and when to use a different tool. That's why numpy has its own version of sum which is designed to work specifically on numpy arrays. Use a specialist tool for a specialist job:
    py> with Stopwatch():
    ...     sum(carray)  # carray is a numpy array of 75000000 floats.
    ...
    112500000.0
    time taken: 52.659770 seconds
    py> with Stopwatch():
    ...     numpy.sum(carray)
    ...
    112500000.0
    time taken: 0.161263 seconds
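For anyone who wants to repeat the timing: Stopwatch is not part of the standard library. A minimal stand-in, assuming it is nothing more than a context manager that prints the elapsed wall-clock time, might look like the sketch below. It times the builtin on a plain list so that it runs without numpy installed:

```python
import time

class Stopwatch:
    """Hypothetical stand-in for the Stopwatch used above."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print("time taken: %f seconds" % self.elapsed)
        return False  # never swallow exceptions raised inside the block

data = [1.5] * 100000   # a plain list, so the sketch needs only the stdlib

with Stopwatch() as sw:
    total = sum(data)

print(total)
```

With numpy available, the same pattern wrapped around sum(carray) and numpy.sum(carray) gives the comparison shown above. The absolute numbers will of course depend on the machine.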
Why have builtin sum at all if its use comes with so many caveats?
Because sum() is a perfectly reasonable general purpose tool for adding up small amounts of numbers where high floating point precision is not required. It has been included as a built-in because Python comes with "batteries included", and a basic function for adding up a few numbers is an obvious, simple battery. But serious programmers should be comfortable with the idea that you use the right tool for the right job.
If you visit a hardware store, you will find that even something as simple as the hammer exists in many specialist varieties. There are tack hammers, claw hammers, framing hammers, lump hammers, rubber and wooden mallets, "brass" non-sparking hammers, carpet hammers, brick hammers, ball-peen and cross-peen hammers, and even more specialist versions like geologist's hammers. Bashing an object with something hard is remarkably complicated, and there are literally dozens of types and sizes of "the hammer". Why should it be a surprise that there are a handful of different ways to sum items?