Re: [Python-ideas] [Python-Dev] sum(...) limitation
Redirecting to python-ideas, so trimming less than I might. Chris Barker writes:
On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm referring to removing the unnecessary information that there's a better way to do it, and simply raising an error (as in Python 3.2, say) which is all a RealProgrammer[tm] should ever need!
I can't imagine anyone is suggesting that -- disallow it, but don't tell anyone why?
As I said, it's a regression. That's exactly the behavior in Python 3.2.
The only thing that is remotely on the table here is:
1) remove the special case for strings -- buyer beware -- but consistent and less "ugly"
It's only consistent if you believe that Python has strict rules for use of various operators. It doesn't, except as far as they are constrained by precedence. For example, I have an application where I add bytestrings bytewise modulo N <= 256, and concatenate them. In fact I use function call syntax, but the obvious operator syntax is '+' for the bytewise addition, and '*' for the concatenation. It's not in the Zen, but I believe in the maxim "If it's worth doing, it's worth doing well." So for me, 1) is out anyway.
2) add a special case for strings that is fast and efficient -- may be as simple as calling "".join() under the hood --no more code than the exception check.
Sure, but what about all the other immutable containers with __add__ methods? What about mappings with key-wise __add__ methods whose values might be immutable but have __add__ methods? Where do you stop with the special-casing? I consider this far more complex and ugly than the simple "sum() is for numbers" rule (and even that is way too complex considering accuracy of summing floats).
And I doubt anyone really is pushing for anything but (2)
I know that, but I think it's the wrong solution to the problem (which is genuine IMO). The right solution is something generic, possibly a __sum__ method. The question is whether that leads to too much work to be worth it (eg, "homogeneous_iterable").
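To make the discussion concrete: Python has no __sum__ protocol, so the following is purely a sketch of what such a hook might look like. The name generic_sum, the lookup rule, and the FastStr example are all invented for illustration.

```python
# Illustrative only: Python has no __sum__ protocol. The hook name,
# its signature, and the dispatch rule below are assumptions.
def generic_sum(iterable, start=0):
    hook = getattr(type(start), "__sum__", None)
    if hook is not None:
        # A type that knows how to sum itself efficiently takes over.
        return hook(start, iterable)
    # Fallback: the repeated-addition semantics sum() has today.
    total = start
    for item in iterable:
        total = total + item
    return total

class FastStr(str):
    def __sum__(self, iterable):
        # O(N) concatenation instead of O(N**2) repeated __add__.
        return self + "".join(iterable)

print(generic_sum([1, 2, 3]))                     # 6
print(generic_sum(["a", "b", "c"], FastStr("")))  # abc
```

The open question is exactly the one raised above: where such a hook would live, and how much machinery (ABCs, registration) it would drag in.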
Because obviously we'd want the attractive nuisance of "if you have __add__, there's a default definition of __sum__"
now I'm confused -- isn't that exactly what we have now?
Yes and my feeling (backed up by arguments that I admit may persuade nobody but myself) is that what we have now kinda sucks[tm]. It seemed like a good idea when I first saw it, but then, my apps don't scale to where the pain starts in my own usage.
It's possible that Python could provide some kind of feature that would allow an optimized sum function for every type that has __add__, but I think this will take a lot of thinking.
does it need to be every type? As it is the common ones work fine already except for strings -- so if we add an optimized string sum() then we're done.
I didn't say provide an optimized sum(), I said provide a feature enabling people who want to optimize sum() to do so. So yes, it needs to be every type (the optional __sum__ method is a proof of concept, modulo it actually being implementable ;-).
*Somebody* will do it (I don't think anybody is +1 on restricting sum() to a subset of types with __add__).
uhm, that's exactly what we have now
Exactly. Who's arguing that the sum() we have now is a ticket to Paradise? I'm just saying that there's probably somebody out there negative enough on the current situation to come up with an answer that I think is general enough (and I suspect that python-dev consensus is that demanding, too).
sum() can be used for any type that has an __add__ defined.
I'd like to see that be mutable types with __iadd__.
What I fail to see is why it's better to raise an exception and point users to a better way, than to simply provide an optimization so that it's a moot issue.
Because inefficient sum() is an attractive nuisance, easy to overlook, and likely to bite users other than the author.
The only justification offered here is that it will teach people that summing strings (and some other objects?)
Summing tuples works (with appropriate start=tuple()). Haven't benchmarked, but I bet that's O(N^2).
is order(N^2) and a bad idea. But:
a) Python's primary purpose is practical, not pedagogical (not that it isn't great for that)
My argument is that in practical use sum() is a bad idea, period, until you book up on the types and applications where it *does* work. N.B. It doesn't even work properly for numbers (inaccurate for floats).
b) I doubt any naive users learn anything other than "I can't use sum() for strings, I should use "".join()".
For people who think that special-casing strings is a good idea, I think this is about as much benefit as you can expect. Why go farther?<0.5 wink/>
I submit that no naive user is going to get any closer to a proper understanding of algorithmic Order behavior from this small hint. Which leaves no reason to prefer an Exception to an optimization.
TOOWTDI. str.join is in pretty much every code base by now, and tutorials and FAQs recommending its use and severely deprecating sum for strings are legion.
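For reference, the recommended idiom and the error it replaces, using only the standard library:

```python
words = ["spam", "and", "eggs"]

# The One Obvious Way: str.join does a single pass, O(N) overall.
print(" ".join(words))  # spam and eggs

# The special case under discussion: sum() refuses strings outright.
try:
    sum(words, "")
except TypeError as exc:
    print(exc)  # sum() can't sum strings [use ''.join(seq) instead]
```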
One other point: perhaps this will lead a naive user into thinking -- "sum() raises an exception if I try to use it inefficiently, so it must be OK to use for anything that doesn't raise an exception" -- that would be a bad lesson to mis-learn....
That assumes they know about the start argument. I think most naive users will just try to sum a bunch of tuples, and get the "can't add 0, tuple" Exception and write a loop. I suspect that many of the users who get the "use str.join" warning along with the Exception are unaware of the start argument, too. They expect sum(iter_of_str) to magically add the strings. Ie, when in 3.2 they got the uninformative "can't add 0, str" message, they did not immediately go "d'oh" and insert ", start=''" in the call to sum, they wrote a loop.
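Both behaviours described here are easy to reproduce:

```python
pairs = [(1, 2), (3, 4), (5, 6)]

# Without start, sum() tries 0 + (1, 2) and fails with the terse message:
try:
    sum(pairs)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for +: 'int' and 'tuple'

# With start=() it works, though each + builds a whole new tuple,
# which is what makes the overall cost O(N**2).
print(sum(pairs, ()))  # (1, 2, 3, 4, 5, 6)
```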
while we are at it, having the default sum() for floats be fsum() would be nice
How do you propose to implement that, given math.fsum is perfectly happy to sum integers? You can't just check one or a few leading elements for floatiness. I think you have to dispatch on type(start), but then sum(iter_of_floats) DTWT. So I would suggest changing the signature to sum(it, start=0.0). This would probably be acceptable to most users with iterables of ints, but does imply some performance hit.
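A sketch of the dispatch-on-type(start) idea suggested above; fsum_aware_sum is a hypothetical name used for illustration, not the actual builtin, and the float default is the proposed signature change, not current behaviour:

```python
import math

def fsum_aware_sum(iterable, start=0.0):
    # Dispatch on the type of start, as suggested: a float start
    # routes to math.fsum (which happily consumes ints too).
    if isinstance(start, float):
        return start + math.fsum(iterable)
    # Otherwise fall back to plain repeated addition.
    total = start
    for item in iterable:
        total = total + item
    return total

print(fsum_aware_sum([0.1] * 10))         # 1.0 exactly, via fsum
print(fsum_aware_sum([10**30, 1234], 0))  # exact integer arithmetic kept
```

Note the implied trade-off: with the float default, callers summing ints get a float back unless they pass an int start, which is the performance and accuracy hit mentioned above.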
This does turn sum() into a function that does type-based dispatch, but isn't Python full of those already? Do something special for the types you know about, and call the generic dunder method for the rest.
AFAIK Python is moving in the opposite direction: if there's a common need for dispatching to type-specific implementations of a method, define a standard (not "generic") dunder for the purpose, and have the builtin (or operator, or whatever) look up (not "call") the appropriate instance in the usual way, then call it. If there's a useful generic implementation, define an ABC to inherit from that provides that generic implementation.
I'm going to cut straight to the chase here because this thread, and its related ones, on Python-Dev are giving me a headache and overloading my inbox. So I'm going to make a probably futile :-( attempt to cut off yet another huge thread before it starts by explaining why I think sum() ought to stay exactly as it is.

Built-in sum() is already quite complex. It has a fast path and a slow path, and that's just for numbers. While it's tempting to imagine sum() being even more clever and able to handle more cases, that increases the complexity and makes it more likely to end up slower rather than faster, or buggy, or both. Better to let the caller choose a specialist function (like numpy.array.sum, or math.fsum, or str.join) that handles the caller's specialist needs, than to try to make sum() master of everything. The more special cases sum() has, the more the pressure to add even more.

In the statistics module, I have a private _sum() function which tried really hard to deal with high-accuracy sums of mixed arbitrary numeric types without compromising too badly on speed, and it's much harder than it seems. Trying to handle non-numeric types too increases the complexity significantly.

If you're smarter than me (I expect that many of you probably are) and believe that you can write a version of sum() which:

(1) is fast
(2) has better than O(N**2) performance
(3) is correct
(4) is accurate
(5) handles INFs and NANs (for those types which have them)
(6) handles mixed types (for those types which allow mixing)
(7) honours subclasses with custom __add__ and __radd__ methods
(8) and keeps the expected semantics that sum() is like repeated addition (or concatenation)

and does so for *both* numeric and non-numeric cases (like strings, bytes, tuples, lists), then PLEASE write some code and publish it. I for one would love to see it or use it for the statistics module.
But until you have tried writing such a thing, whether in C or Python, I think you're probably underestimating how hard it is and how fragile the result will be.

So, a plea: please stop trying to overload poor ol' built-in sum. sum() is *not* the One Obvious Way to add arbitrary objects in every domain. sum() is intended for simple cases of adding numbers; it is not intended as a specialist summation function for everything under the sun that can be added or concatenated.

A bit of history, as I remember it: sum() exists because for half of Python's lifetime, people were regularly defining this:

    def sum(numbers):
        return reduce(lambda a, b: a+b, numbers)

so they could easily add up a bunch of numbers:

    num_pages = sum([ch.pages() for ch in self.chapters])

sort of thing. Since this was a common need, it was decided to add it to the built-ins. But sum() was never intended to replace str.join or list.extend, let alone even more exotic cases. Built-in sum is aimed at sequences of numbers, not strings, lists, tuples, or Widgets for that matter. Perhaps giving it a start parameter was a mistake, but it is there and backwards compatibility says it isn't going to be removed. But that doesn't mean that the use of sum() on arbitrary types ought to be *encouraged*, even if it is *allowed*.

Conceptually, sum() is intended to behave like:

    for value in sequence:
        start = start + value

That means calling custom __add__ or __radd__ methods if they exist. It also means that sum() cannot delegate to (say) str.join() without changing the semantics. Given:

    class Special(str):
        def __radd__(self, other):
            print("I'm special!")
            return other + str(self)

    s = Special('y')

the sum 'x' + s is *not* the same as ''.join(['x', s]). A similar argument applies to list.extend, and the source code in bltinmodule.c already makes that point.

Replying to a couple of points from Stephen:

On Wed, Aug 13, 2014 at 03:21:42PM +0900, Stephen J. Turnbull wrote:
sum() can be used for any type that has an __add__ defined.
I'd like to see that be mutable types with __iadd__.
Surely you don't mean that. That would mean that sum([1, 2, 3]) would no longer work, since ints are not mutable types with __iadd__. [...]
Summing tuples works (with appropriate start=tuple()). Haven't benchmarked, but I bet that's O(N^2).
Correct: increasing the number of tuples being added by a factor of 10 requires almost a factor of 100 more time:

py> t = tuple([(i,) for i in range(1000)])
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 0.003805 seconds
py> t *= 10
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 0.230225 seconds
py> t *= 10
py> with Stopwatch():
...     _ = sum(t, ())
...
time taken: 32.206471 seconds
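For contrast, the linear-time way to flatten the same input (plain itertools, nothing hypothetical):

```python
from itertools import chain

t = tuple((i,) for i in range(10000))

# One pass over all the elements: O(N) instead of O(N**2), because no
# intermediate tuples are built and thrown away at each step.
flat = tuple(chain.from_iterable(t))
print(len(flat), flat[:3])  # 10000 (0, 1, 2)
```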
My argument is that in practical use sum() is a bad idea, period, until you book up on the types and applications where it *does* work. N.B. It doesn't even work properly for numbers (inaccurate for floats).
sum() works fine for its intended uses, especially:

- summing ints exactly
- "low precision" sums of floats

I put "low precision" in scare quotes because, for many purposes, that precision is plenty high enough. For the use-case of adding together a dozen or a hundred positive floats of similar magnitude, sum() is fine. It's only advanced and high-precision uses where it falls short.

To put it another way: if you want to add the mass of Jupiter to the mass of a flea, you probably want math.fsum(). If you want to add the weight of an apple to the weight of a banana, both measured on a typical kitchen scale, sum() will do the job perfectly adequately.
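Jupiter and the flea, made concrete (the masses below are rough figures used only for illustration):

```python
import math

jupiter = 1.898e27  # kg, approximate
flea = 2.0e-7       # kg, approximate

# Plain sum() silently loses the flea: 2e-7 is far below one ulp of 1.9e27.
print(sum([jupiter, flea]) == jupiter)  # True -- the flea vanished

# Where the accuracy difference really shows is accumulated rounding:
print(sum([0.1] * 10))        # 0.9999999999999999
print(math.fsum([0.1] * 10))  # 1.0
```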
while we are at it, having the default sum() for floats be fsum() would be nice
How do you propose to implement that, given math.fsum is perfectly happy to sum integers?
And you really don't want sum(integers) to convert to float by default:

py> from math import fsum
py> fsum([10**30, 1234])
1e+30
py> sum([10**30, 1234])
1000000000000000000000000001234

Insisting that there ought to be one and only one way to sum up is, in my opinion, foolhardy, no matter how attractive it might seem. I believe that the right way is what Python already has: a simple sum() for simple cases, a few specialist sums (including ''.join) for the most common or important specialist cases, and leave the rest to third-party libraries or code you write yourself. That way the caller can then decide exactly what trade-offs between time, memory, convenience, accuracy and behaviour they wish, instead of invariably being surprised or disappointed by whatever trade-offs the one Über-sum() made.

-- Steven
On Wed, Aug 13, 2014 at 9:37 AM, Steven D'Aprano <steve@pearwood.info> wrote:
In the statistics module, I have a private _sum() function which tried really hard to deal with high-accuracy sums of mixed arbitrary numeric types without compromising too badly on speed, and it's much harder than it seems. Trying to handle non-numeric types too increases the complexity significantly. If you're smarter than me (I expect that many of you probably are) and believe that you can write a version of sum() which:
(1) is fast
(2) has better than O(N**2) performance
(3) is correct
(4) is accurate
(5) handles INFs and NANs (for those types which have them)
(6) handles mixed types (for those types which allow mixing)
(7) honours subclasses with custom __add__ and __radd__ methods
(8) and keeps the expected semantics that sum() is like repeated addition (or concatenation)
and does so for *both* numeric and non-numeric cases (like strings, bytes, tuples, lists), then PLEASE write some code and publish it. I for one would love to see it or use it for the statistics module. But until you have tried writing such a thing, whether in C or Python, I think you're probably underestimating how hard it is and how fragile the result will be.
The basic form I would use for an updated sum, which would often, but not always, perform better would be something like (preferably written in C, not pure Python):

    def sum(items, start=0):
        value = copy.copy(start)  # Make sure that start is not mutated.
        for item in items:
            value += item
        return value

To avoid needing to access copy.copy, something like the following would also work, but with a bit more local complexity:

    def sum(items, start=0):
        count = len(items)
        if count > 0:
            value = start + items[0]  # Make sure not to mutate start.
        else:
            return start
        for i in range(1, count):
            value += items[i]
        return value

Either of these would likely perform better when summing items which support an optimized __iadd__, which allows many types to perform better. For those which do not support __iadd__, it would fall back to the slower __add__ but still function.

Note that the first version may be slower than the existing sum in some cases, namely if copy.copy(start) is very slow for the type. The second case should never be slower than the existing sum, presuming __iadd__ is not implemented in a slower manner than __add__.

Due to the string concatenation optimization in CPython, this should make strings sum better than O(N**2), many custom types will also sum faster, and, for all well-behaved types, it should produce the same result as the existing sum.

Chris
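The copy.copy() in the first version is doing real work: with a mutable start value and no copy, += would mutate the caller's object. A small demonstration (iadd_sum is just a local stand-in name for the first sketch above):

```python
import copy

def iadd_sum(items, start=0):
    value = copy.copy(start)  # protect the caller's start from +=
    for item in items:
        value += item
    return value

start = [0]
print(iadd_sum([[1], [2]], start))  # [0, 1, 2]
print(start)                        # [0] -- unchanged, thanks to the copy

# Without the copy, list.__iadd__ extends the caller's list in place:
alias = start
for item in [[1], [2]]:
    alias += item
print(start)                        # [0, 1, 2] -- caller's list mutated
```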
On Wed, Aug 13, 2014 at 09:59:52AM -0700, Chris Kaynor wrote:
The basic form I would use for an updated sum, which would often, but not always, perform better would be something like (preferably, written in C, not pure Python):
def sum(items, start=0): value = copy.copy(start) # Make sure that start is not mutated. for item in items: value += item return value
That's a semantic change from the existing sum: start does not have to be copyable. Currently I can write this:

py> class A:
...     def __copy__(self):
...         raise TypeError("I am unique!")
...
py> a = A()
py> class B:
...     def __radd__(self, other):
...         return 23
...
py> b = B()
py> sum([b, 1000, 2000], a)
3023

By changing the semantics of sum() to require start to be copyable, your revision has just broken my application.
To avoid needing to access copy.copy, something like the following would also work, but with a bit more local complexity:
    def sum(items, start=0):
        count = len(items)
That's another semantic change: sum() currently accepts iterators (although not *infinite* iterators). This revision has now broken tens of thousands of applications. -- Steven
On Wed, Aug 13, 2014 at 10:46 AM, Steven D'Aprano <steve@pearwood.info> wrote:
That's another semantic change: sum() currently accepts iterators (although not *infinite* iterators). This revision has now broken tens of thousands of applications.
okay, how about:

    def sum(items, start=0):
        first = True
        for item in items:
            if first:
                start = start + item
                first = False
            else:
                start += item
        return start

I believe that has the same effect as my other two, but only adds the requirement that "value += item" behaves the same as "value = value + item".

Chris
This isn't the first time this discussion of (abusing) 'sum()' has come around. Actually, I discussed the last round in my keynote at PyCon UK in 2013. In particular, I discussed there how Chris' latest version of 'sum()' breaks *existing* and *widespread* code. My slides are at: http://gnosis.cx/pycon-uk-2013/Keynote-Ideas.pdf

Look at slides 24-26. The TL;DR version: numpy arrays give different meanings to __add__() and __iadd__(). Chris' version breaks every program ever written that uses numpy.

On Wed, Aug 13, 2014 at 11:05 AM, Chris Kaynor <ckaynor@zindagigames.com> wrote:
On Wed, Aug 13, 2014 at 10:46 AM, Steven D'Aprano <steve@pearwood.info> wrote:
That's another semantic change: sum() currently accepts iterators (although not *infinite* iterators). This revision has now broken tens of thousands of applications.
okay, how about:
    def sum(items, start=0):
        first = True
        for item in items:
            if first:
                start = start + item
                first = False
            else:
                start += item
        return start
I believe that has the same effect as my other two, but only adds the requirement that "value += item" behaves the same as "value = value + item".
Chris
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
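The slides above make the point with numpy arrays; the same __add__/__iadd__ divergence can be observed with nothing but builtin lists (a minimal stand-in, not the numpy case itself):

```python
a = [1, 2]
b = a  # b and a are the same list object

# __add__ builds a brand-new list; the original is untouched.
c = b + [3]
print(a)       # [1, 2]

# __iadd__ mutates in place; every alias to the object sees the change.
b += [3]
print(a)       # [1, 2, 3]
print(c is b)  # False -- two distinct objects with equal contents
```

A sum() that switches from + to += therefore changes observable behaviour for any type where the two differ, numpy arrays included.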
Le 13/08/2014 12:37, Steven D'Aprano a écrit :
sum() is *not* the One Obvious Way to add arbitrary objects in every domain.
Well, it is. It's a builtin and it's called "sum", which makes it pretty clear it adds objects together. That could hardly be any more obvious, actually: all competitors are *less* obvious.

Indeed it cannot *practically* be the most efficient or accurate way to sum arbitrary objects, unless we manage to devise a clever protocol (__sum__? how would that work?) that can delegate to arbitrary third-party code. But people are bound to think, legitimately, that built-in sum() is the logical answer when wanting to sum a sequence of objects.

(Of course, whether or not "sum" is a reasonable notion when applied to strings, rather than lists or floats, is a separate question.)

Regards,

Antoine.
On Wed, Aug 13, 2014 at 2:03 PM, Antoine Pitrou <antoine@python.org> wrote:
Le 13/08/2014 12:37, Steven D'Aprano a écrit :
sum() is *not* the One Obvious Way to add arbitrary objects in every domain.
Well, it is. It's a builtin and it's called "sum", which makes it pretty clear it adds objects together. That could hardly be any more obvious, actually: all competitors are *less* obvious.
When it comes to "adding" sequences or containers, "concat" would be a strong competitor. After all, this is exactly what + does on sequences under the hood. [1] While there are technical reasons for reusing the same infix operator for adding and concatenation, we are not so restricted when it comes to builtins. [1] https://docs.python.org/2/c-api/sequence.html#PySequence_Concat
On Aug 13, 2014, at 11:03, Antoine Pitrou <antoine@python.org> wrote:
Well, it is. It's a builtin and it's called "sum", which makes it pretty clear it adds objects together. That could hardly be any more obvious, actually: all competitors are *less* obvious.
Have you ever heard of anyone wanting to "sum" a bunch of pieces of text? Or a bunch of lists?

Actually, now that I think about it, although I was being somewhat flippant, I've heard both of those things plenty of times, and never once has it meant concatenation. It always means literal or figurative element-wise summing, or some other kind of aggregation. Which makes it even _less_ obvious as a one true way to concatenate a bunch of things.

Also, how often do you really want to concatenate a bunch of lists into a bigger list, as opposed to just chaining them together into a new iterable? It's not _never_, but I think it's less common. So if itertools.chain.from_iterable doesn't deserve to be a builtin (or even a top-level module function), is concat, a modified sum, or some other way to spell the same idea really necessary?
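The chaining idiom referred to here, for concreteness:

```python
from itertools import chain, islice

lists = [[1, 2], [3], [4, 5]]

# Lazy: nothing is materialised until you iterate.
chained = chain.from_iterable(lists)
print(list(islice(chained, 3)))  # [1, 2, 3]

# And when a real concatenated list is wanted:
print(list(chain.from_iterable(lists)))  # [1, 2, 3, 4, 5]
```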
Have you ever heard someone wanting to "add" a bunch of pieces of text? Outside of programming, of course not. In our domain, adding strings makes perfect sense (it's concatenation), thus the idea of summing strings makes perfect sense, given that summation is defined as repeated addition. The same argument applies for lists.
On 14 August 2014 02:37, Steven D'Aprano <steve@pearwood.info> wrote:
A bit of history, as I remember it: sum() exists because for half of Python's lifetime, people were regularly defining this:
def sum(numbers): return reduce(lambda a, b: a+b, numbers)
so they could easy add up a bunch of numbers:
num_pages = sum([ch.pages() for ch in self.chapters])
sort of thing. Since this was a common need, it was decided to add it to the built-ins. But sum() was never intended to replace str.join or list.extend, let alone even more exotic cases.
Built-in sum is aimed at sequences of numbers, not strings, lists, tuples, or Widgets for that matter. Perhaps giving it a start parameter was a mistake, but it is there and backwards compatibility says it isn't going to be removed. But that doesn't mean that the use of sum() on arbitrary types ought to be *encouraged*, even if it is *allowed*.
Interesting point of history: adding the sum() builtin is when Alex Martelli was given commit privileges (see http://bugs.python.org/issue724936 - found by putting "builtin sum" into the search field on bugs.python.org).

Reviewing that issue: in one of the draft patches, all sequences were banned, but that was changed in order to allow sum() to handle sequences that define __add__ as an element-wise operation (since those won't exhibit the quadratic behaviour).

The date of the issue also made it possible to find the thread Alex mentions in the list archives: https://mail.python.org/pipermail/python-dev/2003-April/034767.html

And there we find that the original patch *did* automatically delegate to str.join, and after an initial review that expressed some reservations (https://mail.python.org/pipermail/python-dev/2003-April/034825.html), Guido specifically requested changing it to disallow strings: https://mail.python.org/pipermail/python-dev/2003-April/034853.html

And Alex agreed: https://mail.python.org/pipermail/python-dev/2003-April/034855.html

The idea of a join() builtin was also discussed and rejected in that thread.

Regards, Nick.

P.S. Note that this kind of thread (which I contributed to myself) is why I'm inclined to insist on a PEP for *any* new builtin these days, even non-controversial ones: so the rationale for the associated design decision is less likely to be rehashed at length more than a decade later. Remembering "there was a PEP for that feature, and it listed this as a rejected design alternative" is relatively easy. Remembering "that idea came up in that thread back in 2003" is much, much harder :P

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (9):
- Alexander Belopolsky
- Andrew Barnert
- Antoine Pitrou
- Chris Kaynor
- David Mertz
- Mark Young
- Nick Coghlan
- Stephen J. Turnbull
- Steven D'Aprano