Re: [Python-Dev] sum(...) limitation
Glenn Linderman writes:
On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang[tm]. Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).
Actually, there is no need to wait for 0.sum() to propose "".sum... but it is only a spelling change, so no real benefit.
IMO it's worse than merely a spelling change, because (1) "join" is a more evocative term for concatenating strings than "sum" and (2) I don't know of any other sums that allow "glue". I'm overall -1 on trying to change the current situation (except for adding a join() builtin or str.join class method).

We could probably fix everything in a static-typed language (because that would allow picking an initial object of the appropriate type), but without that we need to pick a default of some particular type, and 0 makes the most sense. I can understand the desire of people who want to use the same syntax for summing an iterable of numbers and for concatenating an iterable of strings, but to me they're really not even formally the same in practical use.

I'm very sympathetic to Steven's explanation that "we wouldn't be having this discussion if we used a different operator for string concatenation". Although that's not the whole story: in practice even numerical sums get split into multiple functions because floating point addition isn't associative, and so needs careful treatment to preserve accuracy. At that point I'm strongly +1 on abandoning attempts to "rationalize" summation.

I'm not sure how I'd feel about raising an exception if you try to sum any iterable containing misbehaved types like float. But not only would that be a Python 4 effort due to backward incompatibility, but it sorta contradicts the main argument of proponents ("any type implementing __add__ should be sum()-able").
It seems to me this is something of a pointless discussion -- I highly doubt the current situation is going to change, and it works very well. Even if not perfect, sum() is for numbers, sep.join() for strings. However, I will add one comment:

I'm overall -1 on trying to change the current situation (except for adding a join() builtin or str.join class method).
Did you know there actually is a str.join "class method"? I've never actually seen it used this way, but for people who just can't stand sep.join(seq), you can always call str.join(sep, seq) -- works in Python 2 and 3:
>>> str.join('.', ['abc', 'def', 'ghi'])
'abc.def.ghi'
This works as a side effect of the fact that you can call methods as cls.method(instance, args). -Ben
On 8/11/2014 8:26 AM, Ben Hoyt wrote:
It seems to me this is something of a pointless discussion -- I highly doubt the current situation is going to change, and it works very well. Even if not perfect, sum() is for numbers, sep.join() for strings. However, I will add one comment:
I'm overall -1 on trying to change the current situation (except for adding a join() builtin or str.join class method).
Did you know there actually is a str.join "class method"?
A 'method' is a function accessed as an attribute of a class. An 'instance method' is a method whose first parameter is an instance of the class. str.join is an instance method. A 'class method', wrapped as such with classmethod(), usually by decorating it with @classmethod, would take the class as its first parameter.
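A minimal illustration of the difference Terry describes (Demo is a made-up class, just for demonstration):

class Demo:
    def imeth(self, x):        # instance method: first parameter is an instance
        return (self, x)

    @classmethod
    def cmeth(cls, x):         # class method: first parameter is the class itself
        return (cls, x)

Demo.imeth(Demo(), 1)   # the instance must be supplied explicitly
Demo.cmeth(1)           # the class is passed automatically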
I've never actually seen it used this way, but for people who just can't stand sep.join(seq), you can always call str.join(sep, seq) -- works in Python 2 and 3:
>>> str.join('.', ['abc', 'def', 'ghi'])
'abc.def.ghi'
One could even put 'join = str.join' at the top of a file. All this is true of *every* instance method. For instance
>>> int.__add__(1, 2) == 1 .__add__(2) == 1 + 2
True
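And the 'join = str.join' spelling mentioned above works the same way (a sketch; join here is just an ordinary module-level alias, not anything in the stdlib):

>>> join = str.join
>>> join('-', ['a', 'b', 'c'])
'a-b-c'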
However, your point that people who cannot stand the abbreviation *could* use the full form that is being abbreviated. In ancient Python, when strings did not have methods, the current string methods were functions in the string module. The functions were removed in 3.0. Their continued use in 2.x code is bad for 3.x compatibility, so I would not encourage it.
>>> help(string.join)  # 2.7.8
Help on function join in module string:

join(words, sep=' ')
    join(list [,sep]) -> string

    Return a string composed of the words in list, with intervening
    occurrences of sep. The default separator is a single space.

'List' is obsolete. Since sometime before 2.7, 'words' meant an iterable of strings.
>>> def digits():
...     for i in range(10):
...         yield str(i)
...
>>> string.join(digits(), '')
'0123456789'
Of all the string functions, I believe the conversion of join (and its synonym 'joinfields') to a method has been the most contentious. -- Terry Jan Reedy
I'm very sympathetic to Steven's explanation that "we wouldn't be having this discussion if we used a different operator for string concatenation".
Sure -- but just imagine the conversations we could be having instead: what does bitwise and of a string mean? A bytes object? I could see it as a character-wise and, for instance ;-)

My confusion is still this: repeated summation of strings has been optimized in CPython even though it's not the recommended way to solve that problem. So why not special-case optimize sum() for strings? We already special-case strings to raise an exception.

It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.

Practicality beats purity? -Chris
On 12 Aug 2014 03:03, "Chris Barker - NOAA Federal" <chris.barker@noaa.gov> wrote:
My confusion is still this:
Repeated summation of strings has been optimized in cpython even though it's not the recommended way to solve that problem.
The quadratic behaviour of repeated str summation is a subtle, silent error. It *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick. It's considered worth the cost, since it dramatically improves the performance of common naive code in a way that doesn't alter the semantics.
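A rough sketch of the difference in growth (illustrative only -- exact timings vary, and CPython's in-place trick can hide the quadratic cost of the += loop when the reference count allows it):

import timeit

def concat_loop(chunks):
    # Naive repeated concatenation: without the CPython trick, each
    # += copies the whole accumulated string, so total work is O(n**2).
    s = ""
    for c in chunks:
        s += c
    return s

def concat_join(chunks):
    # str.join sizes the result once and copies each chunk once: O(n).
    return "".join(chunks)

chunks = ["x" * 10] * 50000
print(timeit.timeit(lambda: concat_loop(chunks), number=5))
print(timeit.timeit(lambda: concat_join(chunks), number=5))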
So why not special-case optimize sum() for strings? We already special-case strings to raise an exception.
It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.
Yes, that's exactly what this is - a nudge towards the right way to concatenate strings without incurring quadratic behaviour. We *want* people to learn that distinction, not sweep it under the rug. That's the other reason the implicit optimisation is controversial - it hides an important difference in algorithmic complexity from users.
Practicality beats purity?
Teaching users the difference between linear time operations and quadratic ones isn't about purity, it's about passing along a fundamental principle of algorithm scalability. We do it specifically for strings because they *do* have an optimised algorithm available that we can point users towards, and concatenating multiple strings is common. Other containers don't tend to be concatenated like that in the first place, so there's no such check pushing other iterables towards itertools.chain. Regards, Nick.
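For containers other than strings, the linear-time spelling being pointed to is itertools.chain -- a small sketch:

from itertools import chain

lists = [[1, 2], [3, 4], [5]]
# sum(lists, []) builds a new list on every step: O(n**2) copying.
# chain visits each element exactly once: O(n).
flat = list(chain.from_iterable(lists))
print(flat)   # [1, 2, 3, 4, 5]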
On Mon, Aug 11, 2014 at 8:19 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Teaching users the difference between linear time operations and quadratic ones isn't about purity, it's about passing along a fundamental principle of algorithm scalability.
I would understand if this was done in reduce(operator.add, ..), which indeed spells out the choice of an algorithm, but why should sum() be O(N) for numbers and O(N**2) for containers? Would a Python implementation that, for example, optimizes away 0's in sum(list_of_numbers) be non-compliant with some fundamental principle?
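For reference, the functools.reduce() spelling alluded to above makes the left-to-right algorithm explicit, while sum() merely promises the same total:

from functools import reduce
import operator

nums = [1, 2, 3, 4]
print(reduce(operator.add, nums, 0))   # 10 -- explicitly ((((0+1)+2)+3)+4)
print(sum(nums))                       # 10 -- same result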
Sorry for the bike shedding here, but:

The quadratic behaviour of repeated str summation is a subtle, silent error.

OK, fair enough. I suppose it would be hard and ugly to catch those instances and raise an exception pointing users to "".join.

It *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick.

Is there anything in the language spec that says string concatenation is O(n^2)? Or for that matter any of the performance characteristics of built-in types? Those strike me as implementation details that SHOULD be particular to the implementation. Should we cripple the performance of some operation in CPython so that it won't work better than Jython? That seems an odd choice. Then how dare PyPy make scalar computation faster? People might switch to CPython and not know they should have been using numpy all along...

It's considered worth the cost, since it dramatically improves the performance of common naive code in a way that doesn't alter the semantics.

Seems the same argument could be made for sum(list_of_strings).
It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.
Yes, that's exactly what this is - a nudge towards the right way to concatenate strings without incurring quadratic behaviour.

But if it were optimized, it wouldn't incur quadratic behavior.

We *want* people to learn that distinction, not sweep it under the rug.

But sum() is not inherently quadratic -- that's a limitation of the implementation. I agree that disallowing it is a good idea given that behavior, but if it were optimized, there would be no reason to steer people away. "".join _could_ be naively written with the same poor performance -- why should users need to understand why one was optimized and one was not?

That's the other reason the implicit optimisation is controversial - it hides an important difference in algorithmic complexity from users.

It doesn't hide it -- it eliminates it. I suppose it's good for folks to understand the implications of string immutability for when they write their own algorithms, but this wouldn't be considered a good argument for a poorly performing sort(), for instance.
Practicality beats purity?
Teaching users the difference between linear time operations and quadratic ones isn't about purity, it's about passing along a fundamental principle of algorithm scalability.

That is a very important lesson to learn, sure, but Python is not only a teaching language. People will need to learn those lessons at some point; this one feature makes little difference.

We do it specifically for strings because they *do* have an optimised algorithm available that we can point users towards, and concatenating multiple strings is common.

Sure, but I think all that does is teach people about a CPython-specific implementation -- and I doubt naive users get any closer to understanding algorithmic complexity -- all they learn is you should use string.join().

Oh well, not really that big a deal. -Chris
Chris Barker - NOAA Federal writes:
Is there anything in the language spec that says string concatenation is O(n^2)? Or for that matter any of the performance characteristics of built-in types? Those strike me as implementation details that SHOULD be particular to the implementation.
Container concatenation isn't quadratic in Python at all. The naive implementation of sum() as a loop repeatedly calling __add__ is quadratic for them. Strings (and immutable containers in general) are particularly horrible, as they don't have __iadd__.

You could argue that sum() being a function of an iterable isn't just a calling convention for a loop encapsulated in a function, but rather a completely different kind of function that doesn't imply anything about the implementation, and therefore that it should dispatch on type(it). But explicitly dispatching on type(x) is yucky (what if somebody wants to sum a different type not currently recognized by the sum() builtin?) so, obviously, we should define a standard __sum__ dunder! IMO we'd also want a homogeneous_iterable ABC, and a concrete homogeneous_iterable_of_TYPE for each sum()-able TYPE to help users catch bugs injecting the wrong type into an iterable_of_TYPE.

But this still sucks. Why? Because obviously we'd want the attractive nuisance of "if you have __add__, there's a default definition of __sum__" (AIUI, this is what bothers Alexander most about the current situation; at least of the things he's mentioned, I can really sympathize with his dislike). And new Pythonistas and lazy programmers who only intend to use sum() on "small enough" iterables will use the default, and their programs will appear to hang on somewhat larger iterables, or a realtime requirement will go unsatisfied when least expected, or .... If we *don't* have that property for sum(), ugh! Yuck! Same old same old! (IMHO, YMMV of course)

It's possible that Python could provide some kind of feature that would allow an optimized sum function for every type that has __add__, but I think this will take a lot of thinking. *Somebody* will do it (I don't think anybody is +1 on restricting sum() to a subset of types with __add__). I just think we should wait until that somebody appears.
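For what it's worth, a minimal sketch of how such a __sum__ hook could dispatch (entirely hypothetical -- no __sum__ dunder exists today, and my_sum and the hook signature are made up for illustration):

def my_sum(iterable, start=0):
    # Hypothetical dispatch: let the element type supply an efficient
    # summation if it defines __sum__, else fall back to repeated __add__.
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return start
    hook = getattr(type(first), '__sum__', None)
    if hook is not None:
        return hook(first, it, start)
    result = start + first
    for x in it:
        result = result + x
    return result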
Should we cripple the performance of some operation in CPython so that it won't work better than Jython?
Nobody is crippling operations. We're prohibiting use of a *name* for an operation that is associated (strongly so, in my mind) with an inefficient algorithm in favor of the *same operation* by a different name (which has no existing implementation, and therefore Python implementers are responsible for implementing it efficiently). Note: the "inefficient" algorithm isn't inefficient for integers, and it isn't inefficient for numbers in general (although it's inaccurate for some classes of numbers).
Seems the same argument [that Python language doesn't prohibit optimizations in particular implementations just because they aren't made in others] could be made for sum(list_of_strings).
It could. But then we have to consider special-casing every builtin type that provides __add__, and we impose an unobvious burden on user types that provide __add__.
It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.
Nobody disagrees. But backward compatibility gets in the way.
But sum() is not inherently quadratic -- that's a limitation of the implementation.
But the faulty implementation is the canonical implementation, the only one that can be defined directly in terms of __add__, and it is efficient for non-container types.[1]
"".join _could_ be naively written with the same poor performance -- why should users need to understand why one was optimized and one was not?
Good question. They shouldn't -- thus the prohibition on sum()ing strings.
That is a very important lesson to learn, sure, but Python is not only a teaching language. People will need to learn those lessons at some point; this one feature makes little difference.
No, it makes a big difference. "If you can do something, then it's OK to do it" is something Python tries to implement. If sum() works for everything with an __add__, given current Python language features some people are going to end up with very inefficient code and it will bite some of them (and not necessarily the authors!) at some time. If it doesn't work for every type with __add__, why not? You'll end up playing whack-a-mole with type prohibitions. Ugh.
Sure, but I think all that does is teach people about a cpython specific implementation -- and I doubt naive users get any closer to understanding algorithmic complexity -- all they learn is you should use string.join().
Oh well, not really that big a deal.
Not to Python. Maybe not to you. But I've learned a lot about Pythonic ways of doing things trying to channel the folks who implemented this restriction. (I don't claim to have gotten it right! Just that it's been fun and educational. :-) Steve

Footnotes:
[1] This isn't quite true. One can imagine a "token" or "symbol" type that is str without __len__, but does have __add__. But that seems silly enough to not be a problem in practice.
On 08/11/2014 08:50 PM, Stephen J. Turnbull wrote:
Chris Barker - NOAA Federal writes:
It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.
Nobody disagrees. But backward compatibility gets in the way.
Something that currently doesn't work, starts to. How is that a backward compatibility problem? -- ~Ethan~
Ethan Furman writes:
On 08/11/2014 08:50 PM, Stephen J. Turnbull wrote:
Chris Barker - NOAA Federal writes:
It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it.
Nobody disagrees. But backward compatibility gets in the way.
Something that currently doesn't work, starts to. How is that a backward compatibility problem?
I'm referring to removing the unnecessary information that there's a better way to do it, and simply raising an error (as in Python 3.2, say) which is all a RealProgrammer[tm] should ever need! That would be a regression and backward incompatible.
On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm referring to removing the unnecessary information that there's a better way to do it, and simply raising an error (as in Python 3.2, say) which is all a RealProgrammer[tm] should ever need!
I can't imagine anyone is suggesting that -- disallow it, but don't tell anyone why?

The only thing that is remotely on the table here is:

1) remove the special case for strings -- buyer beware -- but consistent and less "ugly"

2) add a special case for strings that is fast and efficient -- may be as simple as calling "".join() under the hood -- no more code than the exception check.

And I doubt anyone really is pushing for anything but (2).

Stephen Turnbull wrote:
IMO we'd also want a homogeneous_iterable ABC
Actually, I've thought for years that that would open the door to a lot of optimizations -- but that's a much broader question than sum(). I even brought it up probably over ten years ago -- but no one was the least bit interested -- nor are they now -- I know this was a rhetorical suggestion to make the point about what not to do....

Because obviously we'd want the attractive nuisance of "if you have __add__, there's a default definition of __sum__"
now I'm confused -- isn't that exactly what we have now?

It's possible that Python could provide some kind of feature that would allow an optimized sum function for every type that has __add__, but I think this will take a lot of thinking.
does it need to be every type? As it is, the common ones work fine already except for strings -- so if we add an optimized string sum() then we're done.

*Somebody* will do it (I don't think anybody is +1 on restricting sum() to a subset of types with __add__).
uhm, that's exactly what we have now -- you can use sum() with anything that has an __add__, except strings. And by that logic, if we thought there were other inefficient use cases, we'd restrict those too. But users can always define their own classes that have a __sum__ and are really inefficient -- so unless sum() becomes just for a certain subset of built-in types -- does anyone want that? Then we are back to the current situation: sum() can be used for any type that has an __add__ defined. But naive users are likely to try it with strings, and that's bad, so we want to prevent that, and have a special-case check for strings.

What I fail to see is why it's better to raise an exception and point users to a better way, than to simply provide an optimization so that it's a moot issue.

The only justification offered here is that it will teach people that summing strings (and some other objects?) is order(N^2) and a bad idea. But:

a) Python's primary purpose is practical, not pedagogical (not that it isn't great for that)

b) I doubt any naive users learn anything other than "I can't use sum() for strings, I should use "".join()". Will they make the leap to "I shouldn't use string concatenation in a loop, either"? Oh, wait, you can use string concatenation in a loop -- that's been optimized. So will they learn: "some types of objects have poor performance with repeated concatenation and shouldn't be used with sum(). So if I write such a class, and want to sum them up, I'll need to write an optimized version of that code"?

I submit that no naive user is going to get any closer to a proper understanding of algorithmic Order behavior from this small hint. Which leaves no reason to prefer an Exception to an optimization.

One other point: perhaps this will lead a naive user into thinking -- "sum() raises an exception if I try to use it inefficiently, so it must be OK to use for anything that doesn't raise an exception" -- that would be a bad lesson to mis-learn....

-Chris

PS: Armin Rigo wrote:
It also improves a lot the precision of sum(list_of_floats) (though not reaching the same precision levels of math.fsum()).
while we are at it, having the default sum() for floats be fsum() would be nice -- I'd rather the default was better accuracy, lower performance. Folks that really care about performance could call math.fastsum(), or really, use numpy...

This does turn sum() into a function that does type-based dispatch, but isn't Python full of those already? Do something special for the types you know about, call the generic dunder method for the rest.

-- Christopher Barker, Ph.D. Oceanographer, Emergency Response Division, NOAA/NOS/OR&R, (206) 526-6959 voice, 7600 Sand Point Way NE, (206) 526-6329 fax, Seattle, WA 98115, (206) 526-6317 main reception, Chris.Barker@noaa.gov
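For concreteness, the precision gap between the builtin sum() and math.fsum() referred to here shows up with a classic cancellation example (values chosen purely for illustration):

>>> import math
>>> vals = [1e16, 1.0, -1e16]
>>> sum(vals)      # the 1.0 is lost when added to 1e16
0.0
>>> math.fsum(vals)
1.0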
Chris Barker <chris.barker@noaa.gov> writes:
What I fail to see is why it's better to raise an exception and point users to a better way, than to simply provide an optimization so that it's a moot issue.
The only justification offered here is that it will teach people that summing strings (and some other objects?) is order(N^2) and a bad idea. But:
a) Python's primary purpose is practical, not pedagogical (not that it isn't great for that)
b) I doubt any naive users learn anything other than "I can't use sum() for strings, I should use "".join()". Will they make the leap to "I shouldn't use string concatenation in a loop, either"? Oh, wait, you can use string concatenation in a loop -- that's been optimized. So will they learn: "some types of objects have poor performance with repeated concatenation and shouldn't be used with sum(). So if I write such a class, and want to sum them up, I'll need to write an optimized version of that code"?
I submit that no naive user is going to get any closer to a proper understanding of algorithmic Order behavior from this small hint. Which leaves no reason to prefer an Exception to an optimization.
One other point: perhaps this will lead a naive user into thinking -- "sum() raises an exception if I try to use it inefficiently, so it must be OK to use for anything that doesn't raise an exception" -- that would be a bad lesson to mis-learn....
AOL to that. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«
Redirecting to python-ideas, so trimming less than I might. Chris Barker writes:
On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm referring to removing the unnecessary information that there's a better way to do it, and simply raising an error (as in Python 3.2, say) which is all a RealProgrammer[tm] should ever need!
I can't imagine anyone is suggesting that -- disallow it, but don't tell anyone why?
As I said, it's a regression. That's exactly the behavior in Python 3.2.
The only thing that is remotely on the table here is:
1) remove the special case for strings -- buyer beware -- but consistent and less "ugly"
It's only consistent if you believe that Python has strict rules for use of various operators. It doesn't, except as far as they are constrained by precedence. For example, I have an application where I add bytestrings bytewise modulo N <= 256, and concatenate them. In fact I use function call syntax, but the obvious operator syntax is '+' for the bytewise addition, and '*' for the concatenation. It's not in the Zen, but I believe in the maxim "If it's worth doing, it's worth doing well." So for me, 1) is out anyway.
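In function-call syntax, that bytewise addition might look something like this (a sketch of the described use case, fixing N = 256 for simplicity; byte_add is a made-up name):

def byte_add(a, b):
    # Element-wise addition modulo 256 of two equal-length bytestrings.
    return bytes((x + y) % 256 for x, y in zip(a, b))

print(byte_add(b'\x01\xff', b'\x02\x03'))   # b'\x03\x02'
print(b'\x01\xff' + b'\x02\x03')            # concatenation: b'\x01\xff\x02\x03'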
2) add a special case for strings that is fast and efficient -- may be as simple as calling "".join() under the hood --no more code than the exception check.
Sure, but what about all the other immutable containers with __add__ methods? What about mappings with key-wise __add__ methods whose values might be immutable but have __add__ methods? Where do you stop with the special-casing? I consider this far more complex and ugly than the simple "sum() is for numbers" rule (and even that is way too complex considering accuracy of summing floats).
And I doubt anyone really is pushing for anything but (2)
I know that, but I think it's the wrong solution to the problem (which is genuine IMO). The right solution is something generic, possibly a __sum__ method. The question is whether that leads to too much work to be worth it (eg, "homogeneous_iterable").
Because obviously we'd want the attractive nuisance of "if you have __add__, there's a default definition of __sum__"
now I'm confused -- isn't that exactly what we have now?
Yes and my feeling (backed up by arguments that I admit may persuade nobody but myself) is that what we have now kinda sucks[tm]. It seemed like a good idea when I first saw it, but then, my apps don't scale to where the pain starts in my own usage.
It's possible that Python could provide some kind of feature that would allow an optimized sum function for every type that has __add__, but I think this will take a lot of thinking.
does it need to be every type? As it is the common ones work fine already except for strings -- so if we add an optimized string sum() then we're done.
I didn't say provide an optimized sum(), I said provide a feature enabling people who want to optimize sum() to do so. So yes, it needs to be every type (the optional __sum__ method is a proof of concept, modulo it actually being implementable ;-).
*Somebody* will do it (I don't think anybody is +1 on restricting sum() to a subset of types with __add__).
uhm, that's exactly what we have now
Exactly. Who's arguing that the sum() we have now is a ticket to Paradise? I'm just saying that there's probably somebody out there negative enough on the current situation to come up with an answer that I think is general enough (and I suspect that python-dev consensus is that demanding, too).
sum() can be used for any type that has an __add__ defined.
I'd like to see that be mutable types with __iadd__.
What I fail to see is why it's better to raise an exception and point users to a better way, than to simply provide an optimization so that it's a moot issue.
Because inefficient sum() is an attractive nuisance, easy to overlook, and likely to bite users other than the author.
The only justification offered here is that it will teach people that summing strings (and some other objects?)
Summing tuples works (with appropriate start=tuple()). Haven't benchmarked, but I bet that's O(N^2).
is order(N^2) and a bad idea. But:
a) Python's primary purpose is practical, not pedagogical (not that it isn't great for that)
My argument is that in practical use sum() is a bad idea, period, until you book up on the types and applications where it *does* work. N.B. It doesn't even work properly for numbers (inaccurate for floats).
b) I doubt any naive users learn anything other than "I can't use sum() for strings, I should use "".join()".
For people who think that special-casing strings is a good idea, I think this is about as much benefit as you can expect. Why go farther?<0.5 wink/>
I submit that no naive user is going to get any closer to a proper understanding of algorithmic Order behavior from this small hint. Which leaves no reason to prefer an Exception to an optimization.
TOOWTDI. str.join is in pretty much every code base by now, and tutorials and FAQs recommending its use and severely deprecating sum for strings are legion.
One other point: perhaps this will lead a naive user into thinking -- "sum() raises an exception if I try to use it inefficiently, so it must be OK to use for anything that doesn't raise an exception" -- that would be a bad lesson to mis-learn....
That assumes they know about the start argument. I think most naive users will just try to sum a bunch of tuples, and get the "can't add 0, tuple" Exception and write a loop. I suspect that many of the users who get the "use str.join" warning along with the Exception are unaware of the start argument, too. They expect sum(iter_of_str) to magically add the strings. Ie, when in 3.2 they got the uninformative "can't add 0, str" message, they did not immediately go "d'oh" and insert ", start=''" in the call to sum, they wrote a loop.
while we are at it, having the default sum() for floats be fsum() would be nice
How do you propose to implement that, given math.fsum is perfectly happy to sum integers? You can't just check one or a few leading elements for floatiness. I think you have to dispatch on type(start), but then sum(iter_of_floats) DTWT. So I would suggest changing the signature to sum(it, start=0.0). This would probably be acceptable to most users with iterables of ints, but does imply some performance hit.
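A sketch of that dispatch-on-type(start) idea (illustrative only; smart_sum is a made-up name, and as noted it only helps an iterable of floats if the default start is changed to 0.0):

import math

def smart_sum(it, start=0.0):
    # Dispatch on the type of 'start': floats get the compensated
    # algorithm, everything else the plain left-to-right loop.
    if isinstance(start, float):
        return start + math.fsum(it)   # fsum happily consumes ints too
    result = start
    for x in it:
        result = result + x
    return result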
This does turn sum() into a function that does type-based dispatch, but isn't python full of those already? do something special for the types you know about, call the generic dunder method for the rest.
AFAIK Python is moving in the opposite direction: if there's a common need for dispatching to type-specific implementations of a method, define a standard (not "generic") dunder for the purpose, and have the builtin (or operator, or whatever) look up (not "call") the appropriate instance in the usual way, then call it. If there's a useful generic implementation, define an ABC to inherit from that provides that generic implementation.
On Tue, Aug 12, 2014 at 11:21 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Redirecting to python-ideas, so trimming less than I might.
reasonable enough -- you are introducing some more significant ideas for changes. I've said all I have to say about this -- I don't seem to see anything encouraging from core devs, so I guess that's it. Thanks for the fun bike-shedding... -Chris
On 12 Aug 2014 11:21, "Chris Barker - NOAA Federal" <chris.barker@noaa.gov> wrote:
Sorry for the bike shedding here, but:
The quadratic behaviour of repeated str summation is a subtle, silent error.

OK, fair enough. I suppose it would be hard and ugly to catch those instances and raise an exception pointing users to "".join.

It *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick.

Is there anything in the language spec that says string concatenation is O(n^2)? Or for that matter any of the performance characteristics of built-in types? Those strike me as implementation details that SHOULD be particular to the implementation.

If you implement strings so they have multiple data segments internally (as is the case for StringIO these days), yes, you can avoid quadratic time concatenation behaviour. Doing so makes it harder to meet other complexity expectations (like O(1) access to arbitrary code points), and isn't going to happen in CPython regardless due to C API backwards compatibility constraints.

For the explicit loop with repeated concatenation, we can't say "this is slow, don't do it". People do it anyway, so we've opted for the "fine, make it as fast as we can" option as being preferable to an obscure and relatively hard to debug performance problem.

For sum(), we have the option of being more direct and just telling people Python's answer to the string concatenation problem (i.e. str.join). That is decidedly *not* the series of operations described in sum's documentation as "Sums start and the items of an iterable from left to right and returns the total." Regards, Nick.
Hi all,

The core of the matter is that if we repeatedly __add__ strings from a long list, we get O(n**2) behavior. From one point of view, the reason is that the additions proceed in left-to-right order. Indeed, sum() could proceed in a more balanced tree-like order: from [x0, x1, x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat until there is only one item in the final list. This order ensures that sum(list_of_strings) is at worst O(n log n). It might in practice be close enough to linear to not matter. It also improves a lot the precision of sum(list_of_floats) (though not reaching the same precision levels of math.fsum()).

Just a thought, Armin.
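A sketch of that balanced reduction (pairwise_sum is a made-up name; a real version would live inside sum() itself):

def pairwise_sum(items):
    # Reduce [x0, x1, x2, x3, ...] to [x0+x1, x2+x3, ...] repeatedly.
    # Each element takes part in only O(log n) additions, so string
    # summation is O(n log n) and float rounding error shrinks too.
    items = list(items)
    if not items:
        raise ValueError("pairwise_sum() of empty sequence")
    while len(items) > 1:
        paired = [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])   # carry an odd trailing item over
        items = paired
    return items[0]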
On 12 Aug 2014, at 10:02, Armin Rigo <arigo@tunes.org> wrote:
Hi all,
The core of the matter is that if we repeatedly __add__ strings from a long list, we get O(n**2) behavior. From one point of view, the reason is that the additions proceed in left-to-right order. Indeed, sum() could proceed in a more balanced tree-like order: from [x0, x1, x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat until there is only one item in the final list. This order ensures that sum(list_of_strings) is at worst O(n log n). It might in practice be close enough to linear to not matter. It also improves a lot the precision of sum(list_of_floats) (though not reaching the same precision levels of math.fsum()).
I wonder why nobody has mentioned previous year's discussion of the same issue yet: http://marc.info/?l=python-ideas&m=137359619831497&w=2 Maybe someone can write a PEP about this that can be pointed to when the question is discussed again next summer ;-) Ronald
participants (12)
- Alexander Belopolsky
- Armin Rigo
- Ben Hoyt
- Chris Barker
- Chris Barker - NOAA Federal
- Ethan Furman
- Nick Coghlan
- Nikolaus Rath
- Ronald Oussoren
- Stefan Richthofer
- Stephen J. Turnbull
- Terry Reedy