sum(...) limitation
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails. However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. Is this to be considered a bug? Andrea
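[For reference, here is the behaviour being described, as seen at a recent CPython 3 prompt; the exact TypeError wording may vary between versions:]

    >>> sum(["ab", "cd"], "")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: sum() can't sum strings [use ''.join(seq) instead]
    >>> sum([[1, 2, 3], [4], [], [5, 6]], [])
    [1, 2, 3, 4, 5, 6]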
No. We just can't put all possible use cases in the docstring. :-) On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Andrea
-- --Guido van Rossum (python.org/~guido)
On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
No. We just can't put all possible use cases in the docstring. :-)
On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing).
On 8/2/2014 1:57 AM, Allen Li wrote:
On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
No. We just can't put all possible use cases in the docstring. :-)
On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing).
O(n**2) behavior, ''.join(strings) alternative. -- Terry Jan Reedy
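[To make Terry's point concrete, a rough timing sketch; absolute numbers are machine-dependent, and the string is built in reverse so that CPython's in-place resize optimization, discussed later in the thread, cannot hide the quadratic cost:]

    import timeit

    pieces = ["x"] * 20000

    def concat(parts):
        s = ""
        for p in parts:
            s = p + s          # each step copies everything built so far: O(N**2) overall
        return s

    print(timeit.timeit(lambda: concat(pieces), number=1))       # grows quadratically with len(pieces)
    print(timeit.timeit(lambda: "".join(pieces), number=1))      # roughly linear in total length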
On 02.08.2014 08:35, Terry Reedy wrote:
On 8/2/2014 1:57 AM, Allen Li wrote:
On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
No. We just can't put all possible use cases in the docstring. :-)
On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing).
O(n**2) behavior, ''.join(strings) alternative.
hm could this be a pure python case that would profit from temporary elision [0]? lists could declare the tp_can_elide slot and call list.extend on the temporary during its tp_add slot instead of creating a new temporary. extend/realloc can avoid the copy if there is free memory available after the block. [0] https://mail.python.org/pipermail/python-dev/2014-June/134826.html
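[For readers who don't live in the C API, here is a rough pure-Python picture of what eliding the temporaries would buy for sum(list_of_lists, []); the names are only illustrative, this is not how a tp_can_elide slot would actually be wired up:]

    def sum_plus(lists, start):
        # what sum() effectively does today: every + builds a brand-new list
        acc = start
        for item in lists:
            acc = acc + item            # O(len(acc) + len(item)) at every step
        return acc

    def sum_elided(lists, start):
        # what reusing the refcount-1 temporary would allow: grow one list in place
        acc = list(start)               # copy the start value once
        for item in lists:
            acc.extend(item)            # amortized O(len(item)) per step
        return acc

    data = [[1, 2, 3], [4], [], [5, 6]]
    assert sum_plus(data, []) == sum_elided(data, []) == [1, 2, 3, 4, 5, 6]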
Julian Taylor schrieb am 02.08.2014 um 12:11:
On 02.08.2014 08:35, Terry Reedy wrote:
On 8/2/2014 1:57 AM, Allen Li wrote:
On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
No. We just can't put all possible use cases in the docstring. :-)
On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing).
O(n**2) behavior, ''.join(strings) alternative.
lists could declare the tp_can_elide slot and call list.extend on the temporary during its tp_add slot instead of creating a new temporary. extend/realloc can avoid the copy if there is free memory available after the block.
Yes, i.e. only sometimes. Better not rely on it in your code. Stefan
Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote (in https://mail.python.org/pipermail/python-dev/2014-August/135623.html ):
Andrea Griffini <agriff at tin.it> wrote:
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
hm could this be a pure python case that would profit from temporary elision [ https://mail.python.org/pipermail/python-dev/2014-June/134826.html ]?
lists could declare the tp_can_elide slot and call list.extend on the temporary during its tp_add slot instead of creating a new temporary. extend/realloc can avoid the copy if there is free memory available after the block.
Yes, with all the same problems.

When dealing with a complex object, how can you be sure that __add__ won't need access to the original values during the entire computation? It works with matrix addition, but not with matrix multiplication. Depending on the details of the implementation, it could even fail for a sort of sliding-neighbor addition similar to the original justification.

Of course, then those tricky implementations should not define an _eliding_add_, but maybe the builtin objects still should? After all, a plain old list is OK to re-use. Unless the first evaluation to create it ends up evaluating an item that has side effects...

In the end, it looks like a lot of machinery (and extra checks that may slow down the normal small-object case) for something that won't be used all that often. Though it is really tempting to consider a compilation mode that assumes objects and builtins will be "normal", and lets you replace the entire above expression with compile-time [1, 2, 3, 4, 5, 6]. Would writing objects to that stricter standard and encouraging its use (and maybe offering a few AST transforms to auto-generate the out-parameters?) work as well for those who do need the speed?

-jJ

--
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
On 04.08.2014 22:22, Jim J. Jewett wrote:
Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote (in https://mail.python.org/pipermail/python-dev/2014-August/135623.html ):
Andrea Griffini <agriff at tin.it> wrote:
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
hm could this be a pure python case that would profit from temporary elision [ https://mail.python.org/pipermail/python-dev/2014-June/134826.html ]?
lists could declare the tp_can_elide slot and call list.extend on the temporary during its tp_add slot instead of creating a new temporary. extend/realloc can avoid the copy if there is free memory available after the block.
Yes, with all the same problems.
When dealing with a complex object, how can you be sure that __add__ won't need access to the original values during the entire computation? It works with matrix addition, but not with matrix multiplication. Depending on the details of the implementation, it could even fail for a sort of sliding-neighbor addition similar to the original justification.
The C extension object knows what its add slot does. An object that cannot elide would simply always return 0, indicating to Python not to call the in-place variant. E.g. numpy's __matmul__ operator would never tell Python that it can work in place, but __add__ would (if the arguments allow it).

Though we may have found a way to do it without the direct help of Python: it involves reading and storing the current instruction of the frame object to figure out whether it is called directly from the interpreter. Unfinished patch to numpy, see the can_elide_temp function: https://github.com/numpy/numpy/pull/4322.diff

Probably not the best way, as this is hardly intended Python C-API, but assuming there is no overlooked issue with this approach it could be a good workaround for known-good Python versions.
On Fri, Aug 01, 2014 at 10:57:38PM -0700, Allen Li wrote:
On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote:
No. We just can't put all possible use cases in the docstring. :-)
On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini <agriff@tin.it> wrote:
help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
Is this to be considered a bug?
Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing).
Repeated list and str concatenation both have quadratic O(N**2) performance, but people frequently build up strings with + and rarely do the same for lists. String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.

Whatever the reason, repeated string concatenation is common whereas repeated list concatenation is much, much rarer (and repeated tuple concatenation even rarer), so sum(strings) is likely to be a land mine buried in your code while sum(lists) is not. Hence the decision that beginners in particular need to be protected from the mistake of using sum(strings), but that bothering to check for sum(lists) is a waste of time. Personally, I wish that sum would raise a warning rather than an exception.

As for prohibiting anything except numbers with sum(), that in my opinion would be a bad idea. sum(vectors), sum(numeric_arrays), sum(angles) etc. should all be allowed. The general sum() built-in should accept any type that allows + (unless explicitly black-listed), while specialist numeric-only sums could go into modules (like math.fsum).

-- Steven
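[As an aside, the duck-typed uses mentioned above work today precisely because sum() only needs +; a toy example, any resemblance to a real vector library is accidental:]

    class Vector:
        def __init__(self, *coords):
            self.coords = coords

        def __add__(self, other):
            return Vector(*(a + b for a, b in zip(self.coords, other.coords)))

        def __radd__(self, other):
            # lets sum() start from its default start value of 0
            return self if other == 0 else NotImplemented

        def __repr__(self):
            return "Vector%r" % (self.coords,)

    print(sum([Vector(1, 2), Vector(3, 4), Vector(5, 6)]))   # Vector((9, 12))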
On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve@pearwood.info> wrote:
String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.
Since sum() already treats strings as a special case, why can't it simply call (an equivalent of) ''.join itself instead of telling the user to do it? It does not matter why "many people dislike or cannot remember to use ''.join" - if this is a fact - it should be considered by language implementors.
Alexander Belopolsky schrieb am 02.08.2014 um 16:52:
On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote:
String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.
Since sum() already treats strings as a special case, why can't it simply call (an equivalent of) ''.join itself instead of telling the user to do it? It does not matter why "many people dislike or cannot remember to use ''.join" - if this is a fact - it should be considered by language implementors.
I don't think sum(strings) is beautiful enough to merit special cased support. Special cased rejection sounds like a much better way to ask people "think again - what's a sum of strings anyway?". Stefan
On Sat, Aug 2, 2014 at 11:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I don't think sum(strings) is beautiful enough
sum(strings) is more beautiful than ''.join(strings) in my view, but unfortunately it does not work even for lists because the initial value defaults to 0. sum(strings, '') and ''.join(strings) are equally ugly and non-obvious because they require an empty string. Empty containers are an advanced concept and it is unfortunate that a simple job of concatenating a list of (non-empty!) strings exposes the user to it.
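[For what it's worth, the dispatch Alexander is describing is easy enough to sketch in pure Python; this is only an illustration of the idea, not a proposal for the builtin:]

    def sum_any(iterable, start=None):
        # hypothetical sum() that concatenates strings instead of rejecting them
        items = iter(iterable)
        if start is None:
            try:
                start = next(items)
            except StopIteration:
                return 0
        if isinstance(start, str):
            return start + "".join(items)    # hand the work to the efficient path
        result = start
        for item in items:
            result = result + item
        return result

    print(sum_any(["con", "cat", "enate"]))   # 'concatenate', no empty-string start needed
    print(sum_any([[1, 2], [3]], []))         # [1, 2, 3]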
On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote:
On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve@pearwood.info> wrote:
String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.
Since sum() already treats strings as a special case, why can't it simply call (an equivalent of) ''.join itself instead of telling the user to do it? It does not matter why "many people dislike or cannot remember to use ''.join" - if this is a fact - it should be considered by language implementors.
It could, of course, but there is virtue in keeping sum simple, rather than special-casing who knows how many different types. If sum() tries to handle strings, should it do the same for lists? bytearrays? array.array? tuple? Where do we stop?

Ultimately it comes down to personal taste. Some people are going to wish sum() tried harder to do the clever thing with more types, some people are going to wish it was simpler and didn't try to be clever at all.

Another argument against excessive cleverness is that it ties sum() to one particular idiom or implementation. Today, the idiomatic and efficient way to concatenate a lot of strings is with ''.join, but tomorrow there might be a new str.concat() method. Who knows? sum() shouldn't have to care about these details, since they are secondary to sum()'s purpose, which is to add numbers. Anything else is a bonus (or perhaps a nuisance).

So, I would argue that when faced with something that is not a number, there are two reasonable approaches for sum() to take:

- refuse to handle the type at all; or
- fall back on simple-minded repeated addition.

By the way, I think this whole argument would have been easily side-stepped if + was only used for addition, and & used for concatenation. Then there would be no question about what sum() should do for lists and tuples and strings: raise TypeError.

-- Steven
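[Purely to illustrate what that operator split would feel like, a toy subclass; the real str has, of course, no __and__:]

    class CatStr(str):
        # a string type where & spells concatenation, leaving + free for "real" addition
        def __and__(self, other):
            return CatStr(str.__add__(self, other))

    s = CatStr("spam") & "eggs" & "!"
    print(s)            # spameggs!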
On 2014-08-02 16:27, Steven D'Aprano wrote:
On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote:
On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano <steve@pearwood.info> wrote:
String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.
Since sum() already treats strings as a special case, why can't it simply call (an equivalent of) ''.join itself instead of telling the user to do it? It does not matter why "many people dislike or cannot remember to use ''.join" - if this is a fact - it should be considered by language implementors.
It could, of course, but there is virtue in keeping sum simple, rather than special-casing who knows how many different types. If sum() tries to handle strings, should it do the same for lists? bytearrays? array.array? tuple? Where do we stop?
We could leave any special-casing to the classes themselves:

    def sum(iterable, start=0):
        sum_func = getattr(type(start), '__sum__', None)
        if sum_func is None:
            result = start
            for item in iterable:
                result = result + item
        else:
            result = sum_func(start, iterable)
        return result
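[To show how a class would opt in to that hook, a small example against the sketch above; the __sum__ protocol is purely hypothetical, nothing in CPython looks it up:]

    class Money:
        def __init__(self, cents):
            self.cents = cents

        def __add__(self, other):
            return Money(self.cents + other.cents)

        def __sum__(self, iterable):
            # one pass over plain integers, no intermediate Money objects
            return Money(self.cents + sum(m.cents for m in iterable))

        def __repr__(self):
            return "Money(%d)" % self.cents

    prices = [Money(199), Money(250), Money(51)]
    print(sum(prices, Money(0)))    # Money(500) -- via __sum__ with the sketch, via repeated + otherwise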
Ultimately it comes down to personal taste. Some people are going to wish sum() tried harder to do the clever thing with more types, some people are going to wish it was simpler and didn't try to be clever at all.
Another argument against excessive cleverness is that it ties sum() to one particular idiom or implementation. Today, the idiomatic and efficient way to concatenate a lot of strings is with ''.join, but tomorrow there might be a new str.concat() method. Who knows? sum() shouldn't have to care about these details, since they are secondary to sum()'s purpose, which is to add numbers. Anything else is a bonus (or perhaps a nuisance).
So, I would argue that when faced with something that is not a number, there are two reasonable approaches for sum() to take:
- refuse to handle the type at all; or - fall back on simple-minded repeated addition.
By the way, I think this whole argument would have been easily side-stepped if + was only used for addition, and & used for concatenation. Then there would be no question about what sum() should do for lists and tuples and strings: raise TypeError.
On Sat, Aug 02, 2014 at 05:39:12PM +1000, Steven D'Aprano wrote:
Repeated list and str concatenation both have quadratic O(N**2) performance, but people frequently build up strings with + and rarely do the same for lists. String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join.
join() isn't preferable in cases where it damages readability while simultaneously providing zero or negative performance benefit, such as when concatenating a few short strings, e.g. while adding a prefix to a filename. Although it's true that join() is automatically the safer option, and especially when dealing with user supplied data, the net harm caused by teaching rote and ceremony seems far less desirable compared to fixing a trivial slowdown in a script, if that slowdown ever became apparent.

Another (twisted) interpretation is that since the quadratic behaviour is a CPython implementation detail, and there are alternatives where __add__ is constant time, encouraging users to code against implementation details becomes undesirable. In our twisty world, __add__ becomes *preferable* since the resulting programs more closely resemble pseudo-code.

    $ cat t.py
    a = 'this '
    b = 'is a string'
    c = 'as we can tell'

    def x():
        return a + b + c

    def y():
        return ''.join([a, b, c])

    $ python -m timeit -s 'import t' 't.x()'
    1000000 loops, best of 3: 0.477 usec per loop
    $ python -m timeit -s 'import t' 't.y()'
    1000000 loops, best of 3: 0.695 usec per loop

David
On Sat, Aug 2, 2014 at 1:35 PM, David Wilson <dw+python-dev@hmmz.org> wrote:
Repeated list and str concatenation both have quadratic O(N**2) performance, but people frequently build up strings with +
join() isn't preferable in cases where it damages readability while simultaneously providing zero or negative performance benefit, such as when concatenating a few short strings, e.g. while adding a prefix to a filename.
Good point -- I was trying to make the point about .join() vs + for strings in an intro Python class last year, and made the mistake of having the students test the performance. You need to concatenate a LOT of strings to see any difference at all -- I know that O() of algorithms is unavoidable, but between efficient Python optimizations and an apparently good memory allocator, it's really a practical non-issue.
Although it's true that join() is automatically the safer option, and especially when dealing with user supplied data, the net harm caused by teaching rote and ceremony seems far less desirable compared to fixing a trivial slowdown in a script, if that slowdown ever became apparent.
and it rarely would. Blocking sum( some_strings) because it _might_ have poor performance seems awfully pedantic. As a long-time numpy user, I think sum(a_long_list_of_numbers) has pathetically bad performance, but I wouldn't block it! -Chris
On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
Good point -- I was trying to make the point about .join() vs + for strings in an intro python class last year, and made the mistake of having the students test the performance.
You need to concatenate a LOT of strings to see any difference at all -- I know that O() of algorithms is unavoidable, but between efficient python optimizations and a an apparently good memory allocator, it's really a practical non-issue.
If only that were the case, but it isn't. Here's a cautionary tale for how using string concatenation can blow up in your face:

Chris Withers asks for help debugging HTTP slowness:
https://mail.python.org/pipermail/python-dev/2009-August/091125.html

and publishes some times:
https://mail.python.org/pipermail/python-dev/2009-September/091581.html

(notice that Python was SIX HUNDRED times slower than wget or IE)

and Simon Cross identified the problem:
https://mail.python.org/pipermail/python-dev/2009-September/091582.html

leading Guido to describe the offending code as an embarrassment.

It shouldn't be hard to demonstrate the difference between repeated string concatenation and join, all you need do is defeat sum()'s prohibition against strings. Run this bit of code, and you'll see a significant difference in performance, even with CPython's optimized concatenation:

    # --- cut ---
    class Faker:
        def __add__(self, other):
            return other

    x = Faker()
    strings = list("Hello World!")
    assert ''.join(strings) == sum(strings, x)

    from timeit import Timer
    setup = "from __main__ import x, strings"
    t1 = Timer("''.join(strings)", setup)
    t2 = Timer("sum(strings, x)", setup)
    print(min(t1.repeat()))
    print(min(t2.repeat()))
    # --- cut ---

On my computer, using Python 2.7, I find the version using sum is nearly 4.5 times slower, and with 3.3 about 4.2 times slower. That's with a mere twelve substrings, hardly "a lot". I tried running it on IronPython with a slightly larger list of substrings, but I got sick of waiting for it to finish.

If you want to argue that microbenchmarks aren't important, well, I might agree with you in general, but in the specific case of string concatenation there's that pesky factor of 600 slowdown in real world code to argue with.
Blocking sum( some_strings) because it _might_ have poor performance seems awfully pedantic.
The rationale for explicitly prohibiting strings while merely implicitly discouraging other non-numeric types is that beginners, who are least likely to understand why their code occasionally and unpredictably becomes catastrophically slow, are far more likely to sum strings than sum tuples or lists. (I don't entirely agree with this rationale, I'd prefer a warning rather than an exception.) -- Steven
Steven D'Aprano schrieb am 04.08.2014 um 20:10:
On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
Good point -- I was trying to make the point about .join() vs + for strings in an intro python class last year, and made the mistake of having the students test the performance.
You need to concatenate a LOT of strings to see any difference at all -- I know that O() of algorithms is unavoidable, but between efficient python optimizations and a an apparently good memory allocator, it's really a practical non-issue.
If only that were the case, but it isn't. Here's a cautionary tale for how using string concatenation can blow up in your face:
Chris Withers asks for help debugging HTTP slowness: https://mail.python.org/pipermail/python-dev/2009-August/091125.html
and publishes some times: https://mail.python.org/pipermail/python-dev/2009-September/091581.html
(notice that Python was SIX HUNDRED times slower than wget or IE)
and Simon Cross identified the problem: https://mail.python.org/pipermail/python-dev/2009-September/091582.html
leading Guido to describe the offending code as an embarrassment.
Thanks for digging up that story.
Blocking sum( some_strings) because it _might_ have poor performance seems awfully pedantic.
The rationale for explicitly prohibiting strings while merely implicitly discouraging other non-numeric types is that beginners, who are least likely to understand why their code occasionally and unpredictably becomes catastrophically slow, are far more likely to sum strings than sum tuples or lists.
Well, the obvious difference between strings and lists (not tuples) is that strings are immutable, so it would seem more obvious at first sight to concatenate strings than to do the same thing with lists, which can easily be extended (they are clearly designed for that). This rationale may not apply as much to beginners as to more experienced programmers, but it should still explain why this is so often discussed in the context of string concatenation and pretty much never for lists.

As for tuples, their most common use case is to represent a fixed-length sequence of semantically different values. That renders their concatenation a sufficiently uncommon use case that no-one asks loudly for "large scale" sum(tuples) support.

Basically, extending lists is an obvious thing, but getting multiple strings joined without "+"-concatenating them isn't.

Stefan
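[For completeness, the "obvious thing" for lists referred to above looks like this -- extend in place, or let itertools do the looping -- which is why nobody reaches for sum() there:]

    from itertools import chain

    lists = [[1, 2, 3], [4], [], [5, 6]]

    flat = []
    for chunk in lists:
        flat.extend(chunk)                      # grow one list in place

    flat2 = list(chain.from_iterable(lists))    # the one-liner equivalent

    assert flat == flat2 == [1, 2, 3, 4, 5, 6]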
On Mon, Aug 4, 2014 at 11:10 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
Good point -- I was trying to make the point about .join() vs + for strings in an intro python class last year, and made the mistake of having the students test the performance.
You need to concatenate a LOT of strings to see any difference at all
If only that were the case, but it isn't. Here's a cautionary tale for how using string concatenation can blow up in your face:
Chris Withers asks for help debugging HTTP slowness: https://mail.python.org/pipermail/python-dev/2009-August/091125.html
Thanks for that -- interesting story. Note that that was not using sum() in that case though, which is really the issue at hand.

It shouldn't be hard to demonstrate the difference between repeated string concatenation and join, all you need do is defeat sum()'s prohibition against strings. Run this bit of code, and you'll see a significant difference in performance, even with CPython's optimized concatenation:
well, that does look compelling, but what it shows is that sum(a_list_of_strings) is slow compared to ''.join(a_list_of_strings). That doesn't surprise me a bit -- this is really similar to why:

    a_numpy_array.sum()

is going to be a lot faster than:

    sum(a_numpy_array)

and why I'll tell everyone that is working with lots of numbers to use numpy. ndarray.sum knows what data type it's dealing with, and can do the loop in C. Similarly with ''.join() (though not as optimized).

But I'm not sure we're seeing the big O difference here at all -- but rather the extra calls through each list element's __add__ method.

In the case where you already HAVE a big list of strings, then yes, ''.join is the clear winner. But I think the case we're often talking about, and I've tested with students, is when you are building up a long string on the fly out of little strings. In that case, you need to profile the full "append to list, then call join()", not just the join() call:

    # continued adding of strings ( O(n^2)? )
    In [6]: def add_strings(l):
       ...:     s = ''
       ...:     for i in l:
       ...:         s += i
       ...:     return s

    # using append and then join ( O(n)? )
    In [14]: def join_strings(list_of_strings):
       ....:     l = []
       ....:     for i in list_of_strings:
       ....:         l.append(i)
       ....:     return ''.join(l)

    In [23]: timeit add_strings(strings)
    1000000 loops, best of 3: 831 ns per loop

    In [24]: timeit join_strings(strings)
    100000 loops, best of 3: 1.87 µs per loop

    ## hmm -- concatenating is faster for a small list of tiny strings....

    In [31]: strings = list('Hello World')* 1000
    strings *= 1000

    In [26]: timeit add_strings(strings)
    1000 loops, best of 3: 932 µs per loop

    In [27]: timeit join_strings(strings)
    1000 loops, best of 3: 967 µs per loop

    ## now about the same.

    In [31]: strings = list('Hello World')* 10000

    In [29]: timeit add_strings(strings)
    100 loops, best of 3: 9.44 ms per loop

    In [30]: timeit join_strings(strings)
    100 loops, best of 3: 10.1 ms per loop

    ## still about the same?

    In [31]: strings = list('Hello World')* 1000000

    In [32]: timeit add_strings(strings)
    1 loops, best of 3: 1.27 s per loop

    In [33]: timeit join_strings(strings)
    1 loops, best of 3: 1.05 s per loop

    ## there we go -- slight advantage to joining.....

So this is why we've said that the common wisdom about string concatenation isn't really a practical issue.

But if you already have the strings all in a list, then yes, join() is a major win over sum(). In fact, I tried the above with sum() -- and it was really, really slow. So slow I didn't have the patience to wait for it. Here is a smaller example:

    In [22]: strings = list('Hello World')* 10000

    In [23]: timeit add_strings(strings)
    100 loops, best of 3: 9.61 ms per loop

    In [24]: timeit sum( strings, Faker() )
    1 loops, best of 3: 246 ms per loop

So why is sum() so darn slow with strings compared to a simple loop with +=? (And if I try it with a list 10 times as long it takes "forever".)

Perhaps the http issue cited was before some nifty optimizations in current CPython?

-Chris
On 08/07/2014 03:06 PM, Chris Barker wrote: [snip timings, etc.]

I don't remember where, but I believe that cPython has an optimization built in for repeated string concatenation, which is probably why you aren't seeing big differences between the + and the sum().

A little testing shows how to defeat that optimization:

    blah = ''
    for string in ['booyah'] * 100000:
        blah = string + blah

Note the reversed order of the addition.

    --> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1)
    [0.021117210388183594, 0.013692855834960938, 0.00768280029296875]

    --> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1)
    [15.301048994064331, 15.343288898468018, 15.268463850021362]

-- ~Ethan~
On 08/07/2014 04:01 PM, Ethan Furman wrote:
On 08/07/2014 03:06 PM, Chris Barker wrote:
--> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1) [0.021117210388183594, 0.013692855834960938, 0.00768280029296875]
--> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1) [15.301048994064331, 15.343288898468018, 15.268463850021362]
Oh, and the join() timings:

    --> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah = ''").repeat(3, 1)
    [0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188]

So, + is three orders of magnitude slower than join.

-- ~Ethan~
On Thu, Aug 7, 2014 at 4:01 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I don't remember where, but I believe that cPython has an optimization built in for repeated string concatenation, which is probably why you aren't seeing big differences between the + and the sum().
Indeed -- clearly so.

A little testing shows how to defeat that optimization:

    blah = ''
    for string in ['booyah'] * 100000:
        blah = string + blah
Note the reversed order of the addition.
thanks -- cool trick.

Oh, and the join() timings:

    --> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah = ''").repeat(3, 1)
    [0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188]

So, + is three orders of magnitude slower than join.
Only one if you use the optimized form of +, and not even that if you need to build up the list first, which is the common use-case.

So my final question is this: repeated string concatenation is not the "recommended" way to do this -- but nevertheless, cPython has an optimization that makes it fast and efficient, to the point that there is no practical performance reason to prefer appending to a list and calling join() afterward.

So why not apply a similar optimization to sum() for strings?

-Chris
On 08/08/2014 08:23 AM, Chris Barker wrote:
So my final question is this:
repeated string concatenation is not the "recommended" way to do this -- but nevertheless, cPython has an optimization that makes it fast and efficient, to the point that there is no practical performance reason to prefer appending to a list and calling join()) afterward.
So why not apply a similar optimization to sum() for strings?
That I cannot answer -- I find the current situation with sum highly irritating. -- ~Ethan~
On Aug 8, 2014, at 11:09 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
So why not apply a similar optimization to sum() for strings?
That I cannot answer -- I find the current situation with sum highly irritating.
It is only irritating if you are misusing sum(). The str.__add__ optimization was put in because it was common for people to accidentally incur the performance penalty. With sum(), we don't seem to have that problem (I don't see people using it to add lists except just to show that could be done). Raymond
On 08/08/2014 05:34 PM, Raymond Hettinger wrote:
On Aug 8, 2014, at 11:09 AM, Ethan Furman <ethan@stoneleaf.us <mailto:ethan@stoneleaf.us>> wrote:
So why not apply a similar optimization to sum() for strings?
That I cannot answer -- I find the current situation with sum highly irritating.
It is only irritating if you are misusing sum().
Actually, I have an advanced degree in irritability -- perhaps you've noticed in the past? I don't use sum at all, or at least very rarely, and it still irritates me. It feels like I'm being told I'm too dumb to figure out when I can safely use sum and when I can't. -- ~Ethan~
On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I don't use sum at all, or at least very rarely, and it still irritates me.
You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but in Python it is 0 + a + b + c. If we had a "join" operator for strings that is different from + -- then sure, I would not try to use sum to join strings, but we don't. I have always thought that sum(x) is just a shorthand for reduce(operator.add, x), but again it is not so in Python.

While "sum should only be used for numbers," it turns out it is not a good choice for floats - use math.fsum. While "strings are blocked because sum is slow," numpy arrays with millions of elements are not. And try to explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.

Why have builtin sum at all if its use comes with so many caveats?
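[A small session illustrating the caveats listed above; the exact float digits depend on the platform's double arithmetic, but the pattern is typical:]

    import math
    from functools import reduce
    from operator import add

    nums = [0.1] * 10
    print(sum(nums))                # 0.9999999999999999 -- plain left-to-right float addition
    print(math.fsum(nums))          # 1.0 -- error-compensated summation

    print(reduce(add, ["a", "b"]))  # 'ab' -- no implicit 0, so strings are fine here
    # sum(["a", "b"]) raises TypeError, because it really computes 0 + "a" + "b"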
On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I don't use sum at all, or at least very rarely, and it still irritates me.
You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but in Python it is 0 + a + b + c. If we had a "join" operator for strings that is different from + - then sure, I would not try to use sum to join strings, but we don't.
I've long believed that + is the wrong operator for concatenating strings, and that & makes a much better operator. We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition.
I have always thought that sum(x) is just a shorthand for reduce(operator.add, x), but again it is not so in Python.
The signature of reduce is:

    reduce(...)
        reduce(function, sequence[, initial]) -> value

so sum() is (at least conceptually) a shorthand for reduce:

    def sum(values, initial=0):
        return reduce(operator.add, values, initial)

but that's an implementation detail, not a language promise, and sum() is free to differ from that simple version. Indeed, even the public interface is different, since sum() prohibits using a string as the initial value and only promises to work with numbers. The fact that it happens to work with lists and tuples is somewhat of an accident of implementation.
While "sum should only be used for numbers," it turns out it is not a good choice for floats - use math.fsum.
Correct. And if you (generic you, not you personally) do not understand why simple-minded addition of floats is troublesome, then you're going to have a world of trouble. Anyone who is disturbed by the question of "should I use sum or math.fsum?" probably shouldn't be writing serious floating point code at all. Floating point computations are hard, and there is simply no escaping this fact.
While "strings are blocked because sum is slow," numpy arrays with millions of elements are not.
That's not a good example. Strings are potentially O(N**2), which means not just "slow" but *agonisingly* slow, as in taking a week -- no exaggeration -- to concat a million strings. If it takes a microsecond to concat two strings, then 1e6**2 such concatenations could take over eleven days. Slowness of such magnitude might as well be "the process has locked up".

In comparison, summing a numpy array with a million entries is not really slow in that sense. The time taken is proportional to the number of entries, and differs from summing a list only by a constant factor.

Besides, in the case of strings it is quite simple to decide "is the initial value a string?", whereas with lists or numpy arrays it's quite hard to decide "is the list or array so huge that the user will consider this too slow?". What counts as "too slow" depends on the machine it is running on, what other processes are running, and the user's mood, and leads to the silly result that summing an array of N items succeeds but N+1 items doesn't. So in the case of strings, it is easy to make a blanket prohibition, but in the case of lists or arrays, there is no reasonable place to draw the line.
And try to explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.
I think that's because sum() has to box up each and every element in the array into an object, which is wasteful, while abs() can delegate to a specialist array.__abs__ method. Although that's not something beginners should be expected to understand, no serious Python programmer should be confused by this. As a programmer, we should expect to have some understanding of our tools, how they work, their limitations, and when to use a different tool. That's why numpy has its own version of sum which is designed to work specifically on numpy arrays. Use a specialist tool for a specialist job:

    py> with Stopwatch():
    ...     sum(carray)  # carray is a numpy array of 75000000 floats.
    ...
    112500000.0
    time taken: 52.659770 seconds
    py> with Stopwatch():
    ...     numpy.sum(carray)
    ...
    112500000.0
    time taken: 0.161263 seconds
Why have builtin sum at all if its use comes with so many caveats?
Because sum() is a perfectly reasonable general purpose tool for adding up small amounts of numbers where high floating point precision is not required. It has been included as a built-in because Python comes with "batteries included", and a basic function for adding up a few numbers is an obvious, simple battery. But serious programmers should be comfortable with the idea that you use the right tool for the right job.

If you visit a hardware store, you will find that even something as simple as the hammer exists in many specialist varieties. There are tack hammers, claw hammers, framing hammers, lump hammers, rubber and wooden mallets, "brass" non-sparking hammers, carpet hammers, brick hammers, ball-peen and cross-peen hammers, and even more specialist versions like geologist's hammers. Bashing an object with something hard is remarkably complicated, and there are literally dozens of types and sizes of "the hammer". Why should it be a surprise that there are a handful of different ways to sum items?

-- Steven
Steven D'Aprano wrote:
I've long believed that + is the wrong operator for concatenating strings, and that & makes a much better operator.
Do you have a reason for preferring '&' in particular, or do you just want something different from '+'? Personally I can't see why "bitwise and" on strings should be a better metaphor for concatenation than "addition". :-) -- Greg
Le 09/08/2014 01:08, Steven D'Aprano a écrit :
On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I don't use sum at all, or at least very rarely, and it still irritates me.
You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but in Python it is 0 + a + b + c. If we had a "join" operator for strings that is different form + - then sure, I would not try to use sum to join strings, but we don't.
I've long believed that + is the wrong operator for concatenating strings, and that & makes a much better operator. We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition.
Come on. These arguments are interminable because many people (including you) love feeding interminable arguments. No need to blame Python for that. And for that matter, this interminable discussion should probably have taken place on python-ideas or even python-list. Regards Antoine.
On 9 August 2014 06:08, Steven D'Aprano <steve@pearwood.info> wrote:
    py> with Stopwatch():
    ...     sum(carray)  # carray is a numpy array of 75000000 floats.
    ...
    112500000.0
    time taken: 52.659770 seconds
    py> with Stopwatch():
    ...     numpy.sum(carray)
    ...
    112500000.0
    time taken: 0.161263 seconds
Why have builtin sum at all if its use comes with so many caveats?
Because sum() is a perfectly reasonable general purpose tool for adding up small amounts of numbers where high floating point precision is not required. It has been included as a built-in because Python comes with "batteries included", and a basic function for adding up a few numbers is an obvious, simple battery. But serious programmers should be comfortable with the idea that you use the right tool for the right job.
Changing the subject a little, but the Stopwatch function you used up there is "an obvious, simple battery" for timing a chunk of code at the interactive prompt. I'm amazed there's nothing like it in the timeit module... Paul
On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano <steve@pearwood.info> wrote:
We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition.
But we would probably have a similar discussion about all(). :-)

Use of + is consistent with the use of * for repetition. What would you use for repetition if you use & instead? Compare, for example

    s + ' ' * (n - len(s))

and

    s & ' ' * (n - len(s))

Which one is clearer?

It is sum() that needs to be fixed, not +. Not having sum([a, b]) equivalent to a + b for any a, b pair is hard to justify.
On Sat, Aug 9, 2014 at 12:20 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano <steve@pearwood.info> wrote:
We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition.
But we would probably have a similar discussion about all(). :-)
Use of + is consistent with the use of * for repetition. What would you use for repetition if you use & instead?
If the only goal is to not be tempted to use sum() for string concatenation, how about using *? This is more consistent with mathematics terminology, where a * b is not necessarily the same as b * a (unlike +, which is commutative). As an example, consider matrix multiplication. Then, to answer your question, repetition would have been s ** n. (In fact, this is the notation for concatenation and repetition used in formal language theory.) (If we really super wanted to add this to Python, obviously we'd use the @ and @@ operators. But it's a bit late for that.) -- Devin
Alexander Belopolsky writes:
Why have builtin sum at all if its use comes with so many caveats?
Because we already have it. If the caveats had been known when it was introduced, maybe it wouldn't have been. The question is whether you can convince python-dev that it's worth changing the definition of sum(). IMO that's going to be very hard to do. All the suggestions I've seen so far are (IMHO, YMMV) just as ugly as the present situation.
On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
All the suggestions I've seen so far are (IMHO, YMMV) just as ugly as the present situation.
What is ugly about allowing strings? CPython certainly has a way to make sum(x, '') at least as efficient as

    y = ''
    for i in x:
        y += i

is now. What is ugly about making sum([a, b, ..]) be equivalent to a + b + .. so that non-empty lists of arbitrary types can be "summed"? What is ugly about harmonizing sum(x) and reduce(operator.add, x) behaviors?
Alexander Belopolsky writes:
On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
All the suggestions I've seen so far are (IMHO, YMMV) just as ugly as the present situation.
What is ugly about allowing strings? CPython certainly has a way to make sum(x, '')
sum(it, '') itself is ugly. As I say, YMMV, but in general, last I heard, arguments that are usually constants drawn from a small set of constants are considered un-Pythonic; a separate function to express that case is preferred. I like the separate function style.

And that's the current situation, except that in the case of strings it turns out to be useful to allow for "sums" that have "glue" at the joints, so it's spelled as a string method rather than a builtin: eg, ", ".join(paramlist).

Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang(tm). Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).
On Aug 10, 2014, at 05:24 PM, Stephen J. Turnbull wrote:
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).
Ever since ''.join was added, there has been vague talk about adding a join() built-in. If the semantics and argument syntax can be worked out, I'd still be in favor of that. Probably deserves a PEP and a millithread community bikeshed paintdown. -Barry
On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang(tm). Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).

Actually, there is no need to wait for 0.sum() to propose "".sum... but it is only a spelling change, so no real benefit.
Thinking about this more, maybe it should be a class function, so that it wouldn't require an instance: str.sum( iterable_containing_strings ) [ or str.join( iterable_containing_strings ) ]
On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang(tm). Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics). Actually, there is no need to wait for 0.sum() to propose "".sum... but it is only a spelling change, so no real benefit.
Thinking about this more, maybe it should be a class function, so that it wouldn't require an instance:
str.sum( iterable_containing_strings )
[ or str.join( iterable_containing_strings ) ]
That's how it used to be spelled in Python 2. --David
On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote:
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang(tm). Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics). Actually, there is no need to wait for 0.sum() to propose "".sum... but it is only a spelling change, so no real benefit.
Thinking about this more, maybe it should be a class function, so that it wouldn't require an instance:
str.sum( iterable_containing_strings )
[ or str.join( iterable_containing_strings ) ]
Sorry, I mean 'string.join' is how it used to be spelled. Making it a class method is indeed slightly different. --David
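[A rough sketch of the class-function spelling being discussed, shown as a plain helper since str obviously has no such method today; the Python 2 module function David mentions is noted for contrast:]

    def str_sum(strings, sep=""):
        # hypothetical str.sum()/str.join()-as-a-class-function spelling
        return sep.join(strings)

    print(str_sum(["to", "get", "her"]))        # 'together'
    print(str_sum(["a", "b", "c"], sep=", "))   # 'a, b, c'

    # Python 2 spelling (module function, default separator is a single space):
    #     import string
    #     string.join(["a", "b", "c"], ", ")    # -> 'a, b, c'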
participants (22)

- Alexander Belopolsky
- Allen Li
- Andrea Griffini
- Antoine Pitrou
- Barry Warsaw
- Chris Barker
- David Wilson
- Devin Jeanpierre
- Ethan Furman
- Glenn Linderman
- Greg Ewing
- Guido van Rossum
- Jim J. Jewett
- Julian Taylor
- MRAB
- Paul Moore
- R. David Murray
- Raymond Hettinger
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy