Intermediate Summary: Fast sum() for non-numbers
Hello, python-ideas. I don't want to call it "summary" yet because I hope there're other ideas out there that would be suggested later. So let's call it "Intermediate Summary". I also shifted emphasis in this summary a little, to show that a faster sum() may be just a side-effect of some ideas. Introduction ============ sum() is a great function because it makes the code simple. It was never restricted to numbers, since it's also useful for adding types like timedelta or different numpy types. Unfortunately sum() performance is INCONSISTENT. I.e. it's O(N) for numbers, but it's O(N*N) for many containers. As a result: sum( [123]*1000000, 0 ) is instant, but: sum( [[1,2,3]]*1000000, [] ) takes forever to complete. It's worth to note, that sum() is one of the most commonly suggested options to add lists [1], despite usually someone also comes and says that it may be slow. This means, that people at least often try to use sum() for lists. That case was also explicitly mentioned in comments to sum() sources. So the problem is not hypothetical. This thread was not the first time when people discussed how this problem can be solved [2]. But this is probably the most deep and detailed discussion of it. Alternatives ============ During the thread there were many alternatives suggested. Among them: * Sum is not obvious (for everyone) way to add lists, so people should not use it, as there're alternatives, i.e. instead of - sum(list_of_lists, []) one can use: - reduce(operator.iadd, list_of_lists, []) - list(itertools.chain.from_iterable(list_of_lists)) - result = [] for x in list_of_lists: result.extend(x) * Alternatives should be more visible, so that people would spotted them earlier. For example there was a suggestion to move chain (or chain.from_iterable) from itertools into builtins, optionally changing its name. * Another alternative was to introduce "reduce" in builtins under a new name "fold", optionally with adjusted parameters to have them easier to understand. * And one more alternative suggested was do nothing. I can't understand it, so I'll just quote:
You're saying what IS NOT your preferred way. But I'm asking what IS your preferred way. Do you prefer to have a slow sum in python and people asking why it's slow forever? Do you see that as a best possible case for python? YES. Godamnit YES. 100% true-to-form YES.
But I believe that all of them are workarounds (better workarounds, maybe, but still workaround), that do not solve original problem. It's like if you have a similar bug [3] in httplib you could answer that python is not the best tool to download 110Mb files, or you could suggest some customhttpclient instead, but that would not fix the bug in httplib. I'm trying to address this particular issue of sum() inconsistency being O(N) for numbers, but O(N*N) for many containers. And trying to find options that could give sum a chance to be O(N) for most use cases. Ideas ===== 0. Original patch. Original patch [4] was suggesting to use "+" for first item and "+=" for everything else. That could make sum faster not only for lists but also for all the types having __iadd__ faster than __add__. But a clever example from Oscar [5] shown, that "+" can be different from "+=" e.g. due to coercion: >>> from numpy import array >>> a = array([1, 2, 3], dtype=int) >>> b = array([1.4, 2.5, 3.6], dtype=float) >>> a + b array([ 2.4, 4.5, 6.6]) >>> a += b >>> a array([2, 4, 6]) I suggest (patch [6]) to mention this example explicitly in comments for sum(), so that future developers would not be tempted to follow this approach. 1. Fast custom types. Is it fine to have types that are O(N) summable? This question looks weird, since sum() is already O(N) for some types and O(N*N) for others, but I was several times told things like:
Today, sum is not the best way to concatenate sequences. Making it work better for some sequences but not others would mean it's still not the best way to concatenate sequences, but it would _appear_ to be. That's the very definition of an attractive nuisance.
So I want to clear this out: is it normal that sum() is fast for some types (e.g. numbers, timedeltas or FastTuples from [7]) and slow for others? Or sum just MUST BE slow and the very existence of faster containers is "attractive nuisance", and those should be expelled or slowed down too? 2. Fast built-in types. If the answer is "Yes", i.e. there's nothing bad in some types being faster than others, and there's nothing bad in sum() being faster for some types. Then what do you think about optimizing built-in types, like lists, tuples and strings? The optimization itself could have different benefits: * lower memory usage * instant creation of one instance from another * faster add operation (i.e. "c = a + b") * having it common to lists and tuples could give instant conversion of lists to tuples and back But as a side-effect of such optimization sum() will be O(N) for those types. So the question is: if it's fine for some types to be faster than others, then is it fine if those "some types" are lists or tuples? I.e. what if lists and tuples would be O(N) to sum, while some other types are not? To clear the question whether this is possible I wrote [7] a very limited version of tuple, using that approach, so that: from fasttuple import ft x = sum( [ft([1,2,3])] * 1000000, ft([]) ) takes a few seconds under a usual unpatched python. 3. Fast built-in types just for sum [8]. If it's not bad to have lists/tuples faster than other types, then how about just implementing a small special case for them? Advantage: patch is much shorter [8], than optimization for lists and tuples would be, and it would probably cover most (if not all) usages of sum() for containers. Disadvantage: this patch is limited to sum(), i.e. it gives the same performance for sum() as #2, but it does not optimize general "c = a + b" code as #2 would. Side-note: sum() was designed with special cases in mind from the beginning [9]. Initial sum() patch had a special case for strings (''.join() was used for them) but due to a complexity of code for mixed lists it was replaced by another special case rejecting strings. Currently sum has 5 special cases: three to reject strings/bytes and two optimizations for ints and floats. 4. If that would be considered a bad idea, i.e. if we decide that other custom types should have a fair chance to implement some optimization, then how about declaring a unified way to initialize a container type from iterable? Something like an optional method __init_concatenable_sequence_from_iterable__. Most containers already have such method as a default constructor, i.e. it should be trivial to implement it for any existing type. As a side-effect of this unification sum() could have a general optimization for such types, something like: if hasattr(type(start), "__init_concatenable_container_from_iterable__"): optimized_for = type(start) l_result = list() l_result.extend(start) try: while True: item = next(it) if type(item) is not optimized_for: start = optimized_for.__init_concatenable_container_from_iterable__(l_result) start = start + item break l_result.extend(item) except StopIteration: return optimized_for.__init_concatenable_container_from_iterable__(l_result) (I don't like that long __ name, it was just first I thought about. Better names are welcome. __iccfi__?) 5. Alternative unification idea was suggested by MRAB: . Get the first item, or the start value . Ask the item for an 'accumulator' (item.__accum__()). . Add the first item to the accumulator. . Add the remaining items to the accumulator. . Ask the accumulator for the result. If there's no accumulator available (the "__accum__" method isn't implemented), then either fall back to the current behaviour or raise a TypeError like it currently does for strings. I *guess* sum() was never discussed before so deeply because many people either hate sum() or believe that nothing can be done, so they don't even try to find any options. Or maybe they just prefer to use some easily available workaround and don't spend time trying to find other solutions. PS: If you can, please, explain what exactly you [don't] like in the ideas above, so I could have a chance to modify the ideas according to your opinions. PPS: I'm sorry if I missed some of the suggestions, please, add them too. PPPS: I'm also sorry not to answer to some emails during the thread. I read all of them, but there were already too many of messages in the thread, so I was trying to limit number of emails I write. If I missed something important, please, forward your email to me privately and I'll answer you either directly or on list, if you wish. -- [1] Some stackoverflow questions where sum() is suggested for lists: http://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-pytho... http://stackoverflow.com/questions/716477/join-list-of-lists-in-python http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-... http://stackoverflow.com/questions/3021641/concatenation-of-many-lists-in-py... http://stackoverflow.com/questions/7895449/merging-a-list-of-lists http://stackoverflow.com/questions/10424219/combining-lists-into-one http://stackoverflow.com/questions/11331908/how-to-use-reduce-with-list-of-l... http://stackoverflow.com/questions/17142101/concatenating-sublists-python [2] Some earlier attempts to discuss solutions to sum() slowness: 2006-01-12 http://article.gmane.org/gmane.comp.python.general/441831
A fast implementation would probably allocate the output list just once and then stream the values into place with a simple index. That's what I hoped "sum" would do, but instead it barfs with a type error.
2010-03-27 http://article.gmane.org/gmane.comp.python.general/658537 The mildly surprising part of sum() is that is does add vs. add-in- place, which leads to O(N) vs. O(1) for the inner loop calls, for certain data structures, notably lists, even though none of the intermediate results get used by the caller. For lists, you could make a more efficient variant of sum() that clones the start value and does add-in-place. [3] http://bugs.python.org/issue6838 *** httplib's _read_chunked extremely slow for lots of chunks *** As the comment in this method suggests, accumulating the value by repeated string concatenation is slow. Appending to a list speeds this up dramatically. [4] http://bugs.python.org/file30705/fastsum.patch Original patch using "+" for first item and "+=" for the rest. [5] http://bugs.python.org/issue18305#msg192873 Oscar Benjamin: This "optimisation" is a semantic change. It breaks backward compatibility in cases where a = a + b and a += b do not result in the name a having the same value. In particular this breaks backward compatibility for numpy users: [...] [6] http://bugs.python.org/issue18305#msg192956 http://bugs.python.org/file30904/fastsum-iadd_warning.patch Extended warning in sum() comments with more important numpy cases. [7] http://bugs.python.org/issue18305#msg193048 http://bugs.python.org/file30917/fasttuple.py fasttuple.py is a Proof-of-Concept implementation of tuple, that reuses same data storage when possible. Its possible usage looks similar to built-in tuples [...]. An interesting side-effect of this implementation is a faster __add__ operator: Adding 100000 of fasttuples took 0.23242688179 seconds Adding 100000 of built-in tuples took 25.2749021053 seconds [8] http://bugs.python.org/issue18305#msg192919 http://bugs.python.org/file30897/fastsum-special-tuplesandlists.patch Patch implementing a special case for tuples and lists. Should work for both Python 2.7 and Python 3.3, and should introduce no behavior change. [9] http://mail.python.org/pipermail/python-dev/2003-April/034767.html Alex Martelli (author of sum())
for the simple reason that I special-case this -- when the first item is a PyBaseString_Type, I delegate to ''.join
On 14 July 2013 20:26, Sergey <sergemp@mail.ru> wrote:
I'm trying to address this particular issue of sum() inconsistency being O(N) for numbers, but O(N*N) for many containers. And trying to find options that could give sum a chance to be O(N) for most use cases.
sum is currently simple. It just adds each element of the iterable it's called with using +. That's it. (OK, it also checks for strings and rejects them...) result = start for elem in iterable: result = result + elem The performance complexity follows directly from that definition. It is no more surprising that numbers are O(N) and containers O(N*N) than it is that the loop I show above has that complexity. That is to say, it *is* surprising, if you don't know complexity calculations very well, but it's a fundamental and trivial result once you do. Changing the performance of sum() makes reasoning about its complexity *harder* because you can't refer back to that very basic equivalence above to inform your assumptions. The suggestion of using += has merit because it keeps the equivalence simple, so that people can still reason about complexity in a straightforward manner. But there are backward compatibility concerns to review and assess. Performance is not the only consideration. But this is the underlying situation around the performance debates, as I see it. (And people saying "don't change anything" are quite possibly doing so because they view understandable performance guarantees as more important than the "attractive nuisance" of something that is often fast, but it's hard to be sure exactly when.) Paul
On Jul 14, 2013 Paul Moore wrote:
sum is currently simple.
Sum currently looks like: def sum(seq, start = 0): it = iter(seq) if isinstance(start, str): raise TypeError( "sum() can't sum strings [use ''.join(seq) instead]") if isinstance(start, bytes): raise TypeError( "sum() can't sum bytes [use b''.join(seq) instead]") if isinstance(start, bytearray): raise TypeError( "sum() can't sum bytearray [use b''.join(seq) instead]") # SPECIAL CASES if type(start) is int: i_result = int(start, overflow) if not overflow: try: start = None while start is None: item = next(it) if isinstance(item, int): b = int(item, overflow) x = i_result + b if not overflow and ((x^i_result) >= 0 or (x^b) >= 0) i_result = x continue start = i_result start = start + item except StopIteration: return i_result if type(start) is float: f_result = float(start) try: start = None while start is None: item = next(it) if isinstance(item, float): f_result += float(item) continue if isinstance(item, int): value = int(item, overflow) if not overflow: f_result += float(value); continue start = f_result start = start + item except StopIteration: return f_result # END OF SPECIAL CASES try: while True: item = next(it) result = result + item except StopIteration: return start
result = start for elem in iterable: result = result + elem
The performance complexity follows directly from that definition.
First, nothing says that sum is implemented like that. And not everyone assume it is: Paul Rubin @ 2006-01-12
A fast implementation would probably allocate the output list just once and then stream the values into place with a simple index. That's what I hoped "sum" would do, but instead it barfs with a type error. So much for duck typing.
It is no more surprising that numbers are O(N) and containers O(N*N) than it is that the loop I show above has that complexity.
Second, line "result = result + elem" does not mean that performance should be O(N*N) for containers. FastTuple container is a simple O(N) proof of that concept (see #1 and #2 in "intermediate summary").
That is to say, it *is* surprising, if you don't know complexity calculations very well, but it's a fundamental and trivial result once you do.
I do know. And I do understand that containers MAY be O(N*N) but do NOT have to, as my fasttuple example shows. But many beginners may not know that, they may just reasonably expect that python is a good language, and it's a language with dynamic typing, so its creators are probably smart guys, and their functions work similarly fast for all the types. No, really, if you put aside all your deep python knowledge and take a fresh look at line "result = result + item", what part of that line makes you think that you MUST walk through the items of "result"? I.e. nothing stops a "smart python interpreter" to spot result on both sides and optimize it to perform operation in-place. That's how a beginner could think about it. Hey, even in this list of experienced people it wasn't obvious to many that inplace modification (i.e. "+=") could be so much different from "+" in commonly used code! Beginners may not even know about __[i]add__, even less they would expect them to be much different.
Changing the performance of sum() makes reasoning about its complexity *harder* because you can't refer back to that very basic equivalence above to inform your assumptions.
Yes, I understand your reasons, you're trying to say that now you have a simple explanation of sum performance: "sum is O(N) for numbers and O(N*N) for containers". But you're wrong in both statements. First: sum is O(N) for many types, not just for numbers. E.g. it's O(N) for timedeltas, and for numpy arrays (yes, it's O(N) for numpy arrays, meaning that sum of 1000 arrays is 10 times slower than sum of 100 arrays). It's also not O(N*N) for all containers, and my fasttuple example is an evidence [1]. So your can't refer to that reasoning, because it's already wrong! Hm... If my patch shows that wrong explanation is wrong, does it adds points to the patch or takes them? ;)
The suggestion of using += has merit because it keeps the equivalence simple, so that people can still reason about complexity in a straightforward manner. But there are backward compatibility concerns to review and assess.
Yeah, I liked it too, but Oscar's numpy example have killed my optimism. So my current favourite is #2, I wish it was easy to do. And from practical point of view #3 should cover most use cases I could find and was easy to implement, so if "practicality beats purity" then #3 is the winner.
Performance is not the only consideration. But this is the underlying situation around the performance debates, as I see it. (And people saying "don't change anything" are quite possibly doing so because they view understandable performance guarantees as more important than the "attractive nuisance" of something that is often fast, but it's hard to be sure exactly when.)
Well, if they say that because of that guarantee, it means they're wrong — they're trying to keep something they don't have. I guess after 10 years of slow sum people got used to it and don't believe there can be anything good any more. That's why I'm here, I'm trying to find something good and bring the faith back to people. :) E.g. they probably don't assume that faster sum could be a side-effect of tuple being faster. I.e. would you reject the patch making tuple much faster in many cases, just because one of those cases is sum()? -- [1] http://bugs.python.org/file30917/fasttuple.py Adding 100000 of fasttuples took 0.23242688179 seconds Adding 100000 of built-in tuples took 25.2749021053 seconds
On Sun, Jul 14, 2013 at 12:26 PM, Sergey <sergemp@mail.ru> wrote:
* Sum is not obvious (for everyone) way to add lists, so people should not use it, as there're alternatives, i.e. instead of - sum(list_of_lists, []) one can use: - reduce(operator.iadd, list_of_lists, []) - list(itertools.chain.from_iterable(list_of_lists)) - result = [] for x in list_of_lists: result.extend(x)
It seems to me that in order to make sum() look more attractive, Sergey presents ugly versions of alternative ways to (efficiently) concatenate sequences. One can make these look much nicer, e.g. (assuming there is a 'from itertools import chain' at the very top of the file, which is the sensible place to put it). # If 'list_of_lists' really is as it is named, there is no need to treat it # as generic iterable. Moreover, one doesn't usually need to make an # actual instantiated list from chain() for most purposes. So: flat = chain(list_of_lists) # If we do start with an iterable of lists, but know it isn't infinite, just use: flat = chain(*iter_of_lists) If it is really needed, of course chain.from_iterable() can be used. Although the only time you'd want that is when the iterable is potentially infinite, and in that case you *definitely* don't want to make it back into a list either, just: inf_flat = chain.from_iterable(endless_lists) Another approach in one of the links Sergey gave is nice too, and shorter and more elegant than any of his alternatives: flat = [] map(flat.extend, list_of_lists) Using map() for a side effect is slightly wrong, but this is short, readable, and obvious in purpose. On the other hand, as I've said before, when I read: flat = sum(list_of_lists, []) It just looks WRONG! Yes, I know why it works, because of some quirks of Python internals. But it absolutely doesn't *read* like it should mean what it does or that it should necessarily even work at all. The word SUM is self-evidently and intuitively about *adding numbers* and *not* about "doing something that is technically supported because other things have an .__add__() method". As various people have observed, if Python used some other operator for concatenation, we wouldn't be having this discussion at all. E.g. if we had: concat = [1, 2, 3] . [4, 5, 6] Then we might have a method called .__concat__() on various collections. Conceptually that really is what Python is doing now. It's just that Guido made the very reasonable decision that the symbol "+" was something users could intuitively read as meaning concatenation when appropriate, but as addition in other cases. I definitely don't prefer some other operator than '+' to concatenate sequences. However, I think possibly if I had a time machine I might go back and change the spelling of .__add__() to .__plus__(). That might more clearly indicate that we don't really mean "mathematical addition" but rather simply "what the plus sign does". -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Sun, Jul 14, 2013 at 1:42 PM, David Mertz <mertz@gnosis.cx> wrote:
Another approach in one of the links Sergey gave is nice too, and shorter and more elegant than any of his alternatives:
flat = [] map(flat.extend, list_of_lists)
Using map() for a side effect is slightly wrong, but this is short, readable, and obvious in purpose.
Oh yeah. That's a Python 2.x thing. Now map() doesn't actually have side-effects. So you'd need to do: flat = [] set(map(flat.extend, list_of_lists)) That starts to border on ugly. One might also try: flat = [] [flat.extend(l) for l in list_of_lists] But I'm not thrilled by how that reads either. Using the chain() versions is just nicer. Moreover, if you insist on concrete collections out of it, you can take your pick (unlike sum().. although you can obviously wrap that answer in a constructor too): flat_tup = tuple(chain(*iter_of_lists)) flat_set = set(chain(list_of_lists)) -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 14 July 2013 21:42, David Mertz <mertz@gnosis.cx> wrote:
On Sun, Jul 14, 2013 at 12:26 PM, Sergey <sergemp@mail.ru> wrote:
* Sum is not obvious (for everyone) way to add lists, so people should not use it, as there're alternatives, i.e. instead of - sum(list_of_lists, []) one can use: - reduce(operator.iadd, list_of_lists, []) - list(itertools.chain.from_iterable(list_of_lists)) - result = [] for x in list_of_lists: result.extend(x)
It seems to me that in order to make sum() look more attractive, Sergey presents ugly versions of alternative ways to (efficiently) concatenate sequences.
One can make these look much nicer, e.g. (assuming there is a 'from itertools import chain' at the very top of the file, which is the sensible place to put it).
# If 'list_of_lists' really is as it is named, there is no need to treat it # as generic iterable. Moreover, one doesn't usually need to make an # actual instantiated list from chain() for most purposes. So: flat = chain(list_of_lists)
This does nothing more that iter(list)...
# If we do start with an iterable of lists, but know it isn't infinite, just use: flat = chain(*iter_of_lists)
Just use chain.from_iterable(...) here too, it might be longer but it's more flexible and has almost no downsides. This just does redundant work for the sake of saving a few characters.
If it is really needed, of course chain.from_iterable() can be used. Although the only time you'd want that is when the iterable is potentially infinite, and in that case you *definitely* don't want to make it back into a list either, just:
inf_flat = chain.from_iterable(endless_lists)
Another approach in one of the links Sergey gave is nice too, and shorter and more elegant than any of his alternatives:
flat = [] map(flat.extend, list_of_lists)
Gah! No like. flat = [] for lst in list_of_lists: flat.extend(lst) is no longer and also doesn't force you to "deque(maxlen=0).extend(...)" it.
Using map() for a side effect is slightly wrong, but this is short, readable, and obvious in purpose.
I disagree somewhat.
On the other hand, as I've said before, when I read:
flat = sum(list_of_lists, [])
It just looks WRONG!
I definitely don't think this is nearly as bad as map(flat.extend, list_of_lists); not only is this *defined* to work
Yes, I know why it works, because of some quirks of Python internals.
You think that "[1, 2, 3] + [4, 5, 6] == [1, 2, 3, 4, 5, 6]" is a quirk of python's internals?
But it absolutely doesn't *read* like it should mean what it does
You can look up the term "sum" -- it absolutely does.
or that it should necessarily even work at all. The word SUM is self-evidently and intuitively about *adding numbers*
No it's not.
and *not* about "doing something that is technically supported because other things have an .__add__() method".
Again, this is wrong. https://en.wikipedia.org/wiki/Summation
Besides numbers, other types of values can be added as well: vectors, matrices, polynomials and, in general, elements of any additive group (or even monoid).
Google's (aggregated) dictionary:
The total amount resulting from the addition of two or more numbers, amounts, or items the final aggregate; "the sum of all our troubles did not equal the misery they suffered" (a good example of where you *already know* you can sum things other than numbers*)
So first tell me why it makes sense to sum "misery" but not "lists". How is "misery" more like a number than a "list"?
As various people have observed, if Python used some other operator for concatenation, we wouldn't be having this discussion at all. E.g. if we had:
concat = [1, 2, 3] . [4, 5, 6]
Then we might have a method called .__concat__() on various collections. Conceptually that really is what Python is doing now. It's just that Guido made the very reasonable decision that the symbol "+" was something users could intuitively read as meaning concatenation when appropriate, but as addition in other cases.
I definitely don't prefer some other operator than '+' to concatenate sequences. However, I think possibly if I had a time machine I might go back and change the spelling of .__add__() to .__plus__(). That might more clearly indicate that we don't really mean "mathematical addition" but rather simply "what the plus sign does".
I agree with none of this (except the start: if Python used some other operator we'd only be having *different* discussions).
On Mon, Jul 15, 2013 at 12:16 AM, Joshua Landau <joshua@landau.ws> wrote:
flat = chain(list_of_lists)
This does nothing more that iter(list)...
Right, I forgot a '*'. I don't think that changes the point that it's already very readable, intuitive, and efficient, without resorting to the counter-intuitive sum().
flat = [] map(flat.extend, list_of_lists)
Gah! No like.
flat = [] for lst in list_of_lists: flat.extend(lst)
Well, a couple characters difference, but the explicit loop is fine also.
You think that "[1, 2, 3] + [4, 5, 6] == [1, 2, 3, 4, 5, 6]" is a quirk of python's internals?
Basically yes. Except not quite. The "quirk" is more that "the plus sign stands for the .__add__() method, even when it is being used for very different meanings on different datatypes." And again, as I point out, it's not *necessary* that Python had chosen the "+" operator as its spelling of "concatenation" ... it's a good choice, but it does invite confusion with the very different use of "+" to mean "addition". https://en.wikipedia.org/wiki/Summation
Besides numbers, other types of values can be added as well: vectors, matrices, polynomials and, in general, elements of any additive group (or even monoid).
Yep. And summing every one of those things means *addition* and never *concatenation*. There's a reason that article begins with "*Summation* is the operation of adding <https://en.wikipedia.org/wiki/Addition> a sequence<https://en.wikipedia.org/wiki/Sequence>of numbers" It's also notable that concatenation of sequences doesn't form an Abelian Group. Hell, concatenation of sequences isn't even commutative. Using sum() for a non-commutative operation verges on crazy. At the least, such a use is highly counter-intuitive.
Google's (aggregated) dictionary:
The total amount resulting from the addition of two or more numbers, amounts, or items the final aggregate; "the sum of all our troubles did not equal the misery they suffered" (a good example of where you *already know* you can sum things other than numbers*)
Summing troubles might resemble addition, metaphorically. It most certainly does not resemble concatenation. *Maybe* somewhere in the history of English usage you can find some oddball use where the meaning is vaguely similar to "concatenation." This is certainly not the common usage though. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 15 July 2013 17:42, David Mertz <mertz@gnosis.cx> wrote:
On Mon, Jul 15, 2013 at 12:16 AM, Joshua Landau <joshua@landau.ws> wrote:
flat = chain(list_of_lists)
This does nothing more that iter(list)...
Right, I forgot a '*'. I don't think that changes the point that it's already very readable, intuitive, and efficient
I agree in that chain.from_iterable is currently TOOWTDI and I don't think need that to change.
, without resorting to the counter-intuitive sum().
You think that "[1, 2, 3] + [4, 5, 6] == [1, 2, 3, 4, 5, 6]" is a quirk of python's internals?
Basically yes. Except not quite. The "quirk" is more that "the plus sign stands for the .__add__() method, even when it is being used for very different meanings on different datatypes." And again, as I point out, it's not *necessary* that Python had chosen the "+" operator as its spelling of "concatenation" ... it's a good choice, but it does invite confusion with the very different use of "+" to mean "addition".
But it *is* addition as far as Python is concerned. I don't care much about that Python's "+" aren't exclusively for any particular groups; heck -- I'd be happy if we could add a ton *more* things than we currently can. It's still addition; we're not planning Haskell here. *Even if* it was a good idea to restrict "+" to commutative groups with constant-time addition (which it is not), the ship has sailed and addition in Python means what it does. [Hence sum, being “reduce(.__add__, iterable)” so to speak, makes a ton of sense on lists.]
https://en.wikipedia.org/wiki/Summation
Besides numbers, other types of values can be added as well: vectors, matrices, polynomials and, in general, elements of any additive group (or even monoid).
Yep. And summing every one of those things means *addition* and never *concatenation*. There's a reason that article begins with "Summation is the operation of adding a sequence of numbers"
As far as Python is concerned, contancation is a form of addition. Maybe not in mathematics, nor Haskell, nor C. But it is in Python, so lists in Python are additive groups.
It's also notable that concatenation of sequences doesn't form an Abelian Group. Hell, concatenation of sequences isn't even commutative.
Notable? Yeah, fine. Important? No.
Using sum() for a non-commutative operation verges on crazy. At the least, such a use is highly counter-intuitive.
Why? It's not to me. Sure, you need to know the order that it will operate, but you need to know that for "reduce" too and no-one says using "reduce" in non-commutative ways is insane.
Google's (aggregated) dictionary:
The total amount resulting from the addition of two or more numbers, amounts, or items the final aggregate; "the sum of all our troubles did not equal the misery they suffered" (a good example of where you *already know* you can sum things other than numbers*)
Summing troubles might resemble addition, metaphorically. It most certainly does not resemble concatenation.
Yet another way in which we disagree.
*Maybe* somewhere in the history of English usage you can find some oddball use where the meaning is vaguely similar to "concatenation." This is certainly not the common usage though.
Oh, again we disagree.
On Mon, Jul 15, 2013 at 10:54 AM, Joshua Landau <joshua@landau.ws> wrote:
But it *is* addition as far as Python is concerned.
Yeah, it is. I do know that. What concatenation is NOT is "addition as far as HUMANS are concerned". Now I'll take your claim as true that 'sum(list_of_lists)' is somehow intuitive to you. So I can't say it is counter-intuitive to ALL humans. But it is counter-intuitive to MANY of us, and for us readers who think of "sum" as meaning addition in the mathematical sense, code that uses this is difficult to understand. At the least it requires a double take and an extra moment to think through what is meant (i.e. via understanding what Python does internally). I will argue that sum()'ing sequences is counter-intuitive for MOST humans. It's certainly counter-intuitive to me, and I've written a Python book, taught Python, and taken graduate mathematics courses (those experiences may pull in opposite directions though). I'm also certain it would be counter-intuitive to programming learners. I imagine trying to explain it while teaching Python, and the only thing I could do is tell them to "ignore what you think sum means, and think about the internals of Python" ... that's really not a good situation to be in pedagogically.
*Even if* it was a good idea to restrict "+" to commutative groups with constant-time addition (which it is not), the ship has sailed and addition in Python means what it does.
I'm not sure it has actually "sailed." It's very well possible to restrict sum()'ing lists or tuples in much the same way strings are excluded (even though strings have an .__add__() method). If special attention were taken to *not* work for cases where it shouldn't, we could remove this counter-intuitive behavior. That said, even though I think it is *weird* to sum a non-commutuative or non-associative operation, the runtime checks to figure out whether some custom type was such would be either outright impossible or needlessly time-consuming.
As far as Python is concerned, contancation is a form of addition. Maybe not in mathematics, nor Haskell, nor C. But it is in Python, so lists in Python are additive groups.
No, lists are not additive groups! They do satisfy closure, associativity, and an identity element of []. However, there's no invertibility on lists. So even without considering commutativity, lists fail as groups.
Why? It's not to me. Sure, you need to know the order that it will operate, but you need to know that for "reduce" too and no-one says using "reduce" in non-commutative ways is insane.
Huh?! I have no expectation generically that a sequence must be reduce'd by a commutative (nor associative) operation. That's another big way that reduce is different from sum. For example, if someone had a function "fastsum()" that took an initial pass to strike out inverse elements, that might be a reasonable approach. I'm sure it won't speed up adding ints or floats, but maybe someone has a complicated "numeric-like" type where "+" is expensive and recognition of inverse elements is cheap. Possibly this optimization would be a perfectly sensible approach to a special ?sum() function. (No, I don't think the generic sum() should be engineered to do this). This concept makes no sense whatsoever thinking about reduce(operator.add, ...) because reduce() itself doesn't make sense that way. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 15 July 2013 19:54, David Mertz <mertz@gnosis.cx> wrote:
On Mon, Jul 15, 2013 at 10:54 AM, Joshua Landau <joshua@landau.ws> wrote:
But it *is* addition as far as Python is concerned.
Yeah, it is. I do know that.
What concatenation is NOT is "addition as far as HUMANS are concerned".
But it is (sort of). I asked my brother (under 20, above 10, not sure how much more I should say on a mailing list), who is about as not-programmer-techy as any computer user could reasonably be. I asked him to add two lists. He *concatenated them*¹. I *didn't*, mind you, ask him to *do* it, but what it meant to him, pointing out that "that doesn't make much sense" is a completely valid response. I also asked about an extension to finding the "sum" of lists, and he did not find that confusing. I hence have more evidence than you as of now :P.
Now I'll take your claim as true that 'sum(list_of_lists)' is somehow intuitive to you. So I can't say it is counter-intuitive to ALL humans. But it is counter-intuitive to MANY of us, and for us readers who think of "sum" as meaning addition in the mathematical sense, code that uses this is difficult to understand. At the least it requires a double take and an extra moment to think through what is meant (i.e. via understanding what Python does internally).
I will argue that sum()'ing sequences is counter-intuitive for MOST humans.
Nah. No-one can really extrapolate that far with this tiny data-set of conflicting beliefs, and hence I shall just ignore this point.
It's certainly counter-intuitive to me, and I've written a Python book, taught Python, and taken graduate mathematics courses (those experiences may pull in opposite directions though).
It's not counter-intuitive to me and I *haven't* done those things².
I'm also certain it would be counter-intuitive to programming learners. I imagine trying to explain it while teaching Python, and the only thing I could do is tell them to "ignore what you think sum means, and think about the internals of Python" ... that's really not a good situation to be in pedagogically.
Why don't you show it to a random new coder³? Given a fair random sample, I'll be willing to put in a fairly large non-monetary internet-pride bet that I'll win this point.
*Even if* it was a good idea to restrict "+" to commutative groups with constant-time addition (which it is not), the ship has sailed and addition in Python means what it does.
I'm not sure it has actually "sailed." It's very well possible to restrict sum()'ing lists or tuples in much the same way strings are excluded (even though strings have an .__add__() method). If special attention were taken to *not* work for cases where it shouldn't, we could remove this counter-intuitive behavior.
That has nothing (directly) to do with whether addition should be restricted to commutative groups.
That said, even though I think it is *weird* to sum a non-commutuative or non-associative operation, the runtime checks to figure out whether some custom type was such would be either outright impossible or needlessly time-consuming.
Agreed on that last part.
As far as Python is concerned, contancation is a form of addition. Maybe not in mathematics, nor Haskell, nor C. But it is in Python, so lists in Python are additive groups.
No, lists are not additive groups! They do satisfy closure, associativity, and an identity element of []. However, there's no invertibility on lists. So even without considering commutativity, lists fail as groups.
I wasn't saying group in your techy math-major way, but the general English form of the word. Fair 'nuff though, I sorta' asked for that. However, I don't get your last sentence there; what does commutativity change?
Why? It's not to me. Sure, you need to know the order that it will operate, but you need to know that for "reduce" too and no-one says using "reduce" in non-commutative ways is insane.
Huh?! I have no expectation generically that a sequence must be reduce'd by a commutative (nor associative) operation. That's another big way that reduce is different from sum.
For example, if someone had a function "fastsum()" that took an initial pass to strike out inverse elements, that might be a reasonable approach. I'm sure it won't speed up adding ints or floats, but maybe someone has a complicated "numeric-like" type where "+" is expensive and recognition of inverse elements is cheap. Possibly this optimization would be a perfectly sensible approach to a special ?sum() function. (No, I don't think the generic sum() should be engineered to do this). This concept makes no sense whatsoever thinking about reduce(operator.add, ...) because reduce() itself doesn't make sense that way.
But that only makes sense for types where you can find an inverse, and thus would be applicable only to those elements (with the API that allows you to find the inverse). You wouldn't say that because fsum takes only floats (not sure if true) that sum does not apply to integers, even though it's a *completely reasonable optimisation* (in the sense that accuracy is an optimisation). To put it another way, there are specialised reduces you can write that are faster for specific types; you could think of "list(chain.from_iterable(x))" as "reduce(operator.add, x)" optimised for lists. That doesn't mean that "reduce(operator.add, x)" is only applicable to lists. ¹ Sort of. He combined them as you would sorted Counters, where duplicate items were doubled-up on, but otherwise order was preserved. I think that is reasonably close. ² Well, I've taught Python, but hardly to the extent you are claiming. ³ Not so new that they wouldn't get, say, "map(list, list_of_tuples)", though.
On Mon, Jul 15, 2013 at 12:24 PM, Joshua Landau <joshua@landau.ws> wrote:
But it is (sort of). I asked my brother (under 20, above 10, not sure how much more I should say on a mailing list), who is about as not-programmer-techy as any computer user could reasonably be. I asked him to add two lists. He *concatenated them*¹.
Here's my experimental contribution. I cut and drew on three pieces of paper similar to the below ASCII art: ______________________ | 4 5 6 2 1 ______________________ | 6 12 13 19 100 ______________________ | 100 200 300 In particular, I put small integers on them, but for generality made the numbers sometimes out of natural sort order. I also made the lists of different lengths so that elementwise addition would pose a problem (a subject *could* decide to fill in the additive identity zero for the "missing" elements if she wanted to, but this would have to be a decision). I also placed the papers deliberately so that the left edges were not aligned (as pictured) so the notion of columns would not be forced on an informant (but not prohibited either). I found as a subject a "programming-naive" but well-educated subject in her 40s (a friend, no kidnapping of strangers off the street). I asked something worded close to the following: "Can you sum these lists? An acceptable answer would be that the question does not make sense. If it does make sense, what result do you get?" As a possible aid, I had a notepad placed nearby, in case some sort of copying operation was felt relevant (but I just made sure the notepad was on the table, I didn't say anything about whether it should or should not be used). Her answer was to write the additive sum of *each* slip of paper (list). I.e. three numbers: 18, 150, 600. In other words, she reads it as: sum([[4,5,6,2,1], [6,12,13,19,100], [100,200,300]]) == [18, 150, 600] Well, this doesn't technically mean *anything* in Python since no 'start' value is given on the left. But essentially her intuition is that it means: map(sum, [[4,5,6,2,1], [6,12,13,19,100], [100,200,300]]) Actually, that Python version is especially accurate, because what my informant actually said was "Do you want me to actually make the calculations?! That's what I'd do!" So much as with 3.x map, she didn't actually consume the iterator until needed. ¹ Sort of. He combined them as you would sorted Counters, where
duplicate items were doubled-up on, but otherwise order was preserved. I think that is reasonably close.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 15 July 2013 22:36, David Mertz <mertz@gnosis.cx> wrote:
On Mon, Jul 15, 2013 at 12:24 PM, Joshua Landau <joshua@landau.ws> wrote:
But it is (sort of). I asked my brother (under 20, above 10, not sure how much more I should say on a mailing list), who is about as not-programmer-techy as any computer user could reasonably be. I asked him to add two lists. He *concatenated them*¹.
Here's my experimental contribution.
Thank you, this was interesting (although I don't particularly agree with your interpretation).
I found as a subject a "programming-naive" but well-educated subject in her 40s (a friend, no kidnapping of strangers off the street). I asked something worded close to the following:
"Can you sum these lists? An acceptable answer would be that the question does not make sense. If it does make sense, what result do you get?"
Unfortunately, that's not, in my opinion, an accurate way to phrase the question. If I were to say "can you mow these lawns" or "can you catch these criminals" or "can you run these tests", I would be asking for something along the lines of "map(verb, noun)". That's what she dutifully did, as you note. You should have asked "can you sum this list [note the singular noun] of lists?" or somesuch. Because that's odd linguistically, you can say instead "can you sum these lists together", although that's obviously a bit of a rigged statement. You're free to think of a better compromise. Apologies for my fussiness.
As a possible aid, I had a notepad placed nearby, in case some sort of copying operation was felt relevant (but I just made sure the notepad was on the table, I didn't say anything about whether it should or should not be used).
Copying your experimental choices, I actually think I rigged mine a bit by giving a non-numeric values inside my lists, making more imaginative choices much less likely. Kudos for the well-run experiment, too (ignoring my issue of phrasing).
Her answer was to write the additive sum of *each* slip of paper (list). I.e. three numbers: 18, 150, 600.
In other words, she reads it as:
sum([[4,5,6,2,1], [6,12,13,19,100], [100,200,300]]) == [18, 150, 600]
Well, this doesn't technically mean *anything* in Python since no 'start' value is given on the left. But essentially her intuition is that it means:
map(sum, [[4,5,6,2,1], [6,12,13,19,100], [100,200,300]])
Actually, that Python version is especially accurate, because what my informant actually said was "Do you want me to actually make the calculations?! That's what I'd do!" So much as with 3.x map, she didn't actually consume the iterator until needed.
:). This just proves that Python is a human.
On Mon, Jul 15, 2013 at 2:56 PM, Joshua Landau <joshua@landau.ws> wrote:
"Can you sum these lists? An acceptable answer would be that the question does not make sense. If it does make sense, what result do you get?"
You should have asked "can you sum this list [note the singular noun] of lists?" or somesuch. Because that's odd linguistically, you can say instead "can you sum these lists together", although that's obviously a bit of a rigged statement. You're free to think of a better compromise.
I agree that the result is likely to depend a lot on the nuance of how it is worded to a native speaker. I just asked my friend--who is now, however, no longer experimentally naive--what she would have said had I asked "Can you sum these lists together?" I feel like "Can you sum this list of lists?" would just sound perverse to a non-programmer, although I agree that if they thought about it slowly they'd be more likely to come up with concatenation (but I still think most wouldn't do so). Her answer was that she would have produced a single number that was the total of all the numbers in all the lists (or equivalently, the sum of the three map(sum, ...) items). I didn't actually lay out the papers again or make her perform the additions though :-). Of course, we're still talking about one more data point really. Although I have intuitions, there are millions or billions of non-programmers or aspiring programmers, and obviously answers would vary. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 07/15/2013 04:36 PM, David Mertz wrote:
On Mon, Jul 15, 2013 at 12:24 PM, Joshua Landau <joshua@landau.ws <mailto:joshua@landau.ws>> wrote:
But it is (sort of). I asked my brother (under 20, above 10, not sure how much more I should say on a mailing list), who is about as not-programmer-techy as any computer user could reasonably be. I asked him to add two lists. He *concatenated them*¹.
Here's my experimental contribution. I cut and drew on three pieces of paper similar to the below ASCII art:
______________________ | 4 5 6 2 1
______________________ | 6 12 13 19 100
______________________ | 100 200 300
In particular, I put small integers on them, but for generality made the numbers sometimes out of natural sort order. I also made the lists of different lengths so that elementwise addition would pose a problem (a subject *could* decide to fill in the additive identity zero for the "missing" elements if she wanted to, but this would have to be a decision). I also placed the papers deliberately so that the left edges were not aligned (as pictured) so the notion of columns would not be forced on an informant (but not prohibited either).
I found as a subject a "programming-naive" but well-educated subject in her 40s (a friend, no kidnapping of strangers off the street). I asked something worded close to the following:
"Can you sum these lists? An acceptable answer would be that the question does not make sense. If it does make sense, what result do you get?"
As a possible aid, I had a notepad placed nearby, in case some sort of copying operation was felt relevant (but I just made sure the notepad was on the table, I didn't say anything about whether it should or should not be used).
Her answer was to write the additive sum of *each* slip of paper (list). I.e. three numbers: 18, 150, 600.
Nice test. So what the discussion is trying to determine is, is it better to treat sequences as fundamentally different, than values, in python. Often times in human language, we need to give more information to get the desired point across. One way is to use a common simpler expression, with another simple hint. OR the other way is to use more specific language that doesn't require a hint. The 'hint' can be subtle, and is usually not included in examples we use when comparing human language to computer language. That makes the argument for a simpler term a bit stronger. So we have these two approaches... (1) In the case of sum(x, start), the hint would be the variable name of either x, or start. Without that hint, you need to scan forward or backwards in the source code to figure out what sum() will actually do. (2) The more specific approach would be to have two functions that don't need hints because their purpose is more limited. sum_values(x) # Not really needed as sum() is good at this already. sum_iters(x) # Fast sum of iters.. Because they are more limited in scope, they can be optimised to a greater deal, and made to work without a start value. Sergey's patch does increase the speed quite a bit, (over the other suggestions), and combining lists is fairly common, so I do thing it should be used in either a new function, or in sum(), depending on how the developers feel about weather or not it is better to create more separation between how sequences and values are treated. Although, if one of the other alternatives can be made as fast, then that would be good too. So ... + 1 Add specialised sum_iters(), (based on the sum() patch.) + .5 increase other options speed.. (not easy or even possible) + .25 Patch sum, and document it's use for sequences. [1] [1] Assuming this can be done with no backwards compatibility or new side effects. Lots of new tests should be added for this. By far, the easiest and least problematic choice is to add a new function. Cheers, Ron
On 15 July 2013 23:27, Ron Adam <ron3200@gmail.com> wrote:
So ...
+ 1 Add specialised sum_iters(), (based on the sum() patch.)
You mean chain.from_iterable? That's the fastest we're getting for iterables. Maybe sum_lists() could be faster, but then we're into "if you need that niche and you need it that fast, write it yourself" territory.
+ .5 increase other options speed.. (not easy or even possible) + .25 Patch sum, and document it's use for sequences. [1]
[1] Assuming this can be done with no backwards compatibility or new side effects. Lots of new tests should be added for this.
But then it doesn't duck-type well ⇒ people should avoid using it ⇒ the original change just becomes an attractive nuisance chain.from_iterable doesn't have this problem.
By far, the easiest and least problematic choice is to add a new function.
On 07/15/2013 05:40 PM, Joshua Landau wrote:
On 15 July 2013 23:27, Ron Adam<ron3200@gmail.com> wrote:
So ...
+ 1 Add specialised sum_iters(), (based on the sum() patch.) You mean chain.from_iterable?
No, I mean a new function written in C, which writes (appends) the values directly into a new (or start) sequence. Chain from_iterable builds a generator and yields the items out. That's not going to be as fast, but it does use much less memory in many situations.
That's the fastest we're getting for iterables. Maybe sum_lists() could be faster, but then we're into "if you need that niche and you need it that fast, write it yourself" territory.
If it was a python function writen in python, this would be true, but as a builtin C function, it could be faster. Common built-in types could be optimised in sum_iters(), just as Sergery has done in the patch for sum(). One of the main sticking points in the discussion is weather or not sum() should be a recommended way of summing non-number types. Adding a new function supports the (current) view that sum shouldn't be recommended to sum non-number types. (Although it would still work for backwards compatibility reasons.)
+ .5 increase other options speed.. (not easy or even possible) + .25 Patch sum, and document it's use for sequences. [1]
[1] Assuming this can be done with no backwards compatibility or new side effects. Lots of new tests should be added for this.
But then it doesn't duck-type well ⇒ people should avoid using it ⇒ the original change just becomes an attractive nuisance
That's why the tests are needed, and why it's not my first choice.
chain.from_iterable doesn't have this problem.
I agree. It's about using the right tool for the right job. Not weather one is better than the other.
By far, the easiest and least problematic choice is to add a new function.
How about a decision tree? <Recommend sum() for summing iterables?> # <<<< Are we still stuck here? (YES) patch sum. Add more tests to cover questionable cases. <Does it pass all tests?> (YES) [GOTO 2] (NO) [GOTO 1] A: (NO) # I think we should be here. Crate new function based on the sum() patch. create many tests. <Does it pass tests> (YES) <Is it significantly faster or better in some way than other alternatives?> {AND} <Is there enough use cases to support adding it? [*1]> (YES) [GOTO B] (NO) [DONE] # Good idea, but not worth doing. (NO) [DONE] # Something wrong with idea. B: Add docs to patch. <Ask for inclusion?> (YES) # Accepted! Add news entry if needed to patch. apply patch [DONE] # Yay (NO) rejected # Goto "A" if changes are needed. [DONE] # We tried. [*2] [*1] A preferred way to verify this is to find places in python's library where it helps make the code better in some way. Finding enough of these examples is a good indication it's on the right track and helps significantly in convincing others it's worth doing. [*2] It might be determined that the patch would cause problems down the road, be difficult to maintain, or there might be some other competing idea that would be preferred. Of course there would be feedback cycles in most of these steps and others might put things in a different order, but it's pretty much follows the standard path most patches are done on the tracker. Lets help Sergey through this process and not be too quick to reject his ideas. Cheers, Ron
On Mon, Jul 15, 2013 at 7:12 PM, Ron Adam <ron3200@gmail.com> wrote:
+ 1 Add specialised sum_iters(), (based on the sum()
You mean chain.from_iterable?
No, I mean a new function written in C, which writes (appends) the values directly into a new (or start) sequence. Chain from_iterable builds a generator and yields the items out. That's not going to be as fast, but it does use much less memory in many situations.
If a new function could *actually* be significantly faster than chain.from_iterable(), I think it would be reasonable to have. However, if writing something new as basically an alias for 'list(chain(...))' only gets us, say 10% speedup, I think nothing should be included. But PLEASE, don't call such a new function sum_iters(). The obviously correct name for such a thing is 'concat()'. This is what I've argued a bunch of times, but let's just call concatenation by its actual name rather than try to squint in just the right way to convince ourselves that "summation" is the same concept. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 16 July 2013 04:22, David Mertz <mertz@gnosis.cx> wrote:
On Mon, Jul 15, 2013 at 7:12 PM, Ron Adam <ron3200@gmail.com> wrote:
+ 1 Add specialised sum_iters(), (based on the sum()
You mean chain.from_iterable?
No, I mean a new function written in C, which writes (appends) the values directly into a new (or start) sequence. Chain from_iterable builds a generator and yields the items out. That's not going to be as fast, but it does use much less memory in many situations.
If a new function could *actually* be significantly faster than chain.from_iterable(), I think it would be reasonable to have. However, if writing something new as basically an alias for 'list(chain(...))' only gets us, say 10% speedup, I think nothing should be included.
I'm not convinced. I have three reasons, and I have full faith in all three. 1) Ignoring speed, I don't believe there are *any* use-cases where concat(...) *isn't worse* than list(chain.from_iterable(...)), excluding fabricated ones that have never happened. 2) I'm not convinced this is a bottleneck for a significant number of people; chain is much faster than most constructs we have already, so it'd be off for the chaining to be the slowest part 3) This belongs on PyPi, not stdlib, as it is niche and we can already do the same thing with the same asymptotic performance. Blist gives *way* more speed advantages yet that's been rejected from stdlib so I disagree that this can cross the barrier. I realise blist was rejected for the additional reason of trying to replace our normal lists, but it could've gotten into stdlib so that doesn't discount the analogy.
But PLEASE, don't call such a new function sum_iters(). The obviously correct name for such a thing is 'concat()'. This is what I've argued a bunch of times, but let's just call concatenation by its actual name rather than try to squint in just the right way to convince ourselves that "summation" is the same concept.
I agree, because a summation that only works on lists is really just concat after-all.
OK... I have an update to my experiment. I asked by email another friend of mine (also middle-aged adult, well-educated, a doctorate in humanities FWIW [like me]), this one has had an introductory exposure to programming, and in Python specifically, but just one semester-long intro course (the wonderful edX MIT Intro to CS, a few months ago). This is what I sent her (I also attached my report of the physical experiment that I subjected our mutual friend to): Before you read the long post below, just consider this part up here and tell me what you think. Consider this Python code: list_of_lists = [ [4, 5, 6, 2, 1], [6, 12, 13, 19, 100], [100, 200, 300] ] result = sum(list_of_lists) You may not have used the built-in function sum before. But here is an example of its most common usage:
sum([10,20,30,40]) 100
So the question is: what is your intuition about what 'result' *should* be? If you happen to know that Python does one thing, but feel like it *should* do a different thing, that is very interesting too. One answer that you might feel is that it doesn't make sense and there should be an exception raised on that line. For example, this happens:
sum(['foo','bar']) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unsupported operand type(s) for +: 'int' and 'str'
Her reply: "I would want an exception raised cause it doesn't make sense." Yes, I know the code actually *does* raise an exception already; but I spoke with her to, and her answer isn't "it should raise an exception because the start= is missing" ... at a beginner level she explicitly means "It doesn't make sense to sum lists, only numbers." Actually, this makes me think even more strongly that an identity element for a type is necessary for sum() to make ANY sense. You need to have an implied identity as the starting state... yes, in principle we could let sum infer the list identity [], but for other types where that doesn't even exist, it seems even worse... even where it will work for technical reasons.
On 19/07/13 04:42, David Mertz wrote:
OK... I have an update to my experiment. I asked by email another friend of mine (also middle-aged adult, well-educated, a doctorate in humanities FWIW [like me]), this one has had an introductory exposure to programming, and in Python specifically, but just one semester-long intro course (the wonderful edX MIT Intro to CS, a few months ago).
While we're swapping anecdotes instead of doing proper UI testing *grin* (or perhaps API testing would be a better term), I asked two of my programmers at work what they would do if they had a list of lists and needed to concatenate them into a single list. Both are moderately familiar with Python, although one (call him R) is more familiar with Ruby and Objective-C. R started with some Ruby code, which was basically the equivalent of reduce, then when I told him that lists supported + for concatenation, said he'd try sum(). At this point, the other programmer (call him M) rolled his eyes and said "You've got to be kidding!". He did not know that lists use + for concatenation, nor that sum() supports lists, and thought that both were terrible ideas and that he'd use a for-loop with extend. When I hinted that sum() might have performance issues, R was at first incredulous, then once he understood the reason for the performance issues, shrugged and said something along the lines of "Oh well, I'd use sum(), and if it turned out to be too slow for my data, I'd refactor it to use a for-loop. No big deal." Neither thought that they would gain much if sum() was sped up, M because he wouldn't use it to join lists regardless of speed, and R because he thought it would be unlikely that he'd ever be in a position of needing to join sufficient numbers of lists to really make a difference, but if he was, refactoring was such a trivial exercise that he didn't care one way or another. -- Steven
On 19 July 2013 05:07, Steven D'Aprano <steve@pearwood.info> wrote:
On 19/07/13 04:42, David Mertz wrote:
OK... I have an update to my experiment. I asked by email another friend of mine (also middle-aged adult, well-educated, a doctorate in humanities FWIW [like me]), this one has had an introductory exposure to programming, and in Python specifically, but just one semester-long intro course (the wonderful edX MIT Intro to CS, a few months ago).
While we're swapping anecdotes instead of doing proper UI testing *grin* (or perhaps API testing would be a better term), I asked two of my programmers at work what they would do if they had a list of lists and needed to concatenate them into a single list. Both are moderately familiar with Python, although one (call him R) is more familiar with Ruby and Objective-C.
Thank you both, again. One conclusion I think it's safe to take from this is that it is *not* a clear-cut issue as many people, myself included, had assumed. I think it's fair to say that people claiming (Stefan Behnel in this case, because it was the first quotation I found):
IMHO, the reason why sum() supports other input types than numbers is that in Python 2, at the time when it was implemented, there was no clear definition of a "number"
or similar have missed this point. My claims that it obviously does make sense (although it's not necessarily a good idea) was equally wrong in this regard. Also,
Neither thought that they would gain much if sum() was sped up, M because he wouldn't use it to join lists regardless of speed, and R because he thought it would be unlikely that he'd ever be in a position of needing to join sufficient numbers of lists to really make a difference, but if he was, refactoring was such a trivial exercise that he didn't care one way or another.
I agree in large with R, except that I'd never opt for sum() over chain.from_iterable().
Joshua Landau, 19.07.2013 11:05:
One conclusion I think it's safe to take from this is that it is *not* a clear-cut issue as many people, myself included, had assumed. I think it's fair to say that people claiming (Stefan Behnel in this case, because it was the first quotation I found):
IMHO, the reason why sum() supports other input types than numbers is that in Python 2, at the time when it was implemented, there was no clear definition of a "number"
or similar have missed this point.
Not sure what point you mean exactly here, but as I said before, using sum() on lists of lists seems like one of those cool little hacks at first, but actually isn't. It's really just a bad idea. Even if you personally don't consider it weird that lists can be summed up, it definitely violates the "don't make me think" principle for many people. Rephrasing the quote above, I actually consider sum(lists) an implementation artefact. Definitely not a feature, and most likely not even deliberate. Stefan
On 7/15/2013 2:54 PM, David Mertz wrote:
What concatenation is NOT is "addition as far as HUMANS are concerned".
Have you really never added together two or more shopping lists? Have you never lengthened a string (such as a kite string) or rope by adding (concatenating) another piece? Have you never added two piles of papers together by piling one on top of the other (and order matters here). As I posted before, What HUMANS do not do is 'concatenate' things. Nor is addition by concatenation, as in the examples above, usually considered summation. Summation usually implies condensation. Summing multiple numbers produces one number. With a mixture of + and - munbers, the sum may even be less in magniture than the largest. Summing up an hour meeting should produce a statement of, say, a minute or less. A concatenation of everything said is not a summation. Concatenation does not 'condense' or 'reduce', and that, I think, is why some do not see sum as applying to sequence joining. In Peano arithmetic, in math, addition of numbers (counts) amounts to concatenation of sequences of successor operators. -- Terry Jan Reedy
On Jul 15, 2013 David Mertz wrote:
It seems to me that in order to make sum() look more attractive, Sergey presents ugly versions of alternative ways to (efficiently) concatenate sequences.
No, I actually tried to choose the most popular and obvious ones. Of course there're more. E.g. I tested 6 [1]. Do you think that people often know *args notation or add infinite lists? ;) But anyway all of them are just workarounds. Better workarounds, maybe, but still workarounds, and their existence don't fix sum(). Same as existence of wget or curl does not fix the bug in httplib [2].
flat = [] map(flat.extend, list_of_lists) Using map() for a side effect is slightly wrong, but this is short, readable, and obvious in purpose.
Ehm, you were joking when you called that obvious, right? ;)
Now I'll take your claim as true that 'sum(list_of_lists)' is somehow intuitive to you. So I can't say it is counter-intuitive to ALL humans.
So what? Lot's of things are counter-intuitive in programming languages, including python. For example, don't you think that a = a + b and a += b should be same, and it's counter-intuitive to have them different. Yet you can have them different in python.
But it is counter-intuitive to MANY of us, and for us readers who think of "sum" as meaning addition in the mathematical sense, code that uses this is difficult to understand.
If it makes no mathematical sense then don't apply math to it. :) In mathematical sense something like: x = x + 1 is completely counter-intuitive. There's no "x" for which that equation could be true. Yet, this line is obvious to every programmer. I mean, a function does not have to be intuitive for mathematicians in order to be a useful tool for programmers. :) And the point is: they already use it, so we already have a bug, I'm just searching for the best way to fix it. Having a broken tool on your shelves and saying "don't take that, it's broken" to everyone every time is not a solution.
I will argue that sum()'ing sequences is counter-intuitive for MOST humans. It's certainly counter-intuitive to me, and I've written a Python book, taught Python, and taken graduate mathematics courses (those experiences may pull in opposite directions though).
Be careful about things like "I've written books, taught Python..." because someone may catch you on some stupid typo (e.g. a missing * in chain() call) and ask what are you teaching if you don't know basics yourself. :)
I'm not sure it has actually "sailed." It's very well possible to restrict sum()'ing lists or tuples in much the same way strings are excluded (even though strings have an .__add__() method). If special attention were taken to *not* work for cases where it shouldn't, we could remove this counter-intuitive behavior.
It's actually easier to fix it, than to restrict it. To fix it you just need to apply a patch [3]. You can do that even for Python2, since it introduces no behaviour change. But restricting it means removing a feature, it's a major change that breaks backward compatibility. —— [1]http://mail.python.org/pipermail/python-ideas/2013-July/022065.html [2]http://bugs.python.org/issue6838 [3]http://bugs.python.org/file30897/fastsum-special-tuplesandlists.patch
On Mon, Jul 15, 2013 at 8:21 PM, Sergey <sergemp@mail.ru> wrote:
Do you think that people often know *args notation or add infinite lists? ;)
I think people now know *args notation, yes (and yes, I make typos notwithstanding being a professional writer). As for infinite lists, I actually had in mind a fairly specific and ordinary example. I can imagine that one has one iterator the yields files, which themselves iterate over lines within files. Maybe not *infinite* quite, but one might have a lot of files (even a lot matching some spec) on a slow network drive, and each of those files might have a lot of lines. One plausible thing to do is keep searching until we find "the right line" ... but we don't want to actually open millions of files if we don't need to. E.g.: for line in chain.from_iterable(lines_in_many_files): got_it = check_the(line) if got_it: break -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 07/14/2013 02:26 PM, Sergey wrote:
* Sum is not obvious (for everyone) way to add lists, so people should not use it, as there're alternatives, i.e. instead of - sum(list_of_lists, []) one can use: - reduce(operator.iadd, list_of_lists, []) - list(itertools.chain.from_iterable(list_of_lists)) - result = [] for x in list_of_lists: result.extend(x)
I have nothing against increasing sum()'s speed. I think the real issue is having a very fast way to sum non-numbers. You could copy the code from sum() and make a function with a different name that specialises in non-number addition. That would not have any backwards compatibility issues. Do you think you could use what you learned with sum to make chain, or a new fold function faster? Both of those have some advantages over sum(), but aren't as fast. If you could make those faster, then that would be very nice. The advantages are.. Chain works with most iterables and uses much less memory in some cases. A new fold function would do quite a lot more depending on the operator passed to it. It may be possible to speed up some common cases that use methods on builtin types. Cheers, Ron
On Jul 15, 2013 Ron Adam wrote:
You could copy the code from sum() and make a function with a different name that specialises in non-number addition. That would not have any backwards compatibility issues.
That would just add one more workaround, but would not solve the problem. Sum is already O(N*N) for many containers, and adding more workarounds would not make it any faster. As for me using "+" e.g. to add strings looks rather obvious. Adding lists looks similar to adding strings. And sum() looks like a lot of "+"es. I mean I see nothing strange in using sum for list of lists. It as natural as using max() to find the "largest" word in a list of strings — a nice and expected feature. But *if* in some distant python version we would have separate operations for numerical addition and sequence concatenation, then we might split our current sum() into two functions: sum() and concat(). But to do that we must first solve the problem for sum(). Or we'll have exactly same problem with concat() anyway.
Do you think you could use what you learned with sum to make chain, or a new fold function faster?
I'm not sure that chain or reduce/fold could benefit from them, since all suggestions are about containers and "+" operator, except, maybe, suggestion #2, since everybody are expected to benefit from it.
The advantages are..
Chain works with most iterables and uses much less memory in some cases.
That's the main difference: sum returns a container, which is complete and ready to use, while chain returns some weird type. :) It does not use a container, so you can't have any container-specific optimization in it. I.e. imagine: x = [[1,2], [3,4], 5, 6] now try: for i in sum(x, []): print(i) and: for i in chain.from_iterable(x): print(i) in first case you'll get an error instantly, while in second case you'll have print called several times before you get an error.
A new fold function would do quite a lot more depending on the operator passed to it. It may be possible to speed up some common cases that use methods on builtin types.
I'm not sure what would be a common case for it. It looks like a renamed reduce. What's a common case for reduce?
On 07/15/2013 10:58 PM, Sergey wrote:
On Jul 15, 2013 Ron Adam wrote:
You could copy the code from sum() and make a function with a different name that specialises in non-number addition. That would not have any backwards compatibility issues.
That would just add one more workaround, but would not solve the problem. Sum is already O(N*N) for many containers, and adding more workarounds would not make it any faster.
Sorry for not being clearer. I meant to use your patch for sum as a basis to write a new function that is efficient for containers.
As for me using "+" e.g. to add strings looks rather obvious. Adding lists looks similar to adding strings. And sum() looks like a lot of "+"es. I mean I see nothing strange in using sum for list of lists. It as natural as using max() to find the "largest" word in a list of strings — a nice and expected feature. But*if* in some distant python version we would have separate operations for numerical addition and sequence concatenation, then we might split our current sum() into two functions: sum() and concat().
But to do that we must first solve the problem for sum(). Or we'll have exactly same problem with concat() anyway.>
Or.. you can solve the problems for concat(), and later when a new major version of python is released, we might be able to make sum() use the same techniques. It can work both ways I think. I got the impression that you already know how, or several possible ways to solve the problem with sum(), but are running up against some backward compatibility issues? ie.. can't use __iadd__ in place of __add__. And also some ideological issues about what others think sum() should or shouldn't do. And how much should or should not be special cased. A new function would allow you to write the function how you think it will work the best. And there would not be any of those issues. As far as the meaning of the word "sum" goes and how it's used in english, It's not the most important issue to me. Get something to work, and demonstrate it works and is useful... then we can have a discussion about what to call it. Take a poll, and if it's still not decided, ask one of the core developers to choose from the top name candidates. That would work fine for me. What's important to me is that I have a way to write nice programs. Once I used whatever function a few times, it's name takes on the meaning of what it does. I just want something I can easily find if I need to look up the details for it. Cheers, Ron
I don't understand why people try to use sum() for anything other than a sequence of numbers. If you want to flatten a list, use a flatten function. Here's a performance comparison of a few possible implementations: http://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-pytho... Pick one and you're fine :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 15 2013)
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2013-07-16: Python Meeting Duesseldorf ... tomorrow ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
M.-A. Lemburg, 15.07.2013 09:11:
I don't understand why people try to use sum() for anything other than a sequence of numbers.
If you want to flatten a list, use a flatten function.
+1 And, while we're at it, we can just as well ask why sum([[1,2,1], [2,1,2], [3,4,5,[6,7]], [[4,3], 1]]) doesn't return 42. IMHO, this would make a lot more sense than returning a concatenated list. Stefan
On 18/07/13 04:07, Stefan Behnel wrote:
M.-A. Lemburg, 15.07.2013 09:11:
I don't understand why people try to use sum() for anything other than a sequence of numbers.
If you want to flatten a list, use a flatten function.
+1
And, while we're at it, we can just as well ask why
sum([[1,2,1], [2,1,2], [3,4,5,[6,7]], [[4,3], 1]])
doesn't return 42.
Why would it return 42? List addition is not defined as element-by-element addition, it is defined as concatenation. To put it another way, since [1, 2, 1] + [2, 1, 2] returns [1, 2, 1, 2, 1, 2], not 9, sum() returns the same.
IMHO, this would make a lot more sense than returning a concatenated list.
I'm afraid that this doesn't make sense to me. What you're showing is neither list addition as it is defined now (concatenation), nor element-by-element addition, both of which would raise exceptions, but a recursive flatten immediately followed by a summation. And this is just after you agreed that if you want to flatten a list, you should call a flatten function, not sum! Regardless of how sum() might have been defined, or should have been designed, or whether list addition should be defined as element-by-element addition rather than concatenation, sum() has been around for about ten years now, and we're constrained by backwards compatibility. If anyone wants to seriously argue for breaking backwards compatibility, please don't argue here until you have got at least the beginnings of a PEP written. -- Steven
Steven D'Aprano, 18.07.2013 04:52:
On 18/07/13 04:07, Stefan Behnel wrote:
M.-A. Lemburg, 15.07.2013 09:11:
I don't understand why people try to use sum() for anything other than a sequence of numbers.
If you want to flatten a list, use a flatten function.
+1
And, while we're at it, we can just as well ask why
sum([[1,2,1], [2,1,2], [3,4,5,[6,7]], [[4,3], 1]])
doesn't return 42.
Why would it return 42? List addition is not defined as element-by-element addition, it is defined as concatenation. To put it another way, since [1, 2, 1] + [2, 1, 2] returns [1, 2, 1, 2, 1, 2], not 9, sum() returns the same.
IMHO, this would make a lot more sense than returning a concatenated list.
I'm afraid that this doesn't make sense to me.
The point I was trying to make is that a "sum" is well defined for numbers but not for lists, so even a recursive sum over lists of lists makes more sense than calling sum() on a sequence of lists and expecting it to concatenate those lists. I really don't see the link between summing up items and concatenating them. If the function was called "concatenate()", then, yes, ok, I'd expect it to concatenate lists that I feed into it. But it's called sum(). And it's called that because the name describes what it does. The mere attempt to "sum" lists is just so misguided that it's really not worth any discussion, especially not about making it easier! Just to be clear, I definitely wasn't proposing to extend sum() to support recursive summing. Stefan
On 18 July 2013 16:48, Stefan Behnel <stefan_ml@behnel.de> wrote:
The point I was trying to make is that a "sum" is well defined for numbers but not for lists,
False.
so even a recursive sum over lists of lists makes more sense than calling sum() on a sequence of lists and expecting it to concatenate those lists.
I disagree.
I really don't see the link between summing up items and concatenating them. If the function was called "concatenate()", then, yes, ok, I'd expect it to concatenate lists that I feed into it. But it's called sum(). And it's called that because the name describes what it does. The mere attempt to "sum" lists is just so misguided that it's really not worth any discussion, especially not about making it easier!
You're discussing it, and it isn't misguided.
Sergey, 14.07.2013 21:26:
It's worth to note, that sum() is one of the most commonly suggested options to add lists [1], despite usually someone also comes and says that it may be slow. This means, that people at least often try to use sum() for lists. That case was also explicitly mentioned in comments to sum() sources. So the problem is not hypothetical.
IMHO, the reason why sum() supports other input types than numbers is that in Python 2, at the time when it was implemented, there was no clear definition of a "number", so it was impossible to correctly distinguish between numberish input and input that should be rejected. I guess that explicitly rejecting arbitrary builtin types was considered overly, well, arbitrary, so it wasn't done at the time. If sum() was designed today, with the ABCs in Python 3 available, it would be easy to restrict it to meaningful input types, i.e. those that declare themselves as being numbers. However, changing this now would arbitrarily break existing code, so it's unlikely that this will happen. So I guess we'll just have to live with that little wart that sum() doesn't always reject stupid input. We can still tell those who try it that what they do is stupid, and show them better ways to do it. There were lots of suitable examples of better code presented over the last decade or so, some of which showed up in the recent threads again. Stefan
participants (10)
-
David Mertz -
Ethan Furman -
Joshua Landau -
M.-A. Lemburg -
Paul Moore -
Ron Adam -
Sergey -
Stefan Behnel -
Steven D'Aprano -
Terry Reedy