The non-obvious nature of str.join (was Re: sum(...) limitation)
(switching from python-dev to python-ideas) FWIW, I don't consider str.join *or* sum with an empty string as the starting point to be particularly intuitive ways of joining iterables of strings. str.join was invented before we had keyword-only arguments as a common construct, and before print became an ordinary function that accepted a "sep" keyword-only argument. I'd be interested in seeing a concrete proposal for a "concat" builtin that accepted a "sep" keyword only argument. Even if such a PEP ends up being rejected, it would hopefully help cut short the *next* potentially interminable thread on the topic by gathering the arguments for and against in a more readily accessible place. Regards, Nick. On 10 Aug 2014 18:26, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Alexander Belopolsky writes:
On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
All the suggestions I've seen so far are (IMHO, YMMV) just as ugly as the present situation.
What is ugly about allowing strings? CPython certainly has a way to make sum(x, '')
sum(it, '') itself is ugly. As I say, YMMV, but in general last I heard arguments that are usually constants drawn from a small set of constants are considered un-Pythonic; a separate function to express that case is preferred. I like the separate function style.
And that's the current situation, except that in the case of strings it turns out to be useful to allow for "sums" that have "glue" at the joints, so it's spelled as a string method rather than a builtin: eg, ", ".join(paramlist).
Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang[tm]. Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On 8/10/2014 8:33 AM, Nick Coghlan wrote:
(switching from python-dev to python-ideas)
FWIW, I don't consider str.join *or* sum with an empty string as the starting point to be particularly intuitive ways of joining iterables of strings.
I think sum(iofs, '') is obvious and I do not understand the determination to not allow it and implement it as ''.join(iofs). Perhaps that is because I think '+' is the *right* choice for string concatenation. Addition of tallies (base 1 numbers) is concatenation. For tallies, or strings, r and s, len(r+s) == len(r) + len(s). (That this is not true of sets is a reason to not use '+' for set union, which Python doesn't.) (I also think sum(it_of_lists, somelist), which is allowed, should be implemented as somelist.extend(it_of_lists), but that is another issue.)
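For readers who have not run into the limitation being argued about, here is a minimal sketch of the current behavior. The name iofs is just the thread's shorthand for an iterable of strings, and the sample values are made up for illustration.

```python
iofs = ["spam", "eggs", "ham"]

# str.join is the spelling that sum() steers you toward.
joined = "".join(iofs)

# sum() rejects a str start value with a TypeError.
try:
    sum(iofs, "")
    sum_error = None
except TypeError as exc:
    sum_error = exc

# Lists, by contrast, are accepted by sum().
flattened = sum([[1, 2], [3]], [])

# The length property mentioned above holds for strings.
r, s = "|||", "||"
lengths_add = len(r + s) == len(r) + len(s)
```

The refusal is specific to strings; other sequence types that support '+' go through without complaint.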
str.join was invented before we had keyword-only arguments as a common construct, and before print became an ordinary function that accepted a "sep" keyword-only argument.
The reason sep in print has to be keyword-only is that print takes an indefinite number of main arguments rather than an iterable of items to print. This is appropriate because we generally print a few items that are not already in a collection, because collections of strings can be .joined, and because collections can have, and builtin collections do have, methods to join the string representations of their members. Print, str and .__str__ methods, and str.join work together wonderfully.
The start parameter of sum does not need to be keyword-only because sum takes one iterable of items to sum. This is appropriate because a few items to sum, not in a collection, can be handled by '+', and sum is needed for large collections of items already in a collection.
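The contrast between the two signatures can be sketched as follows. The values are illustrative only, and the output is captured in a StringIO buffer simply so that it can be inspected.

```python
import io

items = ["a", "b", "c"]

# print takes any number of positional items, so sep must be passed by keyword.
buffer = io.StringIO()
print(*items, sep=", ", file=buffer)
printed = buffer.getvalue()

# str.join takes a single iterable; the separator is the receiver.
joined = ", ".join(items)

# sum takes a single iterable, so start can be an ordinary second argument.
total = sum([1, 2, 3], 10)
```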
I'd be interested in seeing a concrete proposal for a "concat" builtin that accepted a "sep" keyword only argument. Even if such a PEP ends up being rejected, it would hopefully help cut short the *next* potentially interminable thread on the topic by gathering the arguments for and against in a more readily accessible place.
For the against list: New builtins should add something more than a pure synonym; "concat(iofs, sep=joiner)" is harder (6 more letters), not easier, to write than "joiner.join(iofs)". -- Terry Jan Reedy
On Sunday, August 10, 2014 11:26 AM, Terry Reedy <tjreedy@udel.edu> wrote:
For the against list: New builtins should add something more than a pure synonym; "concat(iofs, sep=joiner)" is harder (6 more letters), not easier, to write than "joiner.join(iofs)".
If concat instead meant "joiner.join(map(type(joiner), iofs))", that might be more useful than just a synonym. I've seen many novices try to figure out how to join up their list of ints. Also, it could conceivably be faster for user string-like classes, especially mutable ones (because then, as long as they had slice replacement, everything else would happen in C, whereas if they have to write a custom join, it will be looping in Python). Also, being a new function, it might be reasonable for concat to handle iterators differently from sequences (exponential expansion rather than preallocating). I've seen a couple people asking why ''.join(genexpr) uses twice as much memory as they expected (although far fewer than the joining-up-ints questions). Of course any (or all) of those also might well be a bad idea in its own right. But a new function gives us a chance to think through how it should work—and then, if it turns out it should work exactly like join, the answer may be "just a synonym, reject it", but if not, it might be reasonable to add it and gradually phase out join.
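A rough sketch of the converting variant described above, purely for illustration. No concat builtin exists; the name, the keyword-only sep parameter, and its default are all assumptions taken from the proposal being discussed.

```python
def concat(iterable, *, sep=""):
    """Hypothetical helper, not a real builtin.

    Converts every item to type(sep) and then joins with sep.
    """
    return sep.join(map(type(sep), iterable))


numbers = [1, 2, 3]
as_text = concat(numbers, sep=", ")
plain = concat(["a", "b", "c"])
```

One caveat with converting through type(sep): for a bytes separator, bytes(3) produces three zero bytes rather than b"3", so this sketch only does the expected thing for str.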
On Aug 10, 2014 9:51 PM, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
On Sunday, August 10, 2014 11:26 AM, Terry Reedy <tjreedy@udel.edu> wrote:
For the against list: New builtins should add something more than a pure synonym; "concat(iofs, sep=joiner)" is harder (6 more letters), not easier, to write than "joiner.join(iofs)".
If concat instead meant "joiner.join(map(type(joiner), iofs))", that might be more useful than just a synonym. I've seen many novices try to figure out how to join up their list of ints.
Couldn't this also be handled with a keyword argument to join? For example "joiner.join(iofs, convert=True)".
Also, it could conceivably be faster for user string-like classes, especially mutable ones (because then, as long as they had slice replacement, everything else would happen in C, whereas if they have to write a custom join, it will be looping in Python).
This seems like something an abstract base class could take care of.
Also, being a new function, it might be reasonable for concat to handle iterators differently from sequences (exponential expansion rather than preallocating). I've seen a couple people asking why ''.join(genexpr) uses twice as much memory as they expected (although far fewer than the joining-up-ints questions).
Isn't this an implementation detail that could be fixed?
Terry Reedy writes:
Addition of tallies (base 1 numbers) is concatenation.
Except that nobody actually tallies that way. In practice everybody I've ever seen count to ten by tallying does something special for fives (Americans tally four times then cross the four vertical strokes with a fifth, Japanese build the character sei (正 == true)). So what you're actually arguing is circular: if you represent a tally as a string with a singleton alphabet, it behaves like a string. But in Japanese it's actually a 5 element alphabet, and your homomorphism:
For tallies, or strings, r and s, len(r+s) == len(r) + len(s).
fails. I think that rather than focus on formal properties, we should look at how sum is used in Python. For example, numerical sum is in fact an attractive nuisance as long as floats are considered numbers. So unless and until we deprecate floats, "formal sum" will always be splintered into multiple functions. As far as concat goes, I don't really see what advantage it has over joiner.join, except in teaching. That might be enough for a builtin in this case, since concatenating iterables of strings is a pretty frequent operation in my experience.
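The "attractive nuisance" point about floats can be seen in a small sketch. It accumulates left to right with functools.reduce rather than calling the builtin sum, on the assumption that the builtin's float handling may differ between Python versions; math.fsum is the standard-library function for accurate float summation.

```python
import math
from functools import reduce
from operator import add

values = [0.1] * 10

# Plain left-to-right addition accumulates rounding error.
naive = reduce(add, values)

# math.fsum tracks the lost low-order bits and returns the rounded exact sum.
exact = math.fsum(values)
```

Here naive comes out slightly below 1.0 while exact is 1.0, which is why a separate function exists for this case.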
On 08/11/2014 02:20 AM, Stephen J. Turnbull wrote:
As far as concat goes, I don't really see what advantage it has over joiner.join, except in teaching. That might be enough for a builtin in this case, since concatenating iterables of strings is a pretty frequent operation in my experience.
I am not so sure about the benefit for teaching. I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle. You have to explain it once, of course, but after that people, especially people not biased by other languages, do not have problems with it. Wolfgang
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier < wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
On 08/11/2014 03:56 PM, Alexander Belopolsky wrote:
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
Once you have explained the beauty of s == sep.join(s.split(sep)) to anyone, they will not be confused again, but find it perfectly logical.
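The invariant can be checked directly. This is a small sketch with made-up sample strings and a non-empty separator; round_trips is a name invented for the example.

```python
def round_trips(s, sep):
    """Return True if splitting s on sep and rejoining with sep gives s back."""
    return s == sep.join(s.split(sep))


samples = ["a,b,c", "", ",", "no separators here", ",,leading and trailing,,"]
results = [round_trips(s, ",") for s in samples]
```

Every sample round-trips, including the empty string and strings that begin or end with the separator.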
Wolfgang Maier wrote:
Once you have explained the beauty of s == sep.join(s.split(sep)) to anyone, they will not be confused again, but find it perfectly logical.
You are kidding, right? What will prevent them from using logic to recall the invariant mistakenly as s == sep.join(sep.split(s)) ? If only because there are natural use cases for passing mysep.join around as a callable that will join with a fixed separator, and there would similarly exist use cases for mysep.split, while there are really none for passing around mystring.split to split a fixed string by varying separators.
This whole corner of python is a most ugly system of inconsistent acne scars in the 2.x series (dunno about 3.x). What with sum(...) pedantically telling you to use str.join - if it knows what you want, it should just do it.
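The point about passing a bound join around as a callable can be sketched like this. The data is made up, and operator.methodcaller is used as one existing way to get a "split with a fixed separator" callable, since there is no sep.split(data) spelling.

```python
from operator import methodcaller

rows = [["a", "b"], ["c", "d", "e"]]

# A bound method with the separator fixed; it can be handed to map, etc.
comma_join = ", ".join
joined = [comma_join(row) for row in rows]

# The closest existing counterpart for split with a fixed separator.
split_on_comma = methodcaller("split", ", ")
round_trip = [split_on_comma(text) for text in joined]
```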
You are kidding, right? What will prevent them from using logic to recall the invariant mistakenly as s == sep.join(sep.split(s)) ? If only because there are natural use cases for passing mysep.join around as a callable that will join with a fixed separator, and there would similarly exist use cases for mysep.split, while there are really none for passing around mystring.split to split a fixed string by varying separators.
This whole corner of python is a most ugly system of inconsistent acne scars in the 2.x series (dunno about 3.x). What with sum(...) pedantically telling you to use str.join - if it knows what you want, it should just do it.
I tend to agree...
On 11/08/2014 14:56, Alexander Belopolsky wrote:
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
Could you try something which to my understanding is unprecedented in the world of computing, as in point them to the docs? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On Mon, Aug 11, 2014 at 10:59 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
Could you try something which to my understanding is unprecedented in the world of computing, as in point them to the docs?
Sure - this is the universal remedy to any design mistake, but how many people would need to look at the docs to understand something like join(list, sep=',') or split(string, sep=',')? And how many of those seeing x.join(y) and x.split(y) for the first time will guess which argument is data and which is separator? Beautiful is better than ugly. Explicit is better than implicit. .. Readability counts. .. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. .. (The Zen of Python, by Tim Peters, AKA python -m this)
On Mon, Aug 11, 2014 at 8:47 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 10:59 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
Could you try something which to my understanding is unprecedented in the world of computing, as in point them to the docs?
Sure - this is the universal remedy to any design mistake, but how many people would need to look at the docs to understand something like join(list, sep=',') or split(string, sep=',')? And how many of those seeing x.join(y) and x.split(y) for the first time will guess which argument is data and which is separator?
Two new builtins now? "x.split(y)" is ambiguous because it's terrible code, not because of the order. Names matter and exist to provide clarity. Don't name variables 'x' and 'y'. If it were as simple as "x.split(sep)" or "x.split('\t')" then I bet almost no one will need to consult the documentation to know which is which; similarly with "sep.join(list_of_strings)". Now for someone who wants to join a list of strings and doesn't know how, sep.join(strings) is not per se obviously discoverable, but it isn't that bad and people will have to go to the docs anyways. They are likely to look in the string section I think, and find it. They might look in the list section, and may look in the builtins section, but it's still not that bad. As for how many would need to go look it up with your proposed builtins, anyone who wanted to set the separator, that's how many. And considering "," is not even kind of a sensible default for either, that'd be a lot. I remember this discussion very clearly when sum went in, and again during the py3k period, and here the zombie rises again.
Beautiful is better than ugly. Explicit is better than implicit. .. Readability counts. .. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. ..
(The Zen of Python, by Tim Peters, AKA python -m this)
The Zen are nice words and good pieces of advice that are not a tool to wield to win a debate, especially since most of them are fundamentally subjective and require you to think in Python for their full meaning to have an impact. They were also written after the fact by a long margin as a joke, even if like most jokes they have a measure of truth in them. Adding builtins defies 'Explicit is better than implicit', and though ",".join(stuff) is arguably less than beautiful, the axiom says *better* not *best*, and the order of things to almost any function at all could be called ambiguous until you look it up. Finally, at least ",".join(stuff) directly (and explicitly!) refuses any temptation to guess.
Zen aside, I'm not sure sum(list_of_strings) is the panacea of readability and beauty and who-needs-documentation-now that it's sort of implied to be in this thread. That it doesn't work for strings is for reasons already specified (CPython's immutable strings imply it could lead to *BROKEN* behavior in a reasonable implementation, even if CPython currently has a limited optimization). I ever so much don't believe we need "join" and "split" builtins; I much rather think your arguments for forcing people to use str.join make more sense, but I hope we don't see that either.
This isn't a situation that causes a huge amount of confusion. Teach the students that sep.join(list_of_strings) makes sense because who would know how to combine strings, but a string itself? Of course list_of_strings wouldn't know this. Watch the awareness blossom. Sometimes I think people overstate how big a deal things are in teaching because we all care about making Python teachable. --S
On Mon, Aug 11, 2014 at 12:21 PM, Stephen Hansen <me+python@ixokai.io> wrote:
Don't name variables 'x' and 'y'. If it were as simple as "x.split(sep)" or "x.split('\t')" then I bet almost no one will need to consult the documentation to know which is which
Sure, but by the same token, if someone writes sep.split(x), how likely is it that this error will be caught on a quick review? This is not theoretical. I've seen people make this mistake and ask for help in debugging rather non-obvious behaviors.
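A short sketch of why the swapped call is hard to spot: it does not raise, it just returns something else. The tab-separated line is made up for illustration.

```python
line = "name\tage\tcity"
sep = "\t"

# Intended: split the data on the separator.
intended = line.split(sep)

# Mistaken: split the separator on the data. No exception is raised;
# the data is not found inside the separator, so the separator comes back whole.
mistaken = sep.split(line)
```

The mistaken call yields a one-element list containing the tab character, which tends to surface later as a confusing downstream error rather than at the call site.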
On Mon, Aug 11, 2014 at 9:38 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 12:21 PM, Stephen Hansen <me+python@ixokai.io> wrote:
Don't name variables 'x' and 'y'. If it were as simple as "x.split(sep)" or "x.split('\t')" then I bet almost no one will need to consult the documentation to know which is which
Sure, but by the same token, if someone writes sep.split(x), how likely is it that this error will be caught on a quick review? This is not theoretical. I've seen people make this mistake and ask for help in debugging rather non-obvious behaviors.
I've seen people make innumerable mistakes in the past, I've seen certain classes of mistakes repeated -- this isn't an argument for change by itself. People make mistakes. It's going to happen. That said, I find the idea of "x.split(sep)" as non-obvious to be... weird, to say the least. But, obvious is subjective. Your obvious may not be my obvious nor most people's obvious. Yes, sep.join(list) is a bit of a weird construct, but it's one thing to learn, and it's not a hard one to teach at that. In fact, it makes very logical sense once you explain it and makes people think of things more Pythonically after. I say from experience, not in theory. But, string.split(sep) is very natural. You seem to think that they need to be in the same order to be obvious but I don't see why, nor do I think any of the alternatives are without problems that are bigger issues.
In fact, it makes very logical sense once you explain it and makes people think of things more Pythonically after. I say from experience, not in theory.
Could you elaborate about "making people think more pythonically after" bit? I can see how explaining the API makes people understand the API, but I'm curious how it makes people behave differently after.
On Mon, Aug 11, 2014 at 10:10 AM, Haoyi Li <haoyi.sg@gmail.com> wrote:
In fact, it makes very logical sense once you explain it and makes people think of things more Pythonically after. I say from experience, not in theory.
Could you elaborate about "making people think more pythonically after" bit? I can see how explaining the API makes people understand the API, but I'm curious how it makes people behave differently after.
Well, it's an opinion of course, so it may be as useful as the thought that the sky is high. It doesn't make people behave differently, but it does lead them to writing idiomatic code. There's naive Python code, which ideally should be clear and easy still, and then there's idiomatic Python code which has a beauty in its expressiveness while also being (relatively) efficient.
But, if the opposite of "string.split(sep)" were "list_of_strings.join(sep)" then that would mean that List would have to know about how to combine strings. That's not very Pythonic. Also, Strings are immutable, that means:
str1 = str1 + str2
logically leads one to assume that it is constructing an entirely new string, copying the contents of str1 and str2 into it, and discarding the original str1. In fact, it did that for a long time. You can't change strings, after all. That is a fundamental tenet, and when people learn that they will get a new tool and think closer to idiomatic Python. Now, CPython has an optimization for this case which is the only reason the idea of sum(list_of_strings) is not largely fundamentally broken, but that's not a promise and not a feature of the language.
IMHO.
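The immutability point can be illustrated with a small sketch. It shows only the observable semantics; whether a given interpreter avoids the copy on repeated concatenation is an implementation detail and is not demonstrated here.

```python
pieces = [str(n) for n in range(5)]

# Repeated concatenation: each step conceptually builds a brand-new string.
acc = ""
for piece in pieces:
    acc = acc + piece

# The idiomatic spelling builds the result in a single call.
joined = "".join(pieces)

# Rebinding a name never changes the string object it used to refer to.
original = "abc"
alias = original
original = original + "def"
```

After the last line, alias still refers to "abc", which is the sense in which "you can't change strings".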
One reason I think this is confusing is that "list.join(sep)" is how basically everyone else does it, including: Ruby, Javascript, Scala, Guava-Java (normal java doesn't have anything). People who don't do it this way include C# (static method), Boost-C++ (static method), PHP (static method)
Nobody does it the Python way AFAICT.
On 11/08/2014 18:51, Haoyi Li wrote:
One reason I think this is confusing is that "list.join(sep)" is how basically everyone else does it, including: Ruby, Javascript, Scala, Guava-Java (normal java doesn't have anything). People who don't do it this way include C# (static method), Boost-C++ (static method), PHP (static method)
Nobody does it the Python way AFAICT.
So one and only one language got it right and the rest got it wrong, what about it? Now is it possible to leave this vitally important issue and move on to something trivial, e.g. helping people port Python 2 code to Python 3? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 2014-08-11, at 19:51 , Haoyi Li <haoyi.sg@gmail.com> wrote:
One reason I think this is confusing is that "list.join(sep)" is how basically everyone else does it, including: Ruby, Javascript, Scala, Guava-Java (normal java doesn't have anything). People who don't do it this way include C# (static method), Boost-C++ (static method), PHP (static method)
Nobody does it the Python way AFAICT.
One could easily enough argue that MLs and Haskell do: although they don't generally have objects (and thus no method), their signature is (function name may vary)
    join :: String -> [String] -> String
    join separator list
which, in Haskell, can be infixed to
    separator `intercalate` list
and which in both languages nicely lends itself to partial application (much like Python's `sep.join` expression)
On Aug 11, 2014, at 10:45, Stephen Hansen <me+python@ixokai.io> wrote:
... if the opposite of "string.split(sep)" were "list_of_strings.join(sep)" then that would mean that List would have to know about how to combine strings. That's not very Pythonic.
I think the real problem here is that almost every OO language does it this way. Beginning programmers don't even know to look for a join function, so it doesn't matter where we put it. But someone coming to Python looking for the equivalent of -[NSArray componentsJoinedBySeparator:] or Enumerable#join or Sequence.joinStrings or whatever expects to find it as list.join.
The fact that all of these other languages agreed on a stupid solution and Python on a more sensible one (in most of those languages, just as in Python, the enumeration/iteration/etc. protocol is a universal thing that every type already has to understand, while concatenating strings is something specific to strings) doesn't change the fact that they all agreed. The question is whether this is more of a positive, as an opportunity to get people thinking Pythonically early instead of trying to write Ruby or C# code in Python, or a negative, as a stumbling block to them using the language.
Of course the obvious solution is to spell it like C++:
    stringstream ss;
    copy(a.begin(), a.end(), ostream_iterator<string>(ss));
    s = ss.str();
Then nobody will find it at all, no matter what language they come from; problem solved. :)
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On 8/11/2014 3:13 PM, Andrew Barnert wrote:
Beginning programmers don't even know to look for a join function,
There is the rather complete index, where they will find it as a bytearray/bytes/str function, as well as variations of thread.join.
But someone coming to Python looking for the equivalent of -[NSArray componentsJoinedBySeparator:] or Enumerable#join or Sequence.joinStrings or whatever expects to find it as list.join.
When they don't find it, there is the index. The first thing people should learn is that the Python doc set includes a tutorial for beginners, a Reference manual for syntax, a Library manual for objects, and a collective Index for all of these. The Index starts with a Symbol page for looking up things like '*' or '|=' that do not work in search engines. -- Terry Jan Reedy
On 08/11/2014 02:17 PM, Terry Reedy wrote:
The first thing people should learn is that the Python doc set includes a tutorial for beginners, a Reference manual for syntax, a Library manual for objects, and a collective Index for all of these. The Index starts with a Symbol page for looking up things like '*' or '|=' that do not work in search engines.
+1 Before using any language one should familiarize oneself with it, and what better way than the tutorial that comes with the language? (And no, I've never read it -- but I have read others, and several books, and a couple classes to fill in the chinks. ;) -- ~Ethan~
On Mon, Aug 11, 2014 at 12:47 PM, Stephen Hansen <me+python@ixokai.io> wrote:
Yes, sep.join(list) is a bit of a weird construct, but it's one thing to learn, and it's not a hard one to teach at that. In fact, it makes very logical sense once you explain it and makes people think of things more Pythonically after. I say from experience, not in theory. But, string.split(sep) is very natural. You seem to think that they need to be in the same order to be obvious but I don't see why, nor do I think any of the alternatives are without problems that are bigger issues.
I am not suggesting any changes to str.join or str.split methods. I am just arguing that sum(list_of_strings, '') should be allowed by the language and its performance is a matter of the quality of implementation. Once you learn that string addition is concatenation in Python, it is natural to "sum" lists of Python strings regardless of whether it makes sense in your native language. sep.join(list) is not such a weird construct when sep is non-empty - it is the sep='' case which is weird and non-obvious. (Note that someone in this thread suggested demonstrating s == sep.join(s.split(sep)) invariant as a teaching tool, but this invariant fails when sep is empty.) When you are tasked with finding s1 + s2 + ... + sN given [s1, s2, ..., sN], it is sum that first comes to mind, not join. The situation is different when you have a separator to begin with, but when you don't, using an empty separator feels like a performance hack in the absence of an efficient natural solution.
On Mon, Aug 11, 2014 at 10:14 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
it makes sense in your native language.
No, it makes no sense at all in my native language. sum is, fundamentally, the total of numbers in my native language.
When you are tasked with finding s1 + s2 + ... + sN given [s1, s2, ..., sN], it is sum that first comes to mind, not join.
It would have never occurred to me in a million years to address such a problem with sum. The first thing that'd likely have come to my mind a million years ago when I first started learning Python would be to look for some concat function, but failing that I'd run across join in the docs.
The situation is different when you have a separator to begin with, but when you don't, using an empty separator feels like a performance hack in the absence of an efficient natural solution.
I'm pretty much going to just bow out at this point because the idea that "".join(list_of_strings) is some weird and non-obvious thing, but when it has a separator it's clear and obvious, doesn't make any sense at all to me. That seems to fly directly in the face of "there should be one...." which you previously quoted.
On 8/11/2014 1:14 PM, Alexander Belopolsky wrote:
sep.join(list) is not such a weird construct when sep is non-empty - it is the sep='' case which is weird and non-obvious. (Note that someone in this thread suggested demonstrating s == sep.join(s.split(sep)) invariant as a teaching tool, but this invariant fails when sep is empty.)
Because s.split('') raises "ValueError: empty separator". I expected the result to be the same as list(s), making the invariant above true. But perhaps Guido thought splitting on '' might be a bug. re.split('', s) returns s, which seems wrong. The doc talks about 'occurrences of the pattern'. I *see* an occurrence of '' at every slice point. In a sense, Python slicing does too.
>>> 'abc'[1:1]
''
When I first learned Python, I knew to test edge case behavior rather than depend on my interpretation of docs, or worse, my expectations from previous experience and knowledge. The interactive interpreter makes little tests like 'ab'.split('') and re.split('', 'ab') trivial and faster than reading (or debugging). -- Terry Jan Reedy
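The edge cases above are easy to check directly. The following is a minimal sketch, assuming CPython 3; the helper name roundtrip_holds is made up for illustration and does not come from the thread.

```python
def roundtrip_holds(s, sep):
    """Return True if splitting on sep and joining with sep restores s."""
    return sep.join(s.split(sep)) == s


# The invariant holds for a non-empty separator, even with empty fields.
holds = roundtrip_holds("a,b,,c", ",")

# An empty separator is rejected by str.split outright.
try:
    "abc".split("")
    split_error = None
except ValueError as exc:
    split_error = str(exc)

# list(s) gives the per-character split one might have expected,
# and ''.join() inverts it.
chars = list("abc")
rejoined = "".join(chars)

print(holds, split_error, chars, rejoined)
```

So the asymmetry sits in split, which refuses the empty separator, rather than in join, which accepts it.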
On 11.08.2014 19:14, Alexander Belopolsky wrote:
sep.join(list) is not such a weird construct when sep is non-empty - it is the sep='' case which is weird and non-obvious. (Note that someone in this thread suggested demonstrating s == sep.join(s.split(sep)) invariant as a teaching tool, but this invariant fails when sep is empty.)
For the record, this doesn't fail because of any weirdness about ''.join(s). It's just that s.split() does not take an empty string as separator. So, ok, I should have said: s == sep.join(s.split(sep)) for any allowed sep (which should be an obvious requirement), but this has nothing to do with the rest of the discussion other than that it is a bit peculiar that join and split do not act perfectly symmetrically. On the other hand, a builtin function sum and a string method split would be a lot more asymmetric.
When you are tasked with finding s1 + s2 + ... + sN given [s1, s2, ..., sN], it is sum that first comes to mind, not join.
Not *my* first thought when it comes to strings, but if it is yours, then you try it once and you get an appropriate error message pointing you to the correct solution. Ok for me.
The situation is different when you have a separator to begin with, but when you don't, using an empty separator feels like a performance hack in the absence of an efficient natural solution.
On Mon, Aug 11, 2014 at 3:56 PM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier < wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
What would be the solution to this? Making join a list method, or reversing the behavior of split which means it no longer acts on x?
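For readers who have not run into the asymmetry, here is a small sketch of which object plays which role in each call. The variable names are illustrative only.

```python
data = "a,b,c"
sep = ","

# split: the data is the receiver, the separator is the argument.
parts = data.split(sep)

# join: the separator is the receiver, the data is the argument.
rebuilt = sep.join(parts)

# Swapping the roles does not raise; it silently computes something else.
swapped = data.join(["x", "y"])

print(parts, rebuilt, swapped)
```

The last call is the one that tends to confuse: it is perfectly legal, it just uses the data as the glue.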
On Mon, Aug 11, 2014 at 11:55 AM, Todd <toddrjen@gmail.com> wrote:
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
What would be the solution to this?
Allow sum(list_of_strings, '') and stop mocking people who prefer it to ''.join(..). This will not solve all the issues with join/split, but at least a simple task of concatenating a list of strings will have a more or less obvious solution.
On Mon, Aug 11, 2014 at 6:22 PM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 11:55 AM, Todd <toddrjen@gmail.com> wrote:
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
What would be the solution to this?
Allow sum(list_of_strings, '') and stop mocking people who prefer it to ''.join(..). This will not solve all the issues with join/split, but at least a simple task of concatenating a list of strings will have a more or less obvious solution.
I am confused, if it won't solve the problem, how is it relevant to my question?
On Mon, Aug 11, 2014 at 9:22 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 11:55 AM, Todd <toddrjen@gmail.com> wrote:
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
What would be the solution to this?
Allow sum(list_of_strings, '') and stop mocking people who prefer it to ''.join(..). This will not solve all the issues with join/split, but at least a simple task of concatenating a list of strings will have a more or less obvious solution.
Then we'll have two obvious solutions since "".join() and the huge body of code using it won't go away. One which is only even vaguely workable because of an implementation-specific optimization that isn't a promise of the language, which makes it at best obvious* if not obvious-ish.
On Mon, Aug 11, 2014 at 12:37 PM, Stephen Hansen <me+python@ixokai.io> wrote:
Then we'll have two obvious solutions since "".join() and the huge body of code using it won't go away.
''.join() is not an obvious solution to concatenation. It is a fringe case of a solution to a completely different problem - building a delimited string. By the same token, we have "two obvious solutions" to negation: -x and 0-x.
On Mon, Aug 11, 2014 at 5:22 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Mon, Aug 11, 2014 at 11:55 AM, Todd <toddrjen@gmail.com> wrote:
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator and y is the data being joined, but in x.split(y), it is the other way around.
What would be the solution to this?
Allow sum(list_of_strings, '') and stop mocking people who prefer it to ''.join(..). This will not solve all the issues with join/split, but at least a simple task of concatenating a list of strings will have a more or less obvious solution.
I don't have any data here, but I bet people who know about str.join (even for its natural use cases like ", ".join(...)) outnumber the people who know that sum() takes a second argument by a very large factor. Of course this also means that sum()'s special error message is probably pretty ineffective at reaching the people it's trying to educate -- to do that we'd need to warn on str += str or something, which is clearly not happening. So I can see the argument for just making sum(iterable_of_strings, "") fast. But practically speaking, how would this work? In general str.join and sum have different semantics. What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)? -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
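For reference, this is what the special error message looks like in practice. A minimal sketch, assuming CPython 3, where rejecting a str start value is an implementation choice of sum() rather than anything the '+' operator requires.

```python
words = ["some", "strings", "and", "one"]

# str.join is the spelling CPython steers users toward.
joined = "".join(words)

# sum() refuses a str start value before it looks at any item.
try:
    sum(words, "")
    sum_error = None
except TypeError as exc:
    sum_error = str(exc)

print(joined)
print(sum_error)
```

Only someone who already passes a second argument to sum() ever sees that message, which is the point being made about its limited reach.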
On 08/11/2014 10:53 AM, Nathaniel Smith wrote:
But practically speaking, how would this work? In general str.join and sum have different semantics. What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)?
The same thing that happens now if you pass a list to join with a non-string entry:
--> ' '.join(['some', 'list', 'of', 'words', 'and', 10, 'as', 'a', 'number'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 5: expected string, int found
-- ~Ethan~
On Mon, Aug 11, 2014 at 10:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/11/2014 10:53 AM, Nathaniel Smith wrote:
But practically speaking, how would this work? In general str.join and sum have different semantics. What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)?
The same thing that happens now if you pass a list to join with a non-string entry:
--> ' '.join(['some', 'list', 'of', 'words', 'and', 10, 'as', 'a', 'number'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 5: expected string, int found
class Nasty:
    def __radd__(self, other):
        return other + "foo"

"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")

-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On 08/11/2014 02:25 PM, Nathaniel Smith wrote:
On Mon, Aug 11, 2014 at 10:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/11/2014 10:53 AM, Nathaniel Smith wrote:
But practically speaking, how would this work? In general str.join and sum have different semantics. What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)?
The same thing that happens now if you pass a list to join with a non-string entry:
--> ' '.join(['some', 'list', 'of', 'words', 'and', 10, 'as', 'a', 'number'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 5: expected string, int found
class Nasty:
    def __radd__(self, other):
        return other + "foo"
"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")
Quite frankly, I regard this as a point in sum's favor. We have, effectively, a string-subclass and join chokes on it. -- ~Ethan~
On 11 Aug 2014 22:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 08/11/2014 02:25 PM, Nathaniel Smith wrote:
On Mon, Aug 11, 2014 at 10:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 08/11/2014 10:53 AM, Nathaniel Smith wrote:
But practically speaking, how would this work? In general str.join and sum have different semantics. What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)?
The same thing that happens now if you pass a list to join with a non-string entry:
--> ' '.join(['some', 'list', 'of', 'words', 'and', 10, 'as', 'a', 'number'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 5: expected string, int found
class Nasty:
    def __radd__(self, other):
        return other + "foo"
"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")
Quite frankly, I regard this as a point in sum's favor. We have, effectively, a string-subclass and join chokes on it.
Yes, but the proposal I was responding to was to do something like
def sum(it, start=0):
    if start == "":
        return "".join(it)
    ... regular sum impl here ...
And the point is that this is not trivially a transparent optimization.
-n
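To make the difference concrete, here is a runnable sketch of both behaviours. The names generic_sum and shortcut_sum are hypothetical; generic_sum stands in for "regular sum impl" (iterated '+'), since the real sum() in CPython rejects a str start value and so cannot be used for the comparison.

```python
from functools import reduce
import operator


class Nasty:
    """Not a str, but usable on the right-hand side of str + Nasty()."""

    def __radd__(self, other):
        return other + "foo"


def generic_sum(iterable, start):
    """Iterated '+', standing in for the regular sum implementation."""
    return reduce(operator.add, iterable, start)


def shortcut_sum(iterable, start=0):
    """Hypothetical sum() that delegates to str.join for an empty-string start."""
    if isinstance(start, str) and start == "":
        return "".join(iterable)
    return generic_sum(iterable, start)


items = ["some", "strings", "and", "one", Nasty()]

# Iterated '+' falls back to Nasty.__radd__ for the last operand.
via_add = generic_sum(items, "")

# The join shortcut rejects the same input.
try:
    via_join = shortcut_sum(items, "")
except TypeError as exc:
    via_join = "TypeError: " + str(exc)

print(via_add)
print(via_join)
```

The same input therefore succeeds under one implementation and raises under the other, which is why the shortcut cannot be treated as a pure optimization.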
On 08/11/2014 02:53 PM, Nathaniel Smith wrote:
Quite frankly, I regard this as a point in sum's favor. We have, effectively, a string-subclass and join chokes on it.
Yes, but the proposal I was responding to was to do something like
def sum(it, start=0):
    if start == "":
        return "".join(it)
    ... regular sum impl here ...
And the point is that this is not trivially a transparent optimization.
Ah, gotcha. Even without an optimization in sum, I'd still rather it did what it said: add things together. -- ~Ethan~
Ethan Furman writes:
Even without an optimization in sum, I'd still rather it did what it said: add things together.
But strings aren't additive, they're multiplicative:
>>> "x" "y"
'xy'
This is a convention deeply embedded in mathematics notation (multiplicative operators are noncommutative and frequently omitted in notation, additive operators are commutative and explicit notation required). The point is not that it's *wrong* to think that "strings are additive, so sum *should* apply to iterables of strings". Rather, it's that it's *not wrong* to think that "strings are non-additive and therefore not sum()-able". (I'm somewhere in between.) Why choose '+' as the string multiplication operator, then? Because programming languages aren't as flexible as mathematics notation, compromises are often made. Among other things, '*' also has a meaning for strings (when the other operand is an integer it means repetition -- oddly enough, it's commutative!) I suppose Guido could have been a pure algebraist about it and chosen '*' and '^' for concatenation and repetition, respectively, but that looks odd to me. The lesson in the end is that although there are many functions that are convenient to express in Python using operator notation, the choice of operators should not be taken too seriously in deciding how they will compose.
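The three operator facts this argument relies on can be checked in a few lines. A small sketch; nothing here is specific to a Python version.

```python
# Adjacent string literals concatenate with no operator at all.
juxtaposed = "x" "y"

# '+' on strings is not commutative ...
ab, ba = "a" + "b", "b" + "a"

# ... while '*' (repetition) accepts its operands in either order.
left, right = "ab" * 2, 2 * "ab"

print(juxtaposed, ab, ba, left, right)
```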
On 11.08.2014 23:35, Ethan Furman wrote:
class Nasty:
    def __radd__(self, other):
        return other + "foo"
"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")
Interesting. So a slight variation enables sum-based string concatenation *right now*:
class ZeroAdditionProperty:
    def __add__(self, other):
        return other

nullstr = ZeroAdditionProperty()
sum(["some", "strings", "and", "one"], nullstr)

Wolfgang
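A runnable version of this trick, with one caveat made explicit. The sketch assumes CPython 3, where sum() inspects only the type of the start value; the empty-iterable case is an observation added here, not something raised in the thread.

```python
class ZeroAdditionProperty:
    """Left identity for '+': adding anything to it returns that thing."""

    def __add__(self, other):
        return other


nullstr = ZeroAdditionProperty()

# sum() only rejects a start value that is itself a str, so this is accepted.
joined = sum(["some", "strings", "and", "one"], nullstr)

# Caveat: with nothing to add, the start object itself comes back, not ''.
empty = sum([], nullstr)

print(joined)
print(empty is nullstr)
```

This still performs repeated '+' on strings, so it sidesteps the type check without gaining the single-pass behaviour of str.join.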
On 8/11/2014 5:35 PM, Ethan Furman wrote:
On 08/11/2014 02:25 PM, Nathaniel Smith wrote:
class Nasty:
    def __radd__(self, other):
        return other + "foo"
"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")
I don't understand the point of this.
Quite frankly, I regard this as a point in sum's favor. We have, effectively, a string-subclass and join chokes on it.
Nasty is a subclass of object, with no default value. Make it a real str subclass and join works fine.
class Nasty(str):
    def __radd__(self, other):
        return other + "foo"
print("".join(["some", "strings", "and", "one", Nasty()]))
somestringsandone
-- Terry Jan Reedy
On 12.08.2014 00:15, Terry Reedy wrote:
On 8/11/2014 5:35 PM, Ethan Furman wrote:
On 08/11/2014 02:25 PM, Nathaniel Smith wrote:
class Nasty:
    def __radd__(self, other):
        return other + "foo"
"".join(["some", "strings", "and", "one", Nasty()])
sum(["some", "strings", "and", "one", Nasty()], "")
I don't understand the point of this.
Quite frankly, I regard this as a point in sum's favor. We have, effectively, a string-subclass and join chokes on it.
Nasty is a subclass of object, with no default value. Make it a real str subclass and join works fine.
class Nasty(str):
    def __radd__(self, other):
        return other + "foo"
print("".join(["some", "strings", "and", "one", Nasty()]))
somestringsandone
No, it's not, at least not as intended, or the result would be
somestringsandonefoo
The point about Ethan's example is that join only works with str and subclasses thereof, but not with proxy classes wrapping a str object.
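Both outputs being argued over can be reproduced side by side. A minimal sketch, assuming CPython 3; functools.reduce with operator.add stands in for iterated '+' because sum() itself rejects a str start value.

```python
from functools import reduce
import operator


class Nasty(str):
    """A real str subclass whose only addition is a reflected '+'."""

    def __radd__(self, other):
        return other + "foo"


items = ["some", "strings", "and", "one", Nasty()]

# join copies the underlying character data; Nasty() holds '' and
# __radd__ is never consulted.
via_join = "".join(items)

# Iterated '+' does dispatch to the subclass's __radd__.
via_add = reduce(operator.add, items, "")

print(via_join)
print(via_add)
```

Both answers are "correct" for the operation that produced them; they simply are not the same operation.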
On 8/11/2014 6:23 PM, Wolfgang Maier wrote:
On 12.08.2014 00:15, Terry Reedy wrote:
Nasty is a subclass of object, with no default value. Make it a real str subclass and join works fine.
class Nasty(str):
    def __radd__(self, other):
        return other + "foo"
print("".join(["some", "strings", "and", "one", Nasty()]))
If one runs a program with a print statement from an Idle editor,
somestringsandone
the shell prints the >>> prompt and then the output. I copy the prompt as a separator.
No, it's not,
I have been copy and pasting python code and output for over 17 years. You could have verified that I did so accurately here in less than a minute. I keep the Idle icon pinned to my task bar and usually have a scratch file for experiments in its recent files list, if not already open, to make this sort of thing easy.
at least not as intended
I was talking about reality, not intentions that were not clear to me. -- Terry Jan Reedy
On 08/11/2014 05:15 PM, Terry Reedy wrote:
On 8/11/2014 6:23 PM, Wolfgang Maier wrote:
On 12.08.2014 00:15, Terry Reedy wrote:
Nasty is a subclass of object, with no default value. Make it a real str subclass and join works fine.
class Nasty(str):
    def __radd__(self, other):
        return other + "foo"
print("".join(["some", "strings", "and", "one", Nasty()]))
If one runs a program with a print statement from an Idle editor,
somestringsandone
the shell prints the >>> prompt and then the output. I copy the prompt as a separator.
No, it's not,
I have been copy and pasting python code and output for over 17 years. You could have verified that I did so accurately here in less than a minute. I keep the Idle icon pinned to my task bar and usually have a scratch file for experiments in its recent files list, if not already open, to make this sort of thing easy.
at least not as intended
I was talking about reality, not intentions that were not clear to me.
His remark was pointed at the fact that your output is missing the final "foo". Remove the 'r' from __radd__, though, and you would have what you were trying to demonstrate. -- ~Ethan~
Ethan Furman writes:
His remark was pointed at the fact that your output is missing the final "foo". Remove the 'r' from __radd__, though, and you would have what you were trying to demonstrate.
Believe someone when he says he copy-pasted. ;-) Unless you've actually run the code and got a different result, and even then you should probably include version information etc. The presence of __radd__ (or __add__, for that matter, although it's reasonably difficult to create a str derivative without it) is irrelevant to why "".join works when Nasty is derived from str. You're confusing the actual semantics of str.join (repeated copying at appropriate offsets into a sufficiently large buffer) with the naive implementation of sum (an iterated application of '+').
On 08/12/2014 08:29 AM, Stephen J. Turnbull wrote:
Ethan Furman writes:
His remark was pointed at the fact that your output is missing the final "foo". Remove the 'r' from __radd__, though, and you would have what you were trying to demonstrate.
Believe someone when he says he copy-pasted. ;-) Unless you've actually run the code and got a different result, and even then you should probably include version information etc.
The presence of __radd__ (or __add__, for that matter, although it's reasonably difficult to create a str derivative without it) is irrelevant to why "".join works when Nasty is derived from str. You're confusing the actual semantics of str.join (repeated copying at appropriate offsets into a sufficiently large buffer) with the naive implementation of sum (an iterated application of '+').
Exactly. So my point was that when you don't subclass str, but instead use a wrapper around it, you can give it as str-like an interface as you want so the thing looks and feels like a string to users, but it will still not work as part of an iterable passed to .join (because .join is C code moving around arrays of chars and is completely ignorant of all the nice methods you added to your object). Sum, on the other hand, knows how to use .__add__ and .__radd__. Not that I think this is an argument against str.join() - it is a specialized method to join strings efficiently and it's good at that (in fact, I think that's why it makes complete sense to have it implemented as a str method, because this and nothing else is the data type it can handle). I just found Ethan's Nasty() example interesting. Wolfgang
Wolfgang Maier writes:
Exactly. So my point was that when you don't subclass str, but instead use a wrapper around it, you can give it as str-like an interface as you want so the thing looks and feels like a string to users, but it will still not work as part of an iterable passed to .join
You mean this behavior? wideload:~ 12:42$ python3.2
...
>>> class N:
...     def __init__(self, s=''):
...         self.s = s
...     def __str__(self):
...         return self.s
...
>>> " ".join(['a', N('b')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 1: expected str instance, N found
>>> ' '.join(str(x) for x in ['a', N('b')])
'a b'
Given the fact that every object is str-able, I don't think we want to give "str(x) for x in" semantics to str.join. So I think the answer is "if you want Nasty to automatically acquire all the behaviors of str, make it a subclass of str". I can't think of a use case where subclassing would be problematic.
Sum on the other hand knows how to use .__add__ and .__radd__ .
It seems to me that that's a strong argument against "summing strings" with the current implementation of sum(), given the ease with which you can construct types where the "sum" of an iterable can be implemented efficiently and gives the same answer as the generic algorithm based on '+', but the generic algorithm is inefficient (just make it immutable). I suppose most Sequence types are arrays of pointers at the C level, or otherwise implement O(1) '+=', so either the join-style "just memmove the arrays into a sufficiently large buffer", or iterated '+=', does the trick for an efficient generic sum. This is just guesswork, though.
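Tuples are a handy illustration of "just make it immutable", since sum() does accept them. The sketch below only shows that the two approaches agree on the answer; the cost difference is described in comments rather than measured, and itertools.chain is used here as a stand-in for a join-style single pass.

```python
from itertools import chain

chunks = [(1, 2), (3,), (4, 5, 6)]

# Generic algorithm: each '+' builds a fresh tuple, copying everything
# accumulated so far, so total work grows quadratically with the input.
via_sum = sum(chunks, ())

# Join-style algorithm: walk the pieces once and build a single result.
via_chain = tuple(chain.from_iterable(chunks))

print(via_sum == via_chain, via_chain)
```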
On 13 August 2014 14:38, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Wolfgang Maier writes:
Exactly. So my point was that when you don't subclass str, but instead use a wrapper around it, you can give it as str-like an interface as you want so the thing looks and feels like a string to users, but it will still not work as part of an iterable passed to .join
You mean this behavior?
wideload:~ 12:42$ python3.2
...
>>> class N:
...     def __init__(self, s=''):
...         self.s = s
...     def __str__(self):
...         return self.s
...
>>> " ".join(['a', N('b')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 1: expected str instance, N found
>>> ' '.join(str(x) for x in ['a', N('b')])
'a b'
Given the fact that every object is str-able, I don't think we want to give "str(x) for x in" semantics to str.join. So I think the answer is "if you want Nasty to automatically acquire all the behaviors of str, make it a subclass of str".
Note that this is a general problem - it is quite common to use explicit type checks against str rather than relying on ducktyping. In theory, a suitable ABC could be defined (using collections.UserString as a starting point), but nobody has ever found it a pressing enough problem to take the time to do so - it's generally easier to just inherit from str. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
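collections.UserString itself shows the problem: it is the standard library's string wrapper, yet it fails the explicit type check in str.join. A minimal sketch, assuming CPython 3.

```python
from collections import UserString

wrapped = UserString("b")

# UserString behaves like a string in many ways but is not a str subclass.
is_str = isinstance(wrapped, str)

# str.join checks the concrete type of each item, so the wrapper is rejected.
try:
    " ".join(["a", wrapped])
    join_error = None
except TypeError as exc:
    join_error = str(exc)

# An explicit conversion is the usual workaround.
joined = " ".join(str(x) for x in ["a", wrapped])

print(is_str, join_error, joined)
```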
On 08/13/2014 12:50 AM, Nick Coghlan wrote:
On 13 August 2014 14:38, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Wolfgang Maier writes:
Exactly. So my point was that when you don't subclass str, but instead use a wrapper around it, you can give it as str-like an interface as you want so the thing looks and feels like a string to users, but it will still not work as part of an iterable passed to .join
You mean this behavior?
wideload:~ 12:42$ python3.2
...
>>> class N:
...     def __init__(self, s=''):
...         self.s = s
...     def __str__(self):
...         return self.s
...
>>> " ".join(['a', N('b')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 1: expected str instance, N found
>>> ' '.join(str(x) for x in ['a', N('b')])
'a b'
Given the fact that every object is str-able, I don't think we want to give "str(x) for x in" semantics to str.join. So I think the answer is "if you want Nasty to automatically acquire all the behaviors of str, make it a subclass of str".
Note that this is a general problem - it is quite common to use explicit type checks against str rather than relying on ducktyping. In theory, a suitable ABC could be defined (using collections.UserString as a starting point), but nobody has ever found it a pressing enough problem to take the time to do so - it's generally easier to just inherit from str.
Is there a way to select a method more specifically on its mixin?
thing.method # any like named method
thing.method|mixin # Only if it's from mixin
Where method is spelled the same on different types, but its actual operation may be different. Obviously that spelling won't work, but the idea is to allow a more fine grained method selection.
thing.__add__|number(other)
thing.__add__|sequence(other)
thing.__add__|container(other)
thing.__add__|str(other)
So if I wanted to join strings but not containers or numbers, I could use the __add__|str method. And conversely, if I wanted to iterate nested containers without iterating strings too, I could use the __iter__|container method. Now that I think about it a bit more, it probably would be spelled..
mixin.method(thing, other)
But maybe the same reason we don't normally call a class method directly applies?
class.method(thing, other)
I think in most cases, the difference might be getting an attribute error early, vs a type error a bit later. But it seems to me there may be other differences/gotchas in the case of calling a mixin or class method directly. Currently I'd add a type check before calling the method, but I'd like the finer grained method resolution over the type check. Cheers, Ron
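The class.method(thing, other) spelling already works for concrete built-in types. A minimal sketch, assuming CPython 3; it covers only a concrete class such as str, not the abstract "mixin" selection being asked about.

```python
# Calling the method through the class pins down which '+' is meant.
joined = str.__add__("ab", "cd")

# The same call with a non-str receiver fails immediately.
try:
    str.__add__(["ab"], ["cd"])
    error = None
except TypeError as exc:
    error = str(exc)

print(joined)
print(error)
```

The failure here is a TypeError raised at call time by the method descriptor, so it acts much like the explicit type check it would replace.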
On 08/11/2014 11:29 PM, Stephen J. Turnbull wrote:
Ethan Furman writes:
His remark was pointed at the fact that your output is missing the final "foo". Remove the 'r' from __radd__, though, and you would have what you were trying to demonstrate.
Believe someone when he says he copy-pasted. ;-) Unless you've actually run the code and got a different result, and even then you should probably include version information etc.
I did believe -- I just also forgot that Python would use __radd__ in a subclass, and so thought that was a bug. But you're right, I should have tried it myself before posting -- my apologies. -- ~Ethan~
Nathaniel Smith writes:
I don't have any data here, but I bet people who know about str.join (even for its natural use cases like ", ".join(...)) outnumber the people who know that sum() takes a second argument by a very large factor.
This is easy to fix, it now occurs to me. Allow types with __add__ to provide an optional __sum__ method, and give the numeric ABC a default __sum__ implementation. (It would be nice if it could check for floats and restart with fsum if one is encountered. And of course, there may be other ABCs with __add__ that could get a default __sum__.) Then sum could be just
def sum(itr, start=0):
    if start == 0:
        itr = iter(itr)
        start = next(itr)
    return start.__sum__(itr)
and
class str(...):
    def __sum__(self, itr):
        return self + ''.join(itr) # probably can be optimized
Is it really worth it, though?
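The idea can be tried out without touching the builtins. The sketch below is one possible reading of the proposal, not an implementation of it: __sum__ is a hypothetical hook that no Python version defines, Text is a made-up str subclass standing in for str, protocol_sum is a made-up name, and a sentinel replaces the "start == 0" test so that an explicit start of 0 still works.

```python
from functools import reduce
import operator

_MISSING = object()


class Text(str):
    """Stand-in for str carrying the hypothetical __sum__ hook."""

    def __sum__(self, rest):
        return self + "".join(rest)


def protocol_sum(iterable, start=_MISSING):
    """Hypothetical sum() that lets the first operand's type pick the algorithm."""
    it = iter(iterable)
    if start is _MISSING:
        try:
            start = next(it)
        except StopIteration:
            return 0  # keep the familiar sum([]) == 0
    hook = getattr(type(start), "__sum__", None)
    if hook is not None:
        return hook(start, it)
    # Fallback for types without the hook: plain iterated '+'.
    return reduce(operator.add, it, start)


print(protocol_sum([Text("some"), "strings", "and", "one"]))
print(protocol_sum([1, 2, 3]))
```

Dispatching on the first operand means the result depends on the order of a mixed iterable, which is one of the open questions raised in the rest of this message.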
But practically speaking, how would this work? In general str.join and sum have different semantics.
sum(iter_of_str) currently doesn't have semantics. The semantics proponents of sum() seem to expect is precisely ''.join(iter_of_str). Where's the problem?
What happens if we descend deep into the iterable and then discover a non-string (that might nonetheless still have a + operator)?
We lose, er, an exception is raised. Why is that a problem? I think most people who want a polymorphic sum() expect it to accept a homogeneous iterable as the first argument. I don't think they have expectations that sum will be equivalent to
def new_sum(it, start=0): # compatible signature ;-)
    it = iter(it)
    result = start or next(it)
    for x in it:
        result = result + x
    return result
for heterogeneous iterables. Among other things, how do you decide the appropriate return type? start's? That of next(iter(it))? The "most important" of the types in it? Ask for a BDFL pronouncement at each invocation? I suppose you could ask that functions that operate on iterables be partially applicable in the sense that if they *do* raise on the "wrong" type, the exception should provide a partial result, the oddball operand, and an iterable containing the unconsumed operands as attributes. Then the __sum__ method could handle heterogeneous operands if it wants to. Note that partial_sum + oddball may have a different type from the expected one even if it works. This seems like a recipe for bugs to me. Are there use cases for such heterogeneous sums? The only exception that might be pretty safe would be a case where you can coerce the oddball to the partial result's type. But in the salient case of str, pretty much every x has a str(x). I don't think that an optimized version of:
def new_sum(iter, start):
    expected_type = type(start)
    result = start
    for x in iter:
        try:
            result = result + x
        except TypeError:
            result = result + expected_type(x)
    return result
is really what we want when type(start) == str, so it probably shouldn't be default, and probably not when type(start) is numeric, either.
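Running the coercing variant on mixed input shows why it is unattractive as a default. A minimal sketch; coercing_sum is a made-up name for the second function described above, and the inputs are illustrative.

```python
def coercing_sum(iterable, start):
    """Iterated '+' that coerces an oddball operand to type(start) on TypeError."""
    expected_type = type(start)
    result = start
    for x in iterable:
        try:
            result = result + x
        except TypeError:
            result = result + expected_type(x)
    return result


# With a str start, anything at all is silently stringified.
as_text = coercing_sum(["a", 1, None], "")

# With an int start, numeric-looking strings are silently parsed.
as_number = coercing_sum([1, "2", 3.5], 0)

print(repr(as_text), as_number)
```

Neither call raises, so a stray None or a string in numeric data would pass unnoticed.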
On 8/11/2014 9:56 AM, Alexander Belopolsky wrote:
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator
Given that the two parameters of join are a concrete string and the abstraction 'iterable of strings', join can only be a method of the joiner. I would first teach ' '.join(it_of_strings) as the 'normal' case of joining 'words', along with print with the default sep = ' '.
and y is the data being joined, but in x.split(y), it is the other way around.
*If* sep is present, then sep.split(string) would be possible. But when sep is *not* present, split cannot be a method of something that is not there. So I think I would teach s.split() first and then add .split(sep) and .splitlines(). I would also teach join and split together since they are, at their cores (excluding special cases), inverses. -- Terry Jan Reedy
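The "excluding special cases" qualifier is worth seeing once. A small sketch contrasting split with and without a separator; the sample strings are illustrative.

```python
s = "two  spaces"

# With an explicit separator, split and join are exact inverses.
explicit = s.split(" ")
roundtrip_explicit = " ".join(explicit)

# Without one, runs of whitespace collapse and the round trip is lossy.
default = s.split()
roundtrip_default = " ".join(default)

# splitlines drops the line endings unless asked to keep them.
lines = "a\nb\n".splitlines()

print(explicit, default, lines)
```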
On 11.08.2014 21:55, Terry Reedy wrote:
On 8/11/2014 9:56 AM, Alexander Belopolsky wrote:
On Mon, Aug 11, 2014 at 2:56 AM, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I am using Python for teaching programming to absolute beginners at university and, in my experience, joiner.join is never a big hurdle.
In my experience, it is the asymmetry between x.join(y) and x.split(y) which causes most of the confusion. In x.join(y), x is the separator
Given that the two parameters of join are a concrete string and the abstraction 'iterable of strings', join can only be a method of the joiner.
I would first teach ' '.join(it_of_strings) as the 'normal' case of joining 'words', along with print with the default sep = ' '.
and y is the data being joined, but in x.split(y), it is the other way around.
*If* sep is present, then sep.split(string) would be possible. But when sep is *not* present, split cannot be a method of something that is not there. So I think I would teach s.split() first and then add .split(sep) and .splitlines().
I would also teach join and split together since they are, at their cores (excluding special cases), inverses.
I like to show students early on that with Python they can do things very quickly that would be very hard to achieve manually. Most of my students are biologists, so they do not think initially about theoretical aspects of programming much, but want to know whether they can do something with it fast. For join/split, I typically use problems like:
- you'd like to use program xy to work on your data, but it expects input elements to be separated by semicolons when all you have is a tab-delimited format, or
- you have input consisting of numbers with ',' as the decimal separator (the default in German-speaking countries), but downstream software expects '.',
i.e., they can all be solved with the general pattern:
s = new_sep.join(s.split(old_sep))
Seeing that, in Python, you can solve these problems (in principle) with one line of quite understandable code is a very convincing argument for starting to learn the language.
Wolfgang
- you have input consisting of numbers with ',' as the decimal separator (the default in German-speaking countries), but downstream software expects '.'
,i.e., they can all be solved with the general pattern:
s = new_sep.join(s.split(old_sep))
In [1]: 'a,b'.replace(',','.')
Out[1]: 'a.b'
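To put the two spellings side by side, here is a rough sketch. The helper name change_separator and the sample rows are hypothetical. For a fixed, non-empty separator the join/split pattern and str.replace give the same result. The pattern does more than replace only once split is called without an argument, for example on input delimited by varying amounts of whitespace.

```python
def change_separator(s, old_sep, new_sep):
    """Hypothetical helper wrapping the pattern new_sep.join(s.split(old_sep))."""
    return new_sep.join(s.split(old_sep))

# Fixed separator: both spellings agree.
row = "1,5\t2,75\t10"                       # illustrative tab-delimited row
via_pattern = change_separator(row, "\t", ";")
via_replace = row.replace("\t", ";")

decimal_fixed = change_separator("3,14", ",", ".")

# Ragged whitespace: split() with no argument treats any run of whitespace
# as a single delimiter, which a single replace call does not do.
ragged = "a  b\t c"
collapsed = ";".join(ragged.split())
replaced = ragged.replace(" ", ";")

print(via_pattern)      # 1,5;2,75;10
print(decimal_fixed)    # 3.14
print(collapsed)        # a;b;c
print(repr(replaced))   # 'a;;b\t;c'
```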
On Monday, August 11, 2014 12:56 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Given that the two parameters of join are a concrete string and the abstraction 'iterable of strings', join can only be a method of the joiner.
In many languages (those with generic types), the notion of Sequence<String>, or S:Sequence where S.ElementType==String, or similar, makes sense. The problem is that you can't define methods on specializations of types, only on the generic types. And Sequence.join makes no sense, because that implies the nonsensical Sequence<int>.join. (Neither of those applies to C++, where you can add methods to specializations, or add methods to the template and they just don't appear on specializations where they make no sense because of SFINAE. But I don't know of any other language that does things that way.)
But since most of those languages have leaky type systems despite most of them supposedly being statically strongly typed, it "works". It's been fun to watch people try to do the same thing in Swift, which really is strongly typed, and will not let you define a Sequence.join method unless it compiles for any ElementType. Of course there's no such problem defining a String.join method whose parameter is S:Sequence where S.ElementType==String, so String.join works just fine.
So anyone coming to Swift from Python can add a join method without even thinking twice, while anyone coming from any other language will fight with the compiler for a while, then give up and either copy to an ObjC NSArray type to escape the type system, or copy and paste their join loop all over their code.
But anyway, pointing out that every OO language gets this wrong may be entertaining, but it doesn't help the fact that everyone coming to Python from those other languages looks for it in the wrong place.
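In Python terms, the same idea can be sketched with type hints: the element-type constraint sits on the parameter, not on the receiver. The free function join below is hypothetical and only mirrors the shape of str.join. The hints are not enforced at run time; the TypeError comes from str.join itself.

```python
from typing import Iterable

def join(sep: str, parts: Iterable[str]) -> str:
    """Hypothetical free function with the same shape as str.join."""
    return sep.join(parts)

names = {"x": 1, "y": 2}   # illustrative data

from_list = join(", ", ["a", "b"])
from_generator = join("", (c.upper() for c in "abc"))
from_dict_keys = join("/", names)   # iterating a dict yields its keys

# An iterable of non-strings is rejected by str.join at run time.
try:
    join(", ", [1, 2])
except TypeError as exc:
    error = str(exc)

print(from_list)        # a, b
print(from_generator)   # ABC
print(from_dict_keys)   # x/y
```

Any iterable of strings works as the argument, which is why the method cannot live on one particular container type.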
On 08/10/2014 07:33 AM, Nick Coghlan wrote:
FWIW, I don't consider str.join *or* sum with an empty string as the starting point to be particularly intuitive ways of joining iterables of strings.
str.join was invented before we had keyword-only arguments as a common construct, and before print became an ordinary function that accepted a "sep" keyword-only argument.
I'd be interested in seeing a concrete proposal for a "concat" builtin that accepted a "sep" keyword only argument. Even if such a PEP ends up being rejected, it would hopefully help cut short the *next* potentially interminable thread on the topic by gathering the arguments for and against in a more readily accessible place.
I think the contrast between the built-in function "sum" and the string method "join" is very interesting from a language design point of view, but I'm finding it hard to describe just why in only a few words. Others have pointed out some of the more detailed aspects of these issues, so here are some of the more general, wider views I think come into play.
* Generality / Speciality
There are advantages to both ends of this scale. A more general function is very convenient, while a more specialised function can offer a greater degree of control.
* Complexity / Quantity
A function with a more complex signature can be equivalent to several functions with simpler signatures. But as they become more complex, they also become more difficult to use.
It's not obvious how the "sum" function fits into these scales, and I believe that may be why it tends to come up frequently as something that needs to be fixed. (How depends on the viewpoint of the fixer.) If "sum" was a method on numbers, it would clearly be more specialised, or if it was defined to call a method of whatever objects it was given, it would clearly be more general. It is what it is, and I don't think there was any conscious consideration of these concepts when it was created. Probably practicality over purity was more of a factor at the time.
I believe these concepts are more likely to be considered intuitively as a developer's experience increases than to be considered consciously. So they aren't formally defined in any documentation.
How would a "concat" builtin relate to these concepts? Would "concat" be very general and delegate its work to methods so it works on a variety of objects, or would it be limited to just strings?
I'm also concerned we may add a new function just to complement the "sum" function, so that both of them look better. I think any new function needs to fit into the language as a whole on its own grounds, with a clear purpose and design.
Food for thought,
Ron
Getting back to the topic of this thread ... On Sun, Aug 10, 2014 at 8:33 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I'd be interested in seeing a concrete proposal for a "concat" builtin that accepted a "sep" keyword only argument.
I don't think concat() is the same operation as sep.join(). I view concat as an operation that is naturally defined on a pair of sequences and can be naturally (via reduce) extended to apply to an iterable of sequences:
z = concat(x, y) <=> len(z) = len(x) + len(y),
z[i] = x[i] if i < len(x),
z[i] = y[i - len(x)] otherwise.
(Note that the definition works for any abstract sequence.)
I find it odd that although CPython defines concatenation as a part of the sequence protocol at the implementation level, the abstract base class Sequence does not. This omission may present an opportunity to design Sequence.__concat__ and builtin concat() so that concatenation of iterables of sequences can be implemented efficiently in concrete types.
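The definition can be checked with what the standard library already offers. This is only a sketch: operator.concat stands in for the pairwise operation, and concat_all is a hypothetical helper name, not the proposed builtin. The naive fold copies the accumulated result at every step, which is the inefficiency a Sequence.__concat__ hook would presumably let concrete types avoid.

```python
from functools import reduce
import operator

def concat_all(sequences, start):
    """Hypothetical helper: fold pairwise concatenation over an iterable."""
    return reduce(operator.concat, sequences, start)

x, y = [1, 2, 3], [4, 5]
z = operator.concat(x, y)   # same as x + y for sequences

# The pairwise definition from the message above, checked element by element.
length_ok = len(z) == len(x) + len(y)
items_ok = all(
    z[i] == (x[i] if i < len(x) else y[i - len(x)])
    for i in range(len(z))
)

# Extension via reduce to an iterable of sequences of the same type.
flat_list = concat_all([[1], [2, 3], []], [])
flat_tuple = concat_all([(1,), (2, 3)], ())
flat_str = concat_all(["ab", "cd"], "")

print(z)            # [1, 2, 3, 4, 5]
print(flat_list)    # [1, 2, 3]
print(flat_tuple)   # (1, 2, 3)
print(flat_str)     # abcd
```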
participants (16)
- Alexander Belopolsky
- Alexander Heger
- Andrew Barnert
- Boris Borcic
- Ethan Furman
- Haoyi Li
- Mark Lawrence
- Masklinn
- Nathaniel Smith
- Nick Coghlan
- Ron Adam
- Stephen Hansen
- Stephen J. Turnbull
- Terry Reedy
- Todd
- Wolfgang Maier