Percent notation for array and string literals, similar to Perl, Ruby
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers. For me, the arrays are the most useful aspect:

    %w{one two three}  => ["one", "two", "three"]

I did a search, and I don't see that this has been suggested before, but I might have missed something. I'm guessing I'm not the first person to ask whether this seems like a desirable feature to add to Python.
"one two three".split() On Tue, Oct 22, 2019, 3:56 PM Steve Jorgensen <stevej@stevej.name> wrote:
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers.
For me, the arrays are the most useful aspect.
%w{one two three} => ["one", "two", "three"]
I did a search, and I don't see that this has been suggested before, but I might have missed something. I'm guessing I'm not the first person to ask whether this seems like a desirable feature to add to Python.
On Tue, Oct 22, 2019 at 3:54 PM Steve Jorgensen <stevej@stevej.name> wrote:
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers.
For me, the arrays are the most useful aspect.
%w{one two three} => ["one", "two", "three"]
I did a search, and I don't see that this has been suggested before, but I might have missed something. I'm guessing I'm not the first person to ask whether this seems like a desirable feature to add to Python.
I am not seeing the advantage of this. Can you provide some specific examples that you think would benefit from this syntax? For the example you gave, besides saving a few characters I don't see the advantage over the existing way we have to do that:

    'one two three'.split()

Python usually uses [ ] for list creation or indexing. Co-opting it for a substantially different purpose of string processing like this doesn't strike me as a good idea, especially since we have two string identifiers already, ' and ".

Python does have something similar in function although different in syntax: its string prefixes. For example f-strings, r-strings, byte literals, etc. There have been proposals for supporting custom string prefixes, but none have gone anywhere. Other hard-coded string prefixes could, in principle, be done, but a strong case would need to be made for them.

If we went with string prefixes, the only one I could see maybe having any traction would be a regex one, but I personally wouldn't see that as being common enough to warrant it. Converting strings to names is something I think should be discouraged rather than encouraged (we have dictionaries to handle arbitrary names); shell commands are complicated enough that I would think having a dedicated function is necessary, and I think it would be an abuse of string literals; and the others duplicate features Python already has as far as I can tell.

So I would be -100 on using [ ] for strings in any way, +0 on a regex string prefix, and -1 on all the other corresponding string prefixes.
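For reference, the string prefixes mentioned above behave like this in today's Python (nothing proposed here, just the current behaviour):

    print(f"2 + 2 = {2 + 2}")   # f-string: expressions evaluated and formatted at run time
    print(r"C:\temp\new")       # raw string: backslashes are not escape characters
    print(b"abc")               # bytes literal: a bytes object, not a str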
Todd wrote:
On Tue, Oct 22, 2019 at 3:54 PM Steve Jorgensen stevej@stevej.name wrote:
I am not seeing the advantage of this. Can you provide some specific examples that you think would benefit from this syntax? For the example you gave, besides saving a few characters I don't see the advantage over the existing way we have to do that: 'one two three'.split()
No. It really doesn't provide much benefit beyond that.
Python usually uses [ ] for list creation or indexing. Co-opting it for a substantially different purpose of string processing like this doesn't strike me as a good idea, especially since we have two string identifiers already, ' and ".
Actually, in Ruby, the surrounding character pair can be pretty much anything, and in practice, curly braces are often used. From https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... :
Any single non-alpha-numeric character can be used as the delimiter, `%[including these]`, `%?or these?`, `%~or even these things~`. By using this notation, the usual string delimiters `"` and `'` can appear in the string unescaped, but of course the new delimiter you've chosen does need to be escaped. However, if you use `%(parentheses)`, `%[square brackets]`, `%{curly brackets}` or `%<pointy brackets>` as delimiters then those same delimiters can appear unescaped in the string as long as they are in balanced pairs…
On Oct 22, 2019, at 15:06, Steve Jorgensen <stevej@stevej.name> wrote:
Actually, in Ruby, the surrounding character pair can be pretty much anything, and in practice, curly braces are often used.
This seems like a prime example of “Ruby is Perl done right, Python is not doing Perl.”
Andrew Barnert wrote:
On Oct 22, 2019, at 15:06, Steve Jorgensen stevej@stevej.name wrote:
Actually, in Ruby, the surrounding character pair can be pretty much anything, and in practice, curly braces are often used.

This seems like a prime example of “Ruby is Perl done right, Python is not doing Perl.”
That's valid. Just throwing this out there to see how people feel about it — or maybe spark other more Pythonic ideas that might provide a similar convenience.
On Tue, Oct 22, 2019 at 04:11:45PM -0400, Todd wrote:
On Tue, Oct 22, 2019 at 3:54 PM Steve Jorgensen <stevej@stevej.name> wrote:
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers.
For me, the arrays are the most useful aspect.
%w{one two three} => ["one", "two", "three"]
I would expect %w{ ... } to return a set, not a list:

    %w[ ... ]  # list
    %w{ ... }  # set
    %w( ... )  # tuple

and I would describe them as list/set/tuple "word literals". Unlike list etc displays

    [spam, eggs, cheese]

these would actually be true literals that can be determined entirely at compile-time.
I am not seeing the advantage of this. Can you provide some specific examples that you think would benefit from this syntax?
I would use this feature, or something like it, a lot, especially in doctests where there is a premium in being able to keep examples short and on one line.

Here is a small selection of examples from my code that would be improved by something like the suggested syntax. I have trimmed some of them for brevity, and to keep them on one line. (Anything with an ellipsis ... has been trimmed.) I have dozens more, but they're all pretty similar and I don't want to bore you.

    __slots__ = ('key', 'value', 'prev', 'next', 'count')
    __all__ = ["Mode_Estimators", "Location", "mfv", ...]

The "string literal".split() idiom is especially common, especially for data tables of strings. Here are some examples:

    NUMBERS = ('zero one two three ... twenty-eight twenty-nine').split()
    _TOKENS = set("indent assign addassign subassign ...".split())
    __all__ = 'loopup loopdown reduce whileloop recursive product'.split()
    for i, colour in enumerate('Black Red Green Yellow Blue Magenta Cyan White'.split()):
    for methodname in 'pow add sub mul truediv'.split():
    attrs = "__doc__ __version__ __date__ __author__ __all__".split()
    names = 'meta private dunder ignorecase invert'.split()
    unsorted = "The quick brown Fox jumps over the lazy Dog".split()
    blocks = chaff.pad('flee to south'.split(), key='george')
    minmax('aa bbbb c ddd eeeee f ggggg'.split(), key=len)

My estimate is that I would use this "string literal".split() idiom:

- about 60-70% in doctests;
- about 5-10% in other tests;
- about 25% in non-test code.

Anyone who has had to write out a large, or even not-so-large, list of words could benefit from this. Why quote each word individually like a drudge, when the compiler could do it for you at compile-time?

Specifically as a convenience for this "list of words" use-case, namedtuple splits a single string into words, e.g.

    namedtuple('Parameter', 'name alias default')

I do the same in some of my functions as well, to make it easier to pass lists of words. Similarly, support for keyword arguments in the dict constructor was specifically added to ease the case where your keys were single words:

    # {'spam': 1, 'eggs': 2}
    dict(spam=1, eggs=2)

Don't underestimate the annoyance factor of having to write out things by hand when the compiler could do it for you. Analogy: we have list displays to make it easy to construct a list:

    mylist = [2, 7, -1]

but that's strictly unnecessary, since we could construct it like this:

    mylist = list()
    mylist.append(2)
    mylist.append(7)
    mylist.append(-1)

If you think I'm being facetious about the list example, you've probably never used standard Pascal, which had arrays but no syntax to initialise them except via a sequence of assignments. That wasn't too bad if you could put the assignments in a loop, but was painful if the initial entries were strings or floats.
For the example you gave, besides saving a few characters I don't see the advantage over the existing way we have to do that:
'one two three'.split()
One of the reasons why Python is "slow" is that lots of things that can be done at compile-time are deferred to run-time. I doubt that splitting short strings will often be a bottle-neck, but idioms like this cannot help but contribute (even if only a little bit) to the extra work the Python interpreter does at run-time:

    load a pre-allocated string constant
    look up the "split" attribute in the instance (not found)
    look up the "split" attribute in the class
    call the descriptor protocol which returns a method
    call the method
    build and return a list
    garbage collect the string constant

versus:

    build and return a list from pre-allocated strings

(Or something like this, I'm not really an expert on the Python internals, I just pretend to know what I'm talking about.)
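A rough way to see the difference from the REPL (a sketch only; exact bytecode and timings vary by CPython version and machine):

    import dis
    import timeit

    # The .split() spelling loads a constant, looks up the method and calls
    # it every time it runs; the tuple of constants is loaded as a single
    # pre-built constant.
    dis.dis(compile('"one two three".split()', "<s>", "eval"))
    dis.dis(compile('("one", "two", "three")', "<s>", "eval"))

    print(timeit.timeit('"one two three".split()'))
    print(timeit.timeit('("one", "two", "three")'))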
Python usually uses [ ] for list creation or indexing. Co-opting it for a substantially different purpose of string processing like this doesn't strike me as a good idea, especially since we have two string identifiers already, ' and ".
I'm not sure why you describe this as "string processing". The result you get is a list, not a string. This would be pure syntactic sugar for:

    %w[words]  # "words".split()
    %w{words}  # set("words".split())
    %w(words)  # tuple("words".split())

except done by the compiler, at compile-time, not runtime.

-- Steven
On Wed, Oct 23, 2019 at 10:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
For the example you gave, besides saving a few characters I don't see the advantage over the existing way we have to do that:
'one two three'.split()
One of the reasons why Python is "slow" is that lots of things that can be done at compile-time are deferred to run-time. I doubt that splitting short strings will often be a bottle-neck, but idioms like this cannot help but contribute (even if only a little bit) to the extra work the Python interpreter does at run-time:
    load a pre-allocated string constant
    look up the "split" attribute in the instance (not found)
    look up the "split" attribute in the class
    call the descriptor protocol which returns a method
    call the method
    build and return a list
    garbage collect the string constant
versus:
build and return a list from pre-allocated strings
(Or something like this, I'm not really an expert on the Python internals, I just pretend to know what I'm talking about.)
This could be done as an optimization without changing syntax or semantics. As long as the initial string is provided as a literal, it should be possible to call the method at compile time, since (as far as I know) every string method is a pure function. It's made a little more complicated by the problem of mutable return values (str.split() returns a list, and if you call it again, you have to get a new unique list in case one of them gets mutated), but if you immediately iterate over it, that won't be a problem.

Currently, the CPython optimizer can recognize constructs like "if x in [1,2,3,4]" or "for x in [1,2,3,4]" and use a literal tuple instead of building a list. Recognizing the splitting of a string as another equivalent literal could be done the same way.

Whether it's worthwhile or not is another question, but if the performance penalty of the run-time splitting is a problem, that CAN be fixed even without new syntax.

ChrisA
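For example, the folding mentioned above is easy to see in the disassembly (a quick check; exact opcode names vary by CPython version):

    import dis

    # The list literal on the right-hand side of "in" is compiled to a
    # single tuple constant rather than being built at run time.
    dis.dis(compile("x in [1, 2, 3, 4]", "<example>", "eval"))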
On Oct 22, 2019, at 17:47, Chris Angelico <rosuav@gmail.com> wrote:
Currently, the CPython optimizer can recognize constructs like "if x in [1,2,3,4]" or "for x in [1,2,3,4]" and use a literal tuple instead of building a list. Recognizing the splitting of a string as another equivalent literal could be done the same way.
Whether it's worthwhile or not is another question, but if the performance penalty of the run-time splitting is a problem, that CAN be fixed even without new syntax.
This would be relatively easy to do in an AST-processing import hook. Then people could experiment with it and if someone finds real-life performance benefits, file a bug to add it to CPython (which should be a lot easier nowadays than it was a few versions ago).
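A minimal sketch of such a transformer (hypothetical code, and only the AST half: wiring it into an import hook that runs it before compiling each module is left out; note it folds to a tuple, so it only preserves behaviour when the result is merely iterated, never mutated):

    import ast

    class SplitLiteralFolder(ast.NodeTransformer):
        """Fold '"a b c".split()' into a constant tuple at compile time."""

        def visit_Call(self, node):
            self.generic_visit(node)
            if (isinstance(node.func, ast.Attribute)
                    and node.func.attr == "split"
                    and not node.args
                    and not node.keywords
                    and isinstance(node.func.value, ast.Constant)
                    and isinstance(node.func.value.value, str)):
                folded = ast.Constant(tuple(node.func.value.value.split()))
                return ast.copy_location(folded, node)
            return node

    tree = SplitLiteralFolder().visit(ast.parse('colours = "red green blue".split()'))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<folded>", "exec"), namespace)
    print(namespace["colours"])  # ('red', 'green', 'blue')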
On Wed, Oct 23, 2019 at 11:47:04AM +1100, Chris Angelico wrote:
This could be done as an optimization without changing syntax or semantics.. As long as the initial string is provided as a literal, it should be possible to call the method at compile time, since (as far as I know) every string method is a pure function.
Sure, it could be done as an optimization, similar to one of the proposals here: https://bugs.python.org/issue36906 It could also be done by a source code preprocessor, or an AST transformation, without changing syntax. But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using. -- Steven
23.10.19 13:08, Steven D'Aprano wrote:
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
There is already the One Obvious Way, and you know it will work whatever version or implementation of Python you are using.
On Wed, Oct 23, 2019 at 01:42:11PM +0300, Serhiy Storchaka wrote:
23.10.19 13:08, Steven D'Aprano wrote:
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
There is already the One Obvious Way, and you know it will work whatever version or implementation of Python you are using.
Your "One Obvious Way" is not obvious to me. Should I write this: # This is from actual code I have used. ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "twenty-one", "twenty-two", "twenty-three", "twenty-four" "twenty-five", "twenty-six", "twenty-seven", "twenty-eight", "twenty-nine", "thirty"] Or this? """zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty twenty-one twenty-two twenty-three twenty-four twenty-five twenty-six twenty-seven twenty-eight twenty-nine thirty""".split() I've been told by people that if I use the first style I'm obviously ignorant and don't know Python very well, and by other people that the second one is a hack and that I would fail a code review for using it. So please do educate me Serhiy, which one is the One Obvious Way that we should all agree is the right thing to do? -- Steven
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.

    colors1 = "red green blue".split()  # happy

Later

    colors2 = "cyan forest green burnt umber".split()  # oops, not what I wanted, quote each separately

On Wed, Oct 23, 2019, 7:03 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Oct 23, 2019 at 01:42:11PM +0300, Serhiy Storchaka wrote:
23.10.19 13:08, Steven D'Aprano wrote:
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
There is already the One Obvious Way, and you know it will work whatever version or implementation of Python you are using.
Your "One Obvious Way" is not obvious to me. Should I write this:
# This is from actual code I have used.
["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "twenty-one", "twenty-two", "twenty-three", "twenty-four" "twenty-five", "twenty-six", "twenty-seven", "twenty-eight", "twenty-nine", "thirty"]
Or this?
"""zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty twenty-one twenty-two twenty-three twenty-four twenty-five twenty-six twenty-seven twenty-eight twenty-nine thirty""".split()
I've been told by people that if I use the first style I'm obviously ignorant and don't know Python very well, and by other people that the second one is a hack and that I would fail a code review for using it.
So please do educate me Serhiy, which one is the One Obvious Way that we should all agree is the right thing to do?
-- Steven
On 23/10/2019 15:09, David Mertz wrote:
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.
colors1 = "red green blue".split() # happy
Later
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
I'm seriously not getting the issue people have with

    colours1 = ["red", "green", "blue"]

which has the advantage of saying what it means.

-- Rhodri James *-* Kynesim Ltd
On Wed, Oct 23, 2019 at 03:16:51PM +0100, Rhodri James wrote:
I'm seriously not getting the issue people have with
colours1 = ["red", "green", "blue"]
which has the advantage of saying what it means.
As opposed to the alternatives, which say something different from what they mean? The existing alternative:

    "red green blue".split()

equally "has the advantage of saying what it means", and so will the proposed alternative, just as it already does in Ruby.

I know that code is read more than it's written, but it still has to be written, and maintained, and writing out long lists of words is annoying to write and tedious to read. An example like "red", "green", "blue" isn't too bad, but try it with 30 or more single-word strings. I have. 1 out of 5, would not recommend.

Hand-writing repetitive, dumb, mechanical code is an anti-pattern. I'm sure that, somewhere out there, there's a coder who prefers to write:

    [mylist[1], mylist[2], mylist[3], mylist[4], mylist[5]]

instead of the obvious slice, but most of us didn't become programmers because we love the tedious, repetitive boilerplate.

    [ QUOTE red QUOTE COMMA QUOTE green QUOTE COMMA QUOTE blue QUOTE COMMA
      QUOTE yellow QUOTE COMMA QUOTE magenta QUOTE COMMA ... ]

Wherever possible, we should let the interpreter or compiler do the repetitive stuff.

The average word length in English is five characters. That means that in a list of typical English words, more than a third of the expression is made up of the quotes and commas. In the example you give, there are twelve characters in the words themselves and eight characters worth of boilerplate surrounding them (quotes and commas, not including the spaces or brackets).

-- Steven
On Thu, Oct 24, 2019 at 2:20 AM Steven D'Aprano <steve@pearwood.info> wrote:
Hand-writing repetitive, dumb, mechanical code is an anti-pattern. I'm sure that, somewhere out there, there's a coder who prefers to write:
[mylist[1], mylist[2], mylist[3], mylist[4], mylist[5]]
instead of the obvious slice, but most of us didn't become programmers because we love the tedious, repetitive boilerplate.
Siiiiiiiigh... that one actually strikes home with me. Some of my non-Python coding is in a language called SourcePawn, which doesn't have any sort of "bulk operations" like slicing or *args or anything. So I might have code like this:

    SmokeLog("[%d-A] Smoke (%.2f, %.2f, %.2f) - (%.2f, %.2f)",
             client, pos[0], pos[1], pos[2], angle[0], angle[1]);

where "pos" and "angle" are vectors - arrays of three floating-point values. In Python, a Vector would be directly stringifiable, of course, but even if not, you could at least say *pos, *angle.

So if someone is coming from a background in languages that can't do this sort of thing, then yes, Python's way doesn't "look like what it does". Quite frankly, that's a feature, not a flaw. It looks like what the programmer intends, instead of looking like what mechanically happens on the fly. We don't write code that looks like "push this value onto the stack, push that value onto the stack, add the top two values and leave the result on the stack", even though that's how CPython byte code works. We write code that says "a + b", because that's the programmer's intention.

If your intention is to iterate over a series of words, you do not need all the mechanical boilerplate of constructing a list and properly delimiting all the pieces. In Python, we don't iterate over numbers by saying "start at 5, continue so long as we're below 20, and add 1 every time". We say "iterate over range(5, 20)". And Python is better for having that. (Trust me, I've messed up C-style for loops enough times to be 100% certain of that.)

You might argue that a blank-separated words notation is unnecessary, but it should be obvious that it's a valid way of expressing *programmer intention*.

ChrisA
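In Python the same call could be written with unpacking, as mentioned above (values made up for illustration):

    pos = (1.25, 2.50, 3.75)
    angle = (45.0, 90.0)
    client = 7
    print("[%d-A] Smoke (%.2f, %.2f, %.2f) - (%.2f, %.2f)"
          % (client, *pos, *angle))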
23.10.19 18:16, Steven D'Aprano wrote:
The average word length in English is five characters. That means that in a list of typical English words, more than a third of the expression is made up of the quotes and commas. In the example you give, there are twelve characters in the words themselves and eight characters worth of boilerplate surrounding them (quotes and commas, not including the spaces or brackets).
This would be a good argument if Python be a write-only language.
On Thu, Oct 24, 2019 at 2:39 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
23.10.19 18:16, Steven D'Aprano wrote:
The average word length in English is five characters. That means that in a list of typical English words, more than a third of the expression is made up of the quotes and commas. In the example you give, there are twelve characters in the words themselves and eight characters worth of boilerplate surrounding them (quotes and commas, not including the spaces or brackets).
This would be a good argument if Python be a write-only language.
I'm pretty sure the character counts are the same whether you're reading or writing. If anything, writing is based on keystrokes, but reading is based on characters. ChrisA
On Wed, Oct 23, 2019 at 11:44 AM Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Oct 24, 2019 at 2:39 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
23.10.19 18:16, Steven D'Aprano wrote:
The average word length in English is five characters. That means that in a list of typical English words, more than a third of the expression is made up of the quotes and commas. In the example you give, there are twelve characters in the words themselves and eight characters worth of boilerplate surrounding them (quotes and commas, not including the spaces or brackets).
This would be a good argument if Python be a write-only language.
I'm pretty sure the character counts are the same whether you're reading or writing. If anything, writing is based on keystrokes, but reading is based on characters.
Reading really isn't based on characters. People generally read words as a single unit rather than reading each character individually.
On Wed, Oct 23, 2019 at 8:45 AM Chris Angelico <rosuav@gmail.com> wrote:
This would be a good argument if Python be a write-only language.
I'm pretty sure the character counts are the same whether you're reading or writing. If anything, writing is based on keystrokes, but reading is based on characters.
It's not that simple -- it takes more work to type the quotes -- and it may take more work to read them, but they provide useful information -- this is a string.

If I see:

    colors = ["red", "green", "blue"]

It is VERY clear to me, at a glance, that it is a list of strings. But when I see:

    colors = "red, green, blue".split()

I need to think about it a bit.

As for:

    %w[red green blue]

The [] make it pretty clear at a glance that I'm dealing with a list -- but the lack of quotes is really likely to confuse me -- particularly if I have identifiers with similar names! And:

    %w[1 2 3]

would really take a cognitive load to remember that that is a list of strings.

I won't say that I (as a pretty bad typist) don't get annoyed at having to type quotes a lot, but I really do appreciate that clear distinction between identifiers and strings when reading code.

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Wed, Oct 23, 2019 at 09:06:49AM -0700, Christopher Barker wrote:
As for:
%w[red green blue]
The [] make it pretty clear at a glance that I'm dealing with a list -- but the lack of quotes is really likely to confuse me -- particularly if I have identifiers with similar names!
In another comment, you asserted that we all have editors that help with typing quotes. Don't you have an editor that formats identifiers differently from string literals?

I predict that even without colour or stylistic hinting, people will soon get used to the syntax. The fact that space-separated identifiers are not legal in Python is a pretty huge hint that these aren't identifiers.

Virtually overnight, the Python community got used to the opposite change, with f-strings: something that looks like a string is actually code containing identifiers and even arbitrary expressions:

    f"Your score is {score}"

so I don't believe that this will be anywhere near the cognitive load that you state, especially if you are using an editor that displays strings in a different style to identifiers or numbers.

-- Steven
On Wed, Oct 23, 2019 at 1:41 PM Steven D'Aprano <steve@pearwood.info> wrote:
In another comment, you asserted that we all have editors that help with typing quotes. Don't you have an editor that formats identifiers differently from string literals?
OK -- but which is it? Do we expect people to have smart editors or not? If we do then these are essentially equivalent in ease of reading and writing, and if not, then the new way is easier to write, but harder to read (and frankly, I think harder to write correctly if there is white space in the individual strings).

An example of that: I think it's really handy that Python allows me to use " as a string delimiter when writing actual text, so I don't have to escape the apostrophe (and vice versa for the less common " in the actual string).

Escaping is a pain and error prone -- and worse when you need to use codes: "\x20" is at least a bit harder than "\n" -- at least "\n" is a nice mnemonic. And "\u0020" is even worse.

Without an actual study, we are all going with our gut here, but I doubt I'd ever use this except for simple collections of strings that don't have spaces in them. So then there are now Three ways, rather than two obvious ways to do it :-)
I predict that even without colour or stylistic hinting, people will soon get used to the syntax. The fact that space-seperated identifiers are not legal in Python is a pretty huge hint that these aren't identifiers.
nor really, because while in a list, you need commas, in regular code, space is (at least conventionally) used to separate identifies and tokens -- that space doesn't scream out at me.
Virtually overnight, the Python community got used to the opposite change, with f-strings: something that looks like a string is actually code containing identifiers and even arbitrary expressions:
f"Your score is {score}"
so I don't believe that this will be anywhere near the cognitive load that you state
well, it's technically code, yes, but it's functionally still a string -- it looks like a string, and it evaluates to a string. I don't think that's analogous.

Again, this is all gut feeling, but we're talking about adding something new here -- a tiny bit better, and maybe worse for some, is NOT enough to add a new feature.

I can't keep track of who's who, but quite amazing to me that this is getting traction, and on the next thread over (some) people seem convinced that dict1 + dict2 would be incredibly confusing!

oh well, language design is hard.

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Thu, Oct 24, 2019 at 9:12 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Oct 23, 2019 at 1:41 PM Steven D'Aprano <steve@pearwood.info> wrote:
Virtually overnight, the Python community got used to the opposite change, with f-strings: something that looks like a string is actually code containing identifiers and even arbitrary expressions:
f"Your score is {score}"
well, it's technically code, yes, but it's functionally still a string -- it looks like a string, and it evaluates to a string. I don't think that's analogous.
An f-string is syntactic sugar for something (very approximately) like:

    "".join(["Your score is ", format(score)])

Is that a string? It results in a string. Is a list comprehension a list? It results in a list. Programmer intention and concrete implementation are completely different. Having syntactic sugar for the creation of a list of strings is quite different from having a string which you then split, even if the implementation is a string being split.
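For the curious, you can see the machinery behind an f-string directly (instruction names differ across CPython versions; the point is just that it compiles to code that runs, not to a plain string constant):

    import dis

    # Disassemble an f-string expression: the formatting and joining happen
    # at run time, via dedicated instructions.
    dis.dis(compile('f"Your score is {score}"', "<example>", "eval"))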
so I don't believe that this will be anywhere near the cognitive load that you state
Again, this is all gut feeling, but we're talking about adding something new here -- a tiny bit better, and maybe worse for some, is NOT enough to add a new feature.
I can't keep track of who's who, but quite amazing to me that this is getting traction, and on the next thread over (some) people seem convinced that
dict1 + dict2 would be incredibly confusing!
oh well, language design is hard.
Yeah, well... welcome to the insanity that we call "python-ideas" :) ChrisA
On 23/10/2019 16:16, Steven D'Aprano wrote:
On Wed, Oct 23, 2019 at 03:16:51PM +0100, Rhodri James wrote:
I'm seriously not getting the issue people have with
colours1 = ["red", "green", "blue"]
which has the advantage of saying what it means.
As opposed to the alternatives, which say something different from what they mean?
Well, yes.

    ["red", "green", "blue"]

says that this is a list of strings. End of.

    "red green blue".split()

says that this is a string that is now -- ta dah! -- a list of strings. Nothing up my sleeves. No, don't clap, just throw money. It's only a little bit of extra cognitive load in this case, but then you start meeting corner cases like wanting spaces in your strings and it stops being nearly so little.

The proposed:

    %w[red green blue]

says that this is something, good luck figuring out what. If you know, it's only a little more cognitive load, but again gets messier as you get into the corner cases, as you've been demonstrating. If you don't know, looking it up is not going to be easy.
Wherever possible, we should let the interpreter or compiler do the repetitive stuff.
I prefer to let my editor do the work, actually. When I have had to do long lists of strings (or anything, really) like this, I mostly type it in as:

    NOTIONAL_CONSTANT = [
        red
        blue
        greeeeeen
        burnt umber
        burnt cake
        really long name with lots of spaces in it
        and so on
        and so on
    ]

and then write a quick editor macro to add the quotes and comma and tab into a more beautiful (and syntactically correct) form. Not much more trouble than typing it all in as an escaped string, and no extra runtime loading either. The result is immediately readable source, which I consider a major win.

-- Rhodri James *-* Kynesim Ltd
On Wed, Oct 23, 2019 at 06:01:06PM +0100, Rhodri James wrote:
The proposed:
%w[red green blue]
says that this is something, good luck figuring out what.
You don't need *luck* to figure out what it does, you need five seconds in the REPL.

One of the most annoying tendencies on this mailing list is for people who dislike a feature to play dumb. "I know decorators, threads, multiprocessing and unicode, classes and metaclasses, protocols from ftp to smtp and beyond, I am fluent in Python, Javascript, Emacs Lisp and C, I know git and django and pandas, I fear not unit testing or continuous integration, but learning what ``%w[...]`` means will forever be beyond me!!!"

If you could learn that [...] means a list display or a list comp depending on the contents, you can learn this.

As I said before, I'm not wedded to this particular syntax, but it's an obvious mnemonic:

    w is for *words*
    [ ] are *list delimiters*

Put them together and you get a list of words.
Wherever possible, we should let the interpreter or compiler do the repetitive stuff.
I prefer to let my editor do the work, actually. [...] and then write a quick editor macro to add the quotes and comma
Great. And how about those who cannot just "write a quick editor macro" which works perfectly first time? If writing out a list of words in Python source code is so painful that you prefer to write a macro, that's a fantastic argument in favour of this new syntax! -- Steven
My one comment about this is to quote from PEP 20, the Zen of Python:

    There should be one-- and preferably only one --obvious way to do it.

Yes, this does get broken at times with additions to the language, but this whole proposal to me seems to be an effort to introduce an alternate way to do something that can be fairly easily done with basic syntax. Yes, maybe in some cases, the existing syntax has a lot of boilerplate overhead (all those quotes, what a mess), but we also have from the Zen:

    Special cases aren't special enough to break the rules.

Which I think is also applicable. It isn't that Python users can't learn a new special syntax, the question comes should they. What does this idea let you do in Python that you COULDN'T do before? Python is its own language, it doesn't need to import all the little tricks of all the other languages.
24.10.19 14:27, David Mertz wrote:
On Thu, Oct 24, 2019, 7:19 AM Richard Damon <Richard@damon-family.org> wrote:
My one comment about this is to quote from PEP 20, the Zen of Python
There should be one-- and preferably only one --obvious way to do it.
No problem, the new syntax doesn't risk being obvious! ;-)
:-D
On 24/10/2019 11:33, Steven D'Aprano wrote:
On Wed, Oct 23, 2019 at 06:01:06PM +0100, Rhodri James wrote:
The proposed:
%w[red green blue]
says that this is something, good luck figuring out what.
You don't need *luck* to figure out what it does, you need five seconds in the REPL.
One of the most annoying tendencies on this mailing list is for people who dislike a feature to play dumb.
Sigh. You may have noticed that I was being slightly flip in all my descriptions, mostly to point up the different levels of cognitive load imposed by them. The fact is, %w[...] doesn't look like anything else Python does, and it's seeking to replace an absolutely bog-standard literal.
I prefer to let my editor do the work, actually. [...] and then write a quick editor macro to add the quotes and comma
Great. And how about those who cannot just "write a quick editor macro" which works perfectly first time?
If writing out a list of words in Python source code is so painful that you prefer to write a macro, that's a fantastic argument in favour of this new syntax!
Honestly, it's not that painful. You were the one contending that it was painful, I was demonstrating a method of avoiding the pain (that I use more out of laziness) that didn't put the load on the compiler every single time I run the script.

People who can't just write editor macros have a few choices. Obviously they can learn how to write macros for their editor (and honestly, Emacs keyboard macros are pretty literally "monkey see, monkey do"). They can write the quotes and commas themselves, which isn't much more boring than writing the text in between the quotes. Or they can write a long string escaping any spaces perfectly first time, and split it.

I think the strongest argument against both this proposal and the habit of using split() is that everyone looking at your example string of colours, including you and me, missed "forest green" the first time round.

-- Rhodri James *-* Kynesim Ltd
On Thu, Oct 24, 2019, 9:01 AM Rhodri James
I think the strongest argument against both this proposal and the habit of using split() is that everyone looking at your example string of colours, including you and me, missed "forest green" the first time round.
I noticed that. Is forest green such an unfamiliar color? I put extra spaces between the colors (and just one inside each multi-word one) to show the intent.

But yes, that's similar to the kind of error I've made. I have a collection of single word items, then I accidentally add a multi-word one without thinking about the logic.
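For concreteness, here is what that kind of slip produces (three colours intended, five strings back):

    >>> "cyan forest green burnt umber".split()
    ['cyan', 'forest', 'green', 'burnt', 'umber']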
On Wed, Oct 23, 2019 at 10:09:41AM -0400, David Mertz wrote:
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.
colors1 = "red green blue".split() # happy
Later
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
It isn't shared by the proposal.

    colors2 = %w[cyan forest green burnt\x20umber]

Escaping the space ``\ `` might be nicer, but escaping an invisible character is problematic (see the problems with the explicit line continuation character ``\``) and we may not be able to add any new escape characters to the language. However a hex escape will do the trick.

-- Steven
I have to say that I'm really surprised that this idea is gaining this much traction. And this is why:

Shorthand for a list of strings, whether this proposal, or the "list of strings".split() "hack" -- is useful primarily for what I'd call "scripting", rather than "software development". There is no clear distinction, of course, but in (my definition of) scripting, the write:read ratio (and the write:everything-else ratio: e.g. running, testing, debugging, reviewing) is much higher, and it is a lot more common to have a bunch of literals. I know I often put a pile of literals at the top of a script, whereas a program would use a config file, or command line arguments, or pull data from a database or web service, or .....

So why am I surprised? Because Python, over the years, has become more of a "programming language", and a bit less of a scripting language. print x => print(x) is a prime example -- but there are many others. I'd say f-strings are the only exception I can think of of a feature that is probably more useful to "scripting" than "programming". But less so than this proposal.

On to this one -- despite the fact that I do a fair bit of quicky scripting, I don't think this is worth it. It's really only useful for a particular subset of lists of strings -- once you add escaping whitespace and all that (and what do you do with quotes?), it isn't a good general solution. Sure it's a common use case, but then, the "a bunch of words".split() solution is fine in that case.

As for "one obvious way to do it" -- that is aspirational -- there simply can't be one obvious way to do everything. And sometimes "it" is not one thing. I'd say: if you need to build a quick list of simple single words that isn't likely to get more complex, then use .split(); if you need to build a list of strings that are not simple words, and/or may get more complex, then use the full set of quotation marks.

Final point: ideally, we all have editors that help with the quotes, so it's not *quite* as much extra typing.

TL;DR -- not a really wide use case, and makes the language that much more "PERL-like".

-CHB

On Wed, Oct 23, 2019 at 8:00 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Oct 23, 2019 at 10:09:41AM -0400, David Mertz wrote:
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.
colors1 = "red green blue".split() # happy
Later
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
It isn't shared by the proposal.
colors2 = %w[cyan forest green burnt\x20umber]
Escaping the space ``\ `` might be nicer, but escaping an invisible character is problematic (see the problems with the explicit line continuation character ``\``) and we may not be able to add any new escape characters to the language. However a hex escape will do the trick.
-- Steven
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Wed, Oct 23, 2019 at 10:59 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Oct 23, 2019 at 10:09:41AM -0400, David Mertz wrote:
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.
colors1 = "red green blue".split() # happy
Later
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
It isn't shared by the proposal.
colors2 = %w[cyan forest green burnt\x20umber]
Escaping the space ``\ `` might be nicer, but escaping an invisible character is problematic (see the problems with the explicit line continuation character ``\``) and we may not be able to add any new escape characters to the language. However a hex escape will do the trick.
Compare that to:

    colors2 = "cyan,forest green,burnt umber".split(',')

or, if you follow pep8-style commas:

    colors2 = "cyan, forest green, burnt umber".split(', ')

This is one of the many cases where being able to specify the delimiter helps.
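For the record, the delimiter version does keep the multi-word colours together:

    >>> "cyan, forest green, burnt umber".split(', ')
    ['cyan', 'forest green', 'burnt umber']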
On Wed, Oct 23, 2019 at 11:59:44AM -0400, Todd wrote:
Compare that to:
colors2 = "cyan,forest green,burnt umber".split(',')
Sure, that's not going away. But consider that you're using this inside a tight loop:

    for something in lots_of_items:
        for another in more_items:
            function(spam, eggs, "cyan,forest green,burnt umber".split(','))

That's easy to fix, you say. Move the list outside the loop:

    L = "cyan,forest green,burnt umber".split(',')
    for something in lots_of_items:
        for another in more_items:
            function(spam, eggs, L)

What's wrong with this picture?

-- Steven
On 10/23/2019 01:08 PM, Steven D'Aprano wrote:
On Wed, Oct 23, 2019 at 11:59:44AM -0400, Todd wrote:
Compare that to:
colors2 = "cyan,forest green,burnt umber".split(',')
Sure, that's not going away. But consider that you're using this inside a tight loop:
    for something in lots_of_items:
        for another in more_items:
            function(spam, eggs, "cyan,forest green,burnt umber".split(','))
If you have a tight loop that is a performance bottleneck (you did measure, right?), then you trade readability for performance. This is not news.
That's easy to fix, you say. Move the list outside the loop:
L = "cyan,forest green,burnt umber".split(',')) for something in lots_of_items: for another in more_items: function(spam, eggs, L)
What's wrong with this picture?
Other than you're now using one list for all calls, and the function could be modifying that list? You do know if the function modifies the list, right?

I give up, what is wrong with that picture?

-- ~Ethan~
On Thu, Oct 24, 2019 at 7:53 AM Ethan Furman <ethan@stoneleaf.us> wrote:
    for something in lots_of_items:
        for another in more_items:
            function(spam, eggs, "cyan,forest green,burnt umber".split(','))
If you have a tight loop that is a performance bottleneck (you did measure, right?), then you trade readability for performance. This is not news.
That's easy to fix, you say. Move the list outside the loop:
L = "cyan,forest green,burnt umber".split(',')) for something in lots_of_items: for another in more_items: function(spam, eggs, L)
What's wrong with this picture?
Other than you're now using one list for all calls, and the function could be modifying that list? You do know if the function modifies the list, right? I give up, what is wrong with that picture?
I imagine that was his point -- the two codes are not equivalent if the list is modified by the function. I'll confess that I didn't immediately see that -- but then again, I'm not writing the code and thinking about what it's actually supposed to do.

But of course we can come up with toy examples of where you *could* have to put a list of string literals inside a tight loop, but I'm having trouble imagining real-world cases. And I'm pretty sure that most folks think passing a list into a function and having it modified is "bad form" anyway. But where it does make sense for a function to modify a list, it's pretty unlikely that you would want to call it with the exact same initial list anyway -- if it's getting modified, presumably you want to preserve that modification, no?

This does bring up a point though: I often think that folks use lists by default, where they don't need it to be mutable, and very well might not want it to be mutable (see above -- there is no way for the caller to know if that list is modified without reading the source). If you wanted to be sure, you could pass in a tuple of strings instead. It's long standing practice to use a tuple when you *need* it to be immutable, and a list otherwise. But maybe we should have an easy way to make a tuple of strings as well? Just sayin'

But again: my point (I think I started this sub-thread) was that performance is NOT a good reason to add this feature, and I'm still surprised that people are still bringing it up.

-CHB
-- ~Ethan~
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Todd wrote:
Compare that to:

    colors2 = "cyan,forest green,burnt umber".split(',')

or, if you follow pep8-style commas:

    colors2 = "cyan, forest green, burnt umber".split(', ')

This is one of the many cases where being able to specify the delimiter helps.
That's a very useful suggestion. :)
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
It isn't shared by the proposal.
colors2 = %w[cyan forest green burnt\x20umber]
I don't get it. There is weird escaping of spaces that aren't split? That is confusing and a bug magnet. What are the rules for escaping all whitespace, exactly? All the Unicode space-like code points, or just x20?

Plus your example doesn't capture the color "forest green" correctly in any way I can imagine. But I suppose more weird escapes in the middle could do that.

Overall... the proposal becomes incredibly ugly, and probably more characters that are harder to type, than existing syntax.
On Wed, Oct 23, 2019 at 12:02:37PM -0400, David Mertz wrote:
colors2 = "cyan forest green burnt umber".split() # oops, not what I wanted, quote each separately
Ha, speaking about "Oops" moments, I *totally* failed to notice that "forest green" is intended to be a single colour. The perils of posting in the wee hours of the morning, sorry.
It isn't shared by the proposal.
colors2 = %w[cyan forest green burnt\x20umber]
I don't get it. There is weird escaping of spaces that aren't split?
The source code has spaces between cyan and "forest-green" (let's pretend that's what it said all along...) and between forest-green and "burnt\x20umber". The parser/lexer splits on whitespace in the source code, giving three tokens:

    cyan
    forest-green
    burnt\x20umber

each of which is treated as a string, complete with standard string escaping.
That is confusing and a bug magnet.
David, you literally wrote the book on text processing in Python. I think you are being disingenuous here, and below when you describe a standard string hex-escape \x20 that has been in Python forever and in just about all C-like languages as "weird".

If you can understand why this works:

    string = "Single\n quoted\n string\n containing newlines!"

you can understand the burnt\x20umber example.
What are the rules for escaping all whitespace, exactly? All the Unicode space-like code points, or just x20?
(1) I am assuming that we don't change any of the existing string escapes. That would be a backwards-incompatible change that would change the meaning of existing strings.

(2) The parser splits on whitespace in the source code. After that, the tokens are treated as normal string tokens except that you don't need to put start/end delimiters (quotes) on them.

-- Steven
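A rough pure-Python model of rule (2), just to pin down the intended semantics (a hypothetical helper, not how it would really be implemented; the real splitting would happen in the tokenizer at compile time, and this sketch only handles ASCII source text):

    def w_words(source_text):
        # Split the raw source text on whitespace first, then apply normal
        # string-literal escape processing to each surviving token.
        return [token.encode("ascii").decode("unicode_escape")
                for token in source_text.split()]

    print(w_words(r"cyan forest-green burnt\x20umber"))
    # ['cyan', 'forest-green', 'burnt umber']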
On Oct 23, 2019, at 13:10, Steven D'Aprano <steve@pearwood.info> wrote:
David, you literally wrote the book on text processing in Python. I think you are being disingenuous here, and below when you describe a standard string hex-escape \x20 that has been in Python forever and in just about all C-like languages as "weird".
I think what he’s saying is that it’s weird that \x20 doesn’t count as white space here, when it literally means a space character. We do have to deal with this kind of weirdness in regexes, and that’s part of the reason we have raw string literals, and this is no more confusing than passing a raw string literal to re.compile. But arguably it’s also no _less_ confusing than passing a raw string to re.compile, and that does actually confuse people, and now we’re talking about promoting that kind of confusion from a parser buried inside a module that novices don’t have to use to the actual Python parser that handles every line you type.
If you can understand why this works:
string = "Single\n quoted\n string\n containing newlines!"
you can understand the burnt\x20umber example.
Not really. Your string contains new lines; it also contains spaces. Your burnt\x20umber example doesn’t contain a space. Or, rather, it doesn’t contain a space that separates the elements, but one of the elements does anyway. As if this:

    strings = "Single\n quoted\n string\n containing newlines!".splitlines()

… gave you a list of one string that contains new lines instead of a list of four strings that don’t.
On Wed, Oct 23, 2019, 4:31 PM Steven D'Aprano
David, you literally wrote the book on text processing in Python. I think you are being disingenious here, and below when you describe a standard string hex-escape \x20 that has been in Python forever and in just about all C-like languages as "weird".
I'm so flattered anyone remembers that from long ago. It was a very fun book to write. :-)

I think, however, that I've never written '\x20' before this moment in my life. I do know the ASCII and Unicode code point for a space. I've run the 'hexdump' utility plenty of times. But it's hard to think of an occasion when I would have needed to enter a space by code point rather than just quoted. So I don't think it's so disingenuous to think needing to do that would be "weird." I've escaped lots of other characters that don't have a giant key about 7x the width of other keys on my keyboard.
string = "Single\n quoted\n string\n containing newlines!" you can understand the burnt\x20umber example.
I can discern your intention for the new behavior, yes. But:

    In [2]: "burnt\x20umber".split()
    Out[2]: ['burnt', 'umber']

    In [3]: "Single\n quoted\n string\n containing newlines!".split()
    Out[3]: ['Single', 'quoted', 'string', 'containing', 'newlines!']

So this new syntax would behave in a way that is counter-intuitive for folks familiar with Python strings to date.

Also, I genuinely am not clear what should happen if an expression like

    %w[cyan forest green burnt\x20umber]

contains any of the following (non-escaped) characters. If they occur inside quotes, it seems straightforward, but in this new '%w[]' thing, who knows?

    U+00A0  NO-BREAK SPACE                     foo bar   As a space, but often not adjusted
    U+1680  OGHAM SPACE MARK                   foo bar   Unspecified; usually not really a space but a dash
    U+180E  MONGOLIAN VOWEL SEPARATOR          foobar    0
    U+2000  EN QUAD                            foo bar   1 en (= 1/2 em)
    U+2001  EM QUAD                            foo bar   1 em (nominally, the height of the font)
    U+2002  EN SPACE (nut)                     foo bar   1 en (= 1/2 em)
    U+2003  EM SPACE (mutton)                  foo bar   1 em
    U+2004  THREE-PER-EM SPACE (thick space)   foo bar   1/3 em
    U+2005  FOUR-PER-EM SPACE (mid space)      foo bar   1/4 em
    U+2006  SIX-PER-EM SPACE                   foo bar   1/6 em
    U+2007  FIGURE SPACE                       foo bar   “Tabular width”, the width of digits
    U+2008  PUNCTUATION SPACE                  foo bar   The width of a period “.”
    U+2009  THIN SPACE                         foo bar   1/5 em (or sometimes 1/6 em)
    U+200A  HAIR SPACE                         foo bar   Narrower than THIN SPACE
    U+200B  ZERO WIDTH SPACE                   foobar    0
    U+202F  NARROW NO-BREAK SPACE              foo bar   Narrower than NO-BREAK SPACE (or SPACE), “typically the width of a thin space or a mid space”
    U+205F  MEDIUM MATHEMATICAL SPACE          foo bar   4/18 em
    U+3000  IDEOGRAPHIC SPACE                  foo bar   The width of ideographic (CJK) characters.
    U+FEFF
On Wed, Oct 23, 2019 at 7:17 PM David Mertz <mertz@gnosis.cx> wrote:
Contains any of the following (non-escaped) characters. If they occur inside quotes, it seems straightforward, but in this new '%w[]' thing, who knows?
    U+00A0  NO-BREAK SPACE              foo bar   As a space, but often not adjusted
    U+1680  OGHAM SPACE MARK            foo bar   Unspecified; usually not really a space but a dash
    U+180E  MONGOLIAN VOWEL SEPARATOR   foobar    0
    U+2000  EN QUAD                     foo bar   1 en (= 1/2 em)
... To be fair, I also don't know which of those split on str.split() with no arguments to the method either.
On Oct 23, 2019, at 16:26, David Mertz <mertz@gnosis.cx> wrote:
To be fair, I also don't know which of those split on str.split() with no arguments to the method either.
I would assume the rule is the same rule used by str.isspace, and that this rule is either the simple one (category is Zs) or the full one (category is Zs or bidi class is one of the handful of bidi space classes) from the same version of Unicode that the unicodedata module handles. In fact, it’s more than an assumption—if it isn’t true, I’d expect to find a good rationale in the docs, or it’s probably a bug in the str class. You can’t document something as a method of Unicode strings that splits on “whitespace” using anything other than a Unicode definition of whitespace without a good reason.
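A quick way to check that guess for any given code point (results depend on the Unicode tables in your Python build; only the general category is examined here, not the bidi class):

    import unicodedata

    # For each code point: its number, general category, isspace() result,
    # and whether str.split() with no argument actually splits on it.
    for ch in ("\u00A0", "\u180E", "\u200B", "\uFEFF"):
        print(f"U+{ord(ch):04X}",
              unicodedata.category(ch),
              ch.isspace(),
              ("x" + ch + "x").split())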
On Wed, Oct 23, 2019 at 5:53 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
To be fair, I also don't know which of those split on str.split() with no arguments to the method either.
I couldn't resist -- the answer is most of them:

    #!/usr/bin/env python

    weird_spaces = ("x\u0020x\u00A0x\u1680x\u180Ex\u2000x\u2001x\u2002"
                    "x\u2003x\u2004x\u2005x\u2006x\u2007x\u2008x\u2009"
                    "x\u200Ax\u200Bx\u202Fx\u205Fx\u3000x\uFEFFx")
    print(weird_spaces)
    splitted = weird_spaces.split()
    print(splitted)

    print(len(weird_spaces))
    print(len(splitted))

    $ python weird_spaces.py
    x x x xx x x x x x x x x x x xx x x xx
    ['x', 'x', 'x', 'x\u180ex', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x\u200bx', 'x', 'x', 'x\ufeffx']
    41
    18

-CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Is this the same code points identified by `str.isspace`?

Thanks for doing that. I would have done so soon otherwise. Still, "most of them" isn't actually a precise answer for an uncertain string. :-)

On Wed, Oct 23, 2019, 8:57 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Oct 23, 2019 at 5:53 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
To be fair, I also don't know which of those split on str.split() with no arguments to the method either.
I couldn't resist -- the answer is most of them:
#!/usr/bin/env python

weird_spaces = ("x\u0020x\u00A0x\u1680x\u180Ex\u2000x\u2001x\u2002"
                "x\u2003x\u2004x\u2005x\u2006x\u2007x\u2008x\u2009"
                "x\u200Ax\u200Bx\u202Fx\u205Fx\u3000x\uFEFFx")

print(weird_spaces)

splitted = weird_spaces.split()
print(splitted)

print(len(weird_spaces))
print(len(splitted))

$ python weird_spaces.py
x x x xx x x x x x x x x x x xx x x xx
['x', 'x', 'x', 'x\u180ex', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x\u200bx', 'x', 'x', 'x\ufeffx']
41
18
-CHB
On Wed, Oct 23, 2019 at 6:04 PM David Mertz <mertz@gnosis.cx> wrote:
Is this the same code points identified by `str.isspace`?
I haven't checked -- so I will: and the answer is no:

$ python weird_spaces.py
x x x xx x x x x x x x x x x xx x x xx
['x', 'x', 'x', 'x\u180ex', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x\u200bx', 'x', 'x', 'x\ufeffx']
41
18
[False, True, False, True, False, True, False, False, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, False, False, True, False, True, False, True, False, False, False]

There are only three that didn't split, but many more than three that failed .isspace.

Thanks for doing that. I would have done so soon otherwise. Still, "most of them"
isn't actually a precise answer for an uncertain string. :-)
nope. But it could be defined somewhere, and presumably is, though maybe not consistently.

-CHB

On Wed, Oct 23, 2019, 8:57 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Oct 23, 2019 at 5:53 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
To be fair, I also don't know which of those split on str.split() with no arguments to the method either.
I couldn't resist -- the answer is most of them:
#!/usr/bin/env python

weird_spaces = ("x\u0020x\u00A0x\u1680x\u180Ex\u2000x\u2001x\u2002"
                "x\u2003x\u2004x\u2005x\u2006x\u2007x\u2008x\u2009"
                "x\u200Ax\u200Bx\u202Fx\u205Fx\u3000x\uFEFFx")

print(weird_spaces)

splitted = weird_spaces.split()
print(splitted)

print(len(weird_spaces))
print(len(splitted))

$ python weird_spaces.py
x x x xx x x x x x x x x x x xx x x xx
['x', 'x', 'x', 'x\u180ex', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x\u200bx', 'x', 'x', 'x\ufeffx']
41
18
-CHB
D'uh! stupid bug:
Is this the same code points identified by `str.isspace`?
I haven't checked -- so I will:
and the answer is no:
wrong, the answer is yes:
$ python weird_spaces.py
x x x xx x x x x x x x x x x xx x x xx
['x', 'x', 'x', 'x\u180ex', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x\u200bx', 'x', 'x', 'x\ufeffx']
out of 20, 17 were used as split chars
out of 20, 17 were True according to .isspace

That makes far more sense. Since I'm doing this, the three that aren't are:

U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+FEFF ZERO WIDTH NO-BREAK SPACE

The Mongolian vowel separator makes some sense (not knowing Mongolian in the least). Though I wonder what the point of a zero-width space is if it's NOT going to be a separator?

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
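A minimal sketch of that comparison, reusing the same twenty code points (a reconstruction for illustration, not the original weird_spaces.py):

weird = ("\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004"
         "\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u202F\u205F"
         "\u3000\uFEFF")

# a character counts as a "split char" if it actually separates two tokens
split_chars = [ch for ch in weird if len(("x" + ch + "x").split()) == 2]
space_chars = [ch for ch in weird if ch.isspace()]

print(f"out of {len(weird)}, {len(split_chars)} were used as split chars")
print(f"out of {len(weird)}, {len(space_chars)} were True according to .isspace")
print("same set:", set(split_chars) == set(space_chars))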
On Oct 23, 2019, at 18:59, Christopher Barker <pythonchb@gmail.com> wrote:
Since I'm doing this, the three that aren't are:
U+180E MONGOLIAN VOWEL SEPARATOR U+200B ZERO WIDTH SPACE U+FEFF ZERO WIDTH NO-BREAK SPACE
The Mongolian vowel separator makes some sense (not knowing Mongolian in the least). Though I wonder what the point of a zero-width space is if it's NOT going to be a separator?
It’s a Cf (formatting character), because it’s not used for spacing, it’s used for controlling higher-level formatting like soft line breaks. Or, put another way, it’s a bit more like a soft hyphen than it is like a space. It’s a weird distinction, but not as weird as, say, U+2028 and U+2029, which are also used for controlling formatting but literally have “separator” in their name, so they ended up creating a special category for each one so they can be Z but not Zs. Anyway, some of the answers the Unicode committee came up with are odd, but they’re the right answers by definition. Plus, even if I had a time machine and an unlimited life span, I’m pretty sure I wouldn’t want to participate in those arguments.
On 10/23/19 11:27 PM, Andrew Barnert via Python-ideas wrote:
On Oct 23, 2019, at 18:59, Christopher Barker <pythonchb@gmail.com> wrote:
Since I'm doing this, the three that aren't are:
U+180E MONGOLIAN VOWEL SEPARATOR U+200B ZERO WIDTH SPACE U+FEFF ZERO WIDTH NO-BREAK SPACE
The Mongolian vowel separator makes some sense (not knowing Mongolian in the least). Though I wonder what the point of a zero-width space is if it's NOT going to be a separator? It’s a Cf (formatting character), because it’s not used for spacing, it’s used for controlling higher-level formatting like soft line breaks. Or, put another way, it’s a bit more like a soft hyphen than it is like a space. It’s a weird distinction, but not as weird as, say, U+2028 and U+2029, which are also used for controlling formatting but literally have “separator” in their name, so they ended up creating a special category for each one so they can be Z but not Zs.
Anyway, some of the answers the Unicode committee came up with are odd, but they’re the right answers by definition. Plus, even if I had a time machine and an unlimited life span, I’m pretty sure I wouldn’t want to participate in those arguments.
My understanding was that the ZWNBS was to provide a way to logically separate two code-points that would otherwise like to bind together for meaning.

--
Richard Damon
David Mertz wrote:
But it's hard to think of an occasion when I would have needed to enter a space by code point rather than just quoted.
If you ever desperately need to enter some Python code using a keyboard with a broken space key, you might be glad to have the option! -- eval('\x70\x72\x69\x6e\x74\x28\x27\x4f\x62\x66\x75\x73\x63\x61\x74\x65\x64\x6c\x79\x20\x79\x72\x73\x2c\x5c\x6e\x47\x72\x65\x67\x27\x29')
David Mertz wrote:
One big problem with the current obvious way would be shared by the proposal. This hits me fairly often.

colors1 = "red green blue".split()  # happy

Later

colors2 = "cyan forest green burnt umber".split()  # oops, not what I wanted, quote each separately
Good point. In Ruby, this could be written correctly as `%w{cyan forest green burnt\ umber}`
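Until something like that exists in Python, the usual workarounds are to pick a delimiter that cannot appear inside an element, or to fall back to an explicit list (a small sketch):

# splitting on ", " keeps multi-word colour names intact
colors2 = "cyan, forest green, burnt umber".split(", ")
# ['cyan', 'forest green', 'burnt umber']

# or just spell the awkward elements out explicitly
colors2 = ["cyan", "forest green", "burnt umber"]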
23.10.19 14:00, Steven D'Aprano пише:
On Wed, Oct 23, 2019 at 01:42:11PM +0300, Serhiy Storchaka wrote:
23.10.19 13:08, Steven D'Aprano пише:
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
There is already the One Obvious Way, and you know it will work whatever version or implementation of Python you are using.
Your "One Obvious Way" is not obvious to me. Should I write this:
# This is from actual code I have used.
["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "twenty-one", "twenty-two", "twenty-three", "twenty-four" "twenty-five", "twenty-six", "twenty-seven", "twenty-eight", "twenty-nine", "thirty"]
Or this?
"""zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty twenty-one twenty-two twenty-three twenty-four twenty-five twenty-six twenty-seven twenty-eight twenty-nine thirty""".split()
I've been told by people that if I use the first style I'm obviously ignorant and don't know Python very well, and by other people that the second one is a hack and that I would fail a code review for using it.
So please do educate me Serhiy, which one is the One Obvious Way that we should all agree is the right thing to do?
If you need a constant number, the most obvious way is to write it as a number literal, not int('123'). If you need a constant string, the most obvious way is to write it as a string literal, not bytes([65, 66]).decode(). If you need a list of constant strings, the most obvious way is to write it as a list display consisting of string literals. It works in all Python versions.

The second way works too in all actual Python versions (starting from 1.6), and nobody will beat you if you use it in your code. It can save you a few keystrokes. But it is less obvious and less general.
On Wed, Oct 23, 2019 at 4:33 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
23.10.19 14:00, Steven D'Aprano пише:
So please do educate me Serhiy, which one is the One Obvious Way that we should all agree is the right thing to do?
If you need a constant number, the most obvious way is to write it as a number literal, not int('123'). If you need a constant string, the most obvious way is to write it as a string literal, not bytes([65, 66]).decode(). If you need a list of constant strings, the most obvious way is to write it as a list display consisting of string literals. It works in all Python versions.
The second way works too in all actual Python versions (starting from 1.6), and nobody will beat you if you use it in your code. It can save you a few keystrokes. But it is less obvious and less general.
Can we agree on the reply from Serhiy and close this discussion?

The proposed change does not bring any advantage apart from a few saved keystrokes, and even that is questionable, because it makes the code more prone to misreading/misinterpretation.

I can parse separately quoted string literals in the list (especially when they are highlighted by syntax coloring) much faster than read the one big string literal, while doing the mental split, keeping in mind which separators the author decided to use to make the split, and filtering some hardcoded chars which would otherwise get cut off.

Richard
On Thu, Oct 24, 2019 at 12:34 AM Richard Musil <risa2000x@gmail.com> wrote:
Can we agree on the reply from Serhiy and close this discussion?
The proposed change does not bring any advantage apart from a few saved keystrokes, and even that is questionable, because it makes the code more prone to misreading/misinterpretation.
I can parse separately quoted string literals in the list (especially when they are highlighted by syntax coloring) much faster than read the one big string literal, while doing the mental split, keeping in mind which separators the author decided to use to make the split, and filtering some hardcoded chars which would otherwise get cut off.
Please, ignore the last paragraph of my reply, I guess I need to go to bed... Richard
On 23/10/2019 23:34:16, Richard Musil wrote:
Can we agree on the reply from Serhiy and close this discussion?
The proposed change does not bring any advantage apart from a few saved keystrokes, and even that is questionable, because it makes the code more prone to misreading/misinterpretation.
+1. Rob Cliffe
On Oct 23, 2019, at 03:08, Steven D'Aprano <steve@pearwood.info> wrote:
It could also be done by a source code preprocessor, or an AST transformation, without changing syntax.
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
The advantage of just optimizing split on a literal is that split becomes the One Obvious Way, and you know it will work and be correct in whatever version or implementation of Python you are using, back to 0.9; it’ll just be faster in CPython 3.9+.

In fact, given that we already use split all over the place, and even offer shorthand for it in places like namedtuple, and people recommend it on python-list and StackOverflow without any pushback, I think it already is TOOWTDI for many cases. So why not optimize it?

And your argument is really an argument against adding any optimizations to CPython. The fact that nested tuple literals are now as fast as constants means someone could be constructing one right in the middle of a bottleneck, making their code appear to work on all Python versions and pass benchmarks in current CPython but then be unacceptably slow when they deploy on CPython 3.4 or uPython or whatever. But would you say that optimization was a mistake, and we should have instead left nested tuple displays slow and invented a new syntax for nested tuple constants that would make it an obvious SyntaxError in 3.4 or uPython, just because it’s possible that one person might run into that unacceptably slow case one day, even though nobody has ever complained about it?

And this is almost certainly the same thing. If someone has a case where they wrote out a long list of strings as a list literal with quotes instead of using split because benchmarking required it, where they would have been misled into using split if it were faster in 3.8 even though some of their deployment targets are 3.7, then we should listen. But I doubt anyone does. The optimization will just be a small QoI thing that adds to Python 3.9 being on average a bit faster than 3.8.
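For what it's worth, the AST-level version of that optimization is easy to sketch. This is only an illustration of the idea (it folds zero-argument .split() calls on plain string constants into a list display), not an existing CPython pass:

import ast

class FoldSplit(ast.NodeTransformer):
    """Rewrite "a b c".split() into the list display ['a', 'b', 'c']."""

    def visit_Call(self, node):
        self.generic_visit(node)
        if (isinstance(node.func, ast.Attribute)
                and node.func.attr == "split"
                and not node.args and not node.keywords
                and isinstance(node.func.value, ast.Constant)
                and isinstance(node.func.value.value, str)):
            elts = [ast.Constant(value=word)
                    for word in node.func.value.value.split()]
            return ast.copy_location(ast.List(elts=elts, ctx=ast.Load()), node)
        return node

tree = ast.parse("colors = 'red green blue'.split()")
tree = ast.fix_missing_locations(FoldSplit().visit(tree))
ns = {}
exec(compile(tree, "<folded>", "exec"), ns)
print(ns["colors"])  # ['red', 'green', 'blue']

The rewrite is behaviour-preserving for this narrow case because a list display, like split(), builds a fresh list on every evaluation.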
This talk about optimization is confusing me:

These are literals -- they should only get processed once, generally on module import.

If you are putting a long list of literal strings inside a tight loop, you are already not concerned with performance.

Performance is absolutely the LAST reason to consider any proposal like this.

I'm not saying that things like this shouldn't be optimized -- faster import is a good thing, but I am saying it's not a reason to add a language feature.

-CHB

On Wed, Oct 23, 2019 at 9:42 AM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
On Oct 23, 2019, at 03:08, Steven D'Aprano <steve@pearwood.info> wrote:
It could also be done by a source code preprocessor, or an AST transformation, without changing syntax.
But the advantage of changing the syntax is that it becomes the One Obvious Way, and you know it will be efficient whatever version or implementation of Python you are using.
The advantage of just optimizing split on a literal is that split becomes the One Obvious Way, and you know it will work and be correct in whatever version or implementation of Python you are using, back to 0.9; it’ll just be faster in CPython 3.9+.
In fact, given that we already use split all over the place, and even offer shorthand for it in places like namedtuple, and people recommend it on python-list and StackOverflow without any pushback, I think it already is TOOWTDI for many cases. So why not optimize it?
And your argument is really an argument against adding any optimizations to CPython. The fact that nested tuple literals are now as fast as constants means someone could be constructing one right in the middle of a bottleneck, making their code appear to work on all Python versions and pass benchmarks in current CPython but then be unacceptably slow when they deploy on CPython 3.4 or uPython or whatever. But would you say that optimization was a mistake, and we should have instead left nested tuple displays slow and invented a new syntax for nested tuple constants that would make it an obvious SyntaxError in 3.4 or uPython, just because it’s possible that one person might run into that unacceptably slow case one day, even though nobody has ever complained about it?
And this is almost certainly the same thing. If someone has a case where they wrote out a long list of strings as a list literal with quotes instead of using split because benchmarking required it, where they would have been misled into using split if it were faster in 3.8 even though some of their deployment targets are 3.7, then we should listen. But I doubt anyone does. The optimization will just be a small QoI thing that adds to Python 3.9 being on average a bit faster than 3.8.
On Oct 23, 2019, at 10:04, Christopher Barker <pythonchb@gmail.com> wrote:
This talk about optimization is confusing me:
The main argument for why “a b c”.split() is not good enough, and therefore we need a new syntax, is that it’s “too slow”.

Someone earlier in this thread said we could optimize calling split on a string literal, just as we can and do optimize iterating over a list literal in a for statement. The counter argument—which I thought you were adding onto—is that this would be bad because it would make people write bad code for older/alternative Pythons.

The reason I thought you were adding onto that argument is that you said people should be able to write something and know it’ll be _efficient_ on every Python implementation. Why does efficient matter if this code will only show up in places where you are, as you say below, already not concerned with performance?

That’s what I was responding to. If that wasn’t your point, I apologize for misreading it.
These are literals -- they should only get processed once, generally on module import.
If you are putting a long list of literal strings inside a tight loop, you are already not concerned with performance.
Performance is absolutely the LAST reason to consider any proposal like this.
I agree. That’s why I think “too slow” isn’t a good argument, and to the tiny extent that it is, “then let’s write an optimizer for the already-common idiom” is a good answer, not “let’s come up with a whole new syntax that does the same thing”.
On Thu, Oct 24, 2019 at 4:22 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Oct 23, 2019, at 10:04, Christopher Barker <pythonchb@gmail.com> wrote:
This talk about optimization is confusing me:
The main argument for why “a b c”.split() is not good enough, and therefore we need a new syntax, is that it’s “too slow”.
Someone earlier in this thread said we could optimize calling split on a string literal, just as we can and do optimize iterating over a list literal in a for statement.
I was the one to post it in this thread, but it wasn't my invention - talk of optimizing method calls on literals has been around before.
I agree. That’s why I think “too slow” isn’t a good argument, and to the tiny extent that it is, “then let’s write an optimizer for the already-common idiom” is a good answer, not “let’s come up with a whole new syntax that does the same thing”.
Agreed. The value of creating new syntax is (must be) that it better expresses programmer intent, not that it's easier to optimize. ChrisA
Andrew Barnert via Python-ideas wrote:
Someone earlier in this thread said we could optimize calling split on a string literal, just as we can and do optimize iterating over a list literal in a for statement.
The counter argument—which I thought you were adding onto—is that this would be bad because it would make people write bad code for older/alternative Pythons.
There's a precedent for this kind of thing -- there's an optimisation for repeatedly concatenating onto a string in some circumstances, even though building a list and joining it is recommended if you want guaranteed good performance. So the fact that it wouldn't apply to all versions and implementations of Python shouldn't really matter.

I'm not sure how much it would really help, though. Lists being mutable, it would have to build a new list every time, unless it was also being used in a context where a tuple could be substituted, making it a doubly special case. I question whether there are many examples of such cases in the wild.

--
Greg
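The precedent Greg mentions is presumably CPython's in-place string concatenation optimization; a small illustration of the two idioms, assuming a list of words to join:

words = ["red", "green", "blue"]

# relies on a CPython-specific optimization that can often resize the
# string in place; may be quadratic on other implementations
s = ""
for w in words:
    s += w

# the portable recommendation: collect pieces, then join once
s = "".join(words)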
On Oct 23, 2019, at 22:45, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Andrew Barnert via Python-ideas wrote:
Someone earlier in this thread said we could optimize calling split on a string literal, just as we can and do optimize iterating over a list literal in a for statement. The counter argument—which I thought you were adding onto—is that this would be bad because it would make people write bad code for older/alternative Pythons.
There's a precedent for this kind of thing -- there's an optimisation for repeatedly concatenating onto a string in some circumstances, even though building a list and joining it is recommended if you want guaranteed good performance. So the fact that it wouldn't apply to all versions and implementations of Python shouldn't really matter.
I'm not sure how much it would really help, though. Lists being mutable, it would have to build a new list every time,
Sure, but a small number of LOAD_CONSTs and a BUILD_LIST has to be faster than 1 LOAD_CONST and a call to the split method. From testing some different random examples, the split takes anywhere from 1.8x to 3.9x as long, and I assume with longer element strings it would be even more of a difference. I still doubt this ever occurs anywhere near a bottleneck in real-life code—but if it did, it seems like the optimization would be worth it. (Assuming a better micro-benchmark verifies my quick&dirty test.)
On Wed, Oct 23, 2019 at 10:04:36AM -0700, Christopher Barker wrote:
I'm not saying that things like this shouldn't be optimized -- faster import is a good thing, but I am saying it's not a reason to add a language feature.
I'm not sure why import is relevant here. We're not saying that performance is the only or even primary reason to add this feature. Shifting the processing from runtime to compile time is just icing on the cake.

The proposal is for a less error-prone, easier to write and read way to express the programmer's intent to generate a list of single-word string literals.

The current "obvious" solution is tedious, annoying, verbose (about a third longer than it need be) and error-prone.^1 It is so sub-optimal that some people resort to writing editor macros to write lists of words for them. Fine for those with the right skills and editor to do this, but not so great for the rest of us.

The alternative solution using split() is less tedious, but it's contentious (some people think using split() is a hack) and shifts the work from compile-time to run-time, a pessimization.^2

^1 I'm gratified to see that nobody yet has noticed the error in my earlier example involving the strings 'zero' through 'thirty', which supports my point that the status quo is sub-optimal.

^2 On my computer the split idiom is about forty percent slower:

$ ./python -m timeit "['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']"
200000 loops, best of 5: 1.44 usec per loop

$ ./python -m timeit "'a b c d e f g h'.split()"
100000 loops, best of 5: 2.05 usec per loop

so it's not a trivial slowdown, even if this is unlikely to be a bottleneck in many programs.

-- Steven
For a while I used to use a Perl-inspired `q("red green blue")` as a shortcut. That's one character more than the proposed syntax; I leave the one line implementation to readers.[*]

Despite saving some characters, it wasn't important enough to bother keeping in a utility module, let alone enshrine as single-purpose syntax.

[*] I did this long enough ago that the implementation probably involved `import string` ... Something only us folks quite long-in-tooth remember.
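The one-liner is presumably something along these lines (a guess at the helper, for illustration only; in modern Python it needs no import at all):

def q(text):
    # q("red green blue") -> ['red', 'green', 'blue']
    return text.split()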
On 10/24/2019 04:03 AM, Steven D'Aprano wrote:
The current "obvious" solution is tedious, annoying, verbose (about a third longer than it need be) and error-prone.^1
--> print("So should we have syntax for sentence literals" ... "so we don't forget spaces, too?")
^1 I'm gratified to see that nobody yet has noticed the error in my earlier example involving the strings 'zero' through 'thirty', which supports my point that the status quo is sub-optimal.
You said it was actual code, so I didn't pay close attention. Does this mean you have now fixed an error in your code? You're welcome. ;-)
^2 On my computer the split idiom is about forty percent slower:
$ ./python -m timeit "['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']" 200000 loops, best of 5: 1.44 usec per loop
$ ./python -m timeit "'a b c d e f g h'.split()" 100000 loops, best of 5: 2.05 usec per loop
so it's not a trivial slowdown, even if this is unlikely to be a bottleneck in many programs.
I hope you're not declaring that a 0.6 usec slowdown on a single import is worth optimizing. -- ~Ethan~
On Tue, Oct 22, 2019 at 7:57 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Oct 22, 2019 at 04:11:45PM -0400, Todd wrote:
On Tue, Oct 22, 2019 at 3:54 PM Steve Jorgensen <stevej@stevej.name> wrote:
See
https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio...
for what Ruby offers.
For me, the arrays are the most useful aspect.
%w{one two three} => ["one", "two", "three"]
I would expect %w{ ... } to return a set, not a list:
%w[ ... ]  # list
%w{ ... }  # set
%w( ... )  # tuple
This is growing into an entire new group of constructors for a very, very limited number of operations that have been privileged for some reason. Should %{a=b c=d} create dicts, too? Why not? Why should strings be privileged over, say, numbers? Why should %w[1 2 3] make ['1', '2', '3'] instead of [1, 2, 3]? And why whitespace instead of a comma? We have general ways to handle all of this stuff that doesn't lock us into a single special case.
and I would describe them as list/set/tuple "word literals". Unlike list etc displays [spam, eggs, cheese] these would actually be true literals that can be determined entirely at compile-time.
I don't know enough about the internals to say whether this would be possible or not.
I am not seeing the advantage of this. Can you provide some specific examples that you think would benefit from this syntax?
I would use this feature, or something like it, a lot, especially in doctests where there is a premium in being able to keep examples short and on one line.
Here is a small selection of examples from my code that would be improved by something like the suggested syntax. I have trimmed some of them for brevity, and to keep them on one line. (Anything with an ellipsis ... has been trimmed.) I have dozens more, but they'll all pretty similar and I don't want to bore you.
__slots__ = ('key', 'value', 'prev', 'next', 'count')
__all__ = ["Mode_Estimators", "Location", "mfv", ...]
The "string literal".split() idiom is especially common, especially for data tables of strings. Here are some examples:
NUMBERS = ('zero one two three ... twenty-eight twenty-nine').split()
_TOKENS = set("indent assign addassign subassign ...".split())
__all__ = 'loopup loopdown reduce whileloop recursive product'.split()
for i, colour in enumerate('Black Red Green Yellow Blue Magenta Cyan White'.split()):
for methodname in 'pow add sub mul truediv'.split():
attrs = "__doc__ __version__ __date__ __author__ __all__".split()
names = 'meta private dunder ignorecase invert'.split()
unsorted = "The quick brown Fox jumps over the lazy Dog".split()
blocks = chaff.pad('flee to south'.split(), key='george')
minmax('aa bbbb c ddd eeeee f ggggg'.split(), key=len)
My estimate is that I would use this "string literal".split() idiom:
- about 60-70% in doctests;
- about 5-10% in other tests;
- about 25% in non-test code.
Anyone who has had to write out a large, or even not-so-large, list of words could benefit from this. Why quote each word individually like a drudge, when the compiler could do it for you at compile-time?
Specifically as a convenience for this "list of words" use-case, namedtuple splits a single string into words, e.g.
namedtuple('Parameter', 'name alias default')
I do the same in some of my functions as well, to make it easier to pass lists of words.
Similarly, support for keyword arguments in the dict constructor was specifically added to ease the case where your keys were single words:
# {'spam': 1, 'eggs': 2}
dict(spam=1, eggs=2)
Don't underestimate the annoyance factor of having to write out things by hand when the compiler could do it for you. Analogy: we have list displays to make it easy to construct a list:
mylist = [2, 7, -1]
but that's strictly unnecessary, since we could construct it like this:
mylist = list()
mylist.append(2)
mylist.append(7)
mylist.append(-1)
If you think I'm being facetious about the list example, you've probably never used standard Pascal, which had arrays but no syntax to initialise them except via a sequence of assignments. That wasn't too bad if you could put the assignments in a loop, but was painful if the initial entries were strings or floats.
Yes, I understand that Python has syntactic sugar. But any new syntactic sugar necessarily has an uphill battle due people having to learn it, books and classes having to be updated, linters updated, new pep8 guidelines written, etc. We already have a way to split strings. So the question is why we need this in addition to what we already have, especially considering it is so radically different than anything else in Python. If the primary use-case is docstrings, then this is something everyone will have to learn very early on, it wouldn't be something people could just ignore if they didn't want to use it like, say, the @ matrix multiplication operator. So everyone would have to learn a completely new way of building lists, tuples, and sets that only applies to a particular combination of strings and whitespace.
For the example you gave, besides saving a few characters I don't see the advantage over the existing way we have to do that:
'one two three'.split()
One of the reasons why Python is "slow" is that lots of things that can be done at compile-time are deferred to run-time. I doubt that splitting short strings will often be a bottle-neck, but idioms like this cannot help to contribute (even if only a little bit) to the extra work the Python interpreter does at run-time:
load a pre-allocated string constant
look up the "split" attribute in the instance (not found)
look up the "split" attribute in the class
call the descriptor protocol which returns a method
call the method
build and return a list
garbage collect the string constant
versus:
build and return a list from pre-allocated strings
(Or something like this, I'm not really an expert on the Python internals, I just pretend to know what I'm talking about.)
Yes, but as far as I am aware Python doesn't typically add new syntax just to avoid a small performance penalty. The new syntax should have some real use-cases that current syntax can't solve. I am not seeing that here.
Python usually uses [ ] for list creation or indexing. Co-opting it for a substantially different purpose of string processing like this doesn't strike me as a good idea, especially since we have two string identifiers already, ' and ".
I'm not sure why you describe this as "string processing". The result you get is a list, not a string. This would be pure syntactic sugar for:
%w[words] # "words".split() %w{words} # set("words".split()) %w(words) # tuple("words".split())
except done by the compiler, at compile-time, not runtime.
The result is a list, but the input is a string. It is string processing the same way all the string methods are string processing.
On Tue, Oct 22, 2019 at 08:53:53PM -0400, Todd wrote: [I wrote this]
I would expect %w{ ... } to return a set, not a list:
%w[ ... ]  # list
%w{ ... }  # set
%w( ... )  # tuple
[Todd replied]
This is growing into an entire new group of constructors for a very, very limited number of operations that have been privileged for some reason.
Sure. That's what syntactic sugar is: privileging one particular thing over another. That's why, for example, we privilege the idiom:

import spam
eggs = spam.eggs

by giving it special syntax, but not

class Spam:
    ...
spam = Spam(args)
del Spam

Some things are privileged. We privilege for-loops as comprehensions, but not while-loops; we privilege getting a bunch of indexes in a sequence as a slice ``sequence[start:end]`` but not getting a bunch of items from a dict. Not everything can be syntactic sugar; but that doesn't mean nothing should be syntactic sugar.
Should %{a=b c=d} create dicts, too? Why not?
Probably not, because we can already say ``dict(spam=x)`` to get the key "spam". That's specifically one of the motivating examples why dict takes keyword arguments. In the early days of Python, it didn't.
Why should strings be privileged over, say, numbers?
Because we don't write ints or floats or complex numbers with delimiters. We say 5, not "5".
Why should %w[1 2 3] make ['1', '2', '3'] instead of [1, 2, 3]?
Because the annoyance factor of having to quote each word is far greater than the annoyance factor of having to put commas between values.
And why whitespace instead of a comma?
Because separating words with whitespace is convenient when you have a lot of data. The spacebar, Tab and Enter keys are nice, big targets which are easy to hit, the comma isn't.

Splitting on whitespace means that spaces and newlines Just Work:

data = %w[alpha beta gamma ... psi chi omega]
# gives ['alpha', 'beta', 'gamma', ... 'psi', 'chi', 'omega']

whereas splitting on commas alone gives you a nasty surprise:

data = %w[alpha, beta, gamma, ..., psi, chi, omega]
# ['alpha', ' beta', ' gamma', ..., '\n psi', ' chi', ' omega']

To avoid that, you need to complicate the rule to something like "commas or whitespace", or "commas optionally followed by whitespace", or something even more complicated. The more complicated the rule, the more surprising it will be when you get caught out by some odd corner case of the rule you weren't expecting.

Splitting on whitespace is a nice, simple rule that cannot go wrong. Why make it more complicated than it needs to be?
We have general ways to handle all of this stuff that doesn't lock us into a single special case.
Who is talking about locking us into a special case? "string literal".split() will still work, so will ["string", "literal"].
and I would describe them as list/set/tuple "word literals". Unlike list etc displays [spam, eggs, cheese] these would actually be true literals that can be determined entirely at compile-time.
I don't know enough about the internals to say whether this would be possible or not.
It would be a pretty awful compiler that couldn't take a space-separated sequence of characters and compile them as strings. I'm not wedded to the leading and trailing delimiters %w[ and ] if they turn out to be ambiguous with something else (the % operator?), but I don't think they will be. [...]
Yes, I understand that Python has syntactic sugar. But any new syntactic sugar necessarily has an uphill battle due people having to learn it, books and classes having to be updated, linters updated, new pep8 guidelines written, etc. We already have a way to split strings. So the question is why we need this in addition to what we already have,
Because it smooths out a minor annoyance and makes for a more pleasant programming experience for the coder, without having to worry (rightly or wrongly) about performance. The status quo is that every time I need to write a list or set of words, I have to stop and think: "Should I quote them all by hand, like a caveman, or get the interpreter to split it? If I get the interpreter to split it, will it hurt performance?" but with this proposed syntax, there is One Obvious Way to write a list of words. I won't have to think about it, or worry that I should be worrying about performance.
especially considering it is so radically different than anything else in Python.
Your idea of "radically different" is a lot less radical than mine. To me, radically different would mean something like Hypertalk syntax: put the value of the third line of text into word seven of result or Reverse Polish Notation syntax. Not adding a prefix to list delimiters. We already have string prefixes, we already have list delimiters, putting the two concepts together is not a huge conceptual leap. Certainly a lot smaller than adding async to the language.
So everyone would have to learn a completely new way of building lists, tuples, and sets that only applies to a particular combination of strings and whitespace.
Yes, everyone would have to learn this new feature. It would take most people approximately five seconds to get the basics and maybe a minute to explore the consequences in full. This isn't a complicated feature: it's a list of whitespace delimited words.

The most complicated feature I can think of is whether we should allow escaping spaces or not:

names = %w[Aaron Susan Helen Fred Mary\ Beth]
names = %w[Aaron Susan Helen Fred Mary%x20Beth]

[...]
The new syntax should have some real use-cases that current syntax can't solve. I am not seeing that here.
True, this doesn't solve any problems that can't already be solved. It is pure syntactic sugar. It's nice when new syntax lets us do things that we couldn't do before, but that's not a requirement, and Python has lots of syntactic sugar simply because it makes the programming experience nicer:

- decorator syntax versus explicit function calls
- keyword arguments versus positional arguments
- ``from ... import`` versus plain old ``import``
- slicing could easily be a method call
- f-strings versus string formatting
- triple-quoted strings, string escapes, and raw strings.

I doubt that this proposal will change the language ecosystem as decorators did, but I think it would be a small improvement that removes a minor pain point. [...]
The result is a list, but the input is a string. It is string processing the same way all the string methods are string processing.
All source code is nothing but strings. We don't normally include parsing or lexing source code, or compile-time preprocessing, as "string processing". String processing normally refers to the actions your program takes to process strings, as opposed to compiling your program in the first case.

When the compiler parses a list containing string literals into code:

# input
data = ['a', 'b', 'c']

# output
  1           0 LOAD_CONST               0 ('a')
              3 LOAD_CONST               1 ('b')
              6 LOAD_CONST               2 ('c')
              9 BUILD_LIST               3
             12 STORE_NAME               0 (data)
             15 LOAD_CONST               3 (None)
             18 RETURN_VALUE

we don't call that "string processing" (unless you're a compiler writer, I guess), and we shouldn't call %w[a b c] that either. The output should be identical, and we could implement this right now with a source code pre-processor.

-- Steven
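That kind of disassembly can be reproduced with the dis module, which also makes it easy to compare against the split() spelling (a quick sketch; the exact opcodes and offsets vary between CPython versions):

import dis

dis.dis("data = ['a', 'b', 'c']")     # LOAD_CONSTs followed by BUILD_LIST
print("---")
dis.dis("data = 'a b c'.split()")     # one constant, then an attribute lookup and a call at run time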
On Wed, Oct 23, 2019 at 8:33 PM Steven D'Aprano <steve@pearwood.info> wrote:
The most complicated feature I can think of is whether we should allow escaping spaces or not:
names = %w[Aaron Susan Helen Fred Mary\ Beth]
names = %w[Aaron Susan Helen Fred Mary%x20Beth]
The second one? No. If you want that, use a post-processor or something.

Using a backslash to escape a space would be a decent option, but I'd also be fine with disallowing it, if it makes it easier to define the grammar. If this syntax is restricted to a blank-separated sequence of atoms ("NAME" in the grammar), it will still be of significant value, and there's always the option to make it more flexible in the future.

But I'm not a fan of the %w syntax. If it comes to selection of colour for the bikeshed, I'd rather that the list be created using another variant of the same syntax we currently have for list creation:

numbers = [1, 2, 3, 4, 5]
from_loop = [x * 2 for x in numbers]
names = [from Aaron Susan Helen Fred Mary]

"Build a list from this set of words." Every list creation starts with an open bracket and ends with a close bracket.

But that's just bikeshedding.

ChrisA
On Wed, Oct 23, 2019 at 08:50:06PM +1100, Chris Angelico wrote:
On Wed, Oct 23, 2019 at 8:33 PM Steven D'Aprano <steve@pearwood.info> wrote:
The most complicated feature I can think of is whether we should allow escaping spaces or not:
names = %w[Aaron Susan Helen Fred Mary\ Beth]
names = %w[Aaron Susan Helen Fred Mary%x20Beth]
The second one? No. If you want that, use a post-processor or something.
Ouch! Sorry, that was a brain-fart, I meant \x20 like in a string. We surely will want to support the standard range of string escapes, not just ASCII identifiers, so once you support string escapes, you get \x20 for free. The words should be arbitrary sequences of Unicode characters, not just limited to identifiers.
But I'm not a fan of the %w syntax.
I'm not wedded to it :-) -- Steven
On Wed, Oct 23, 2019 at 9:23 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Oct 23, 2019 at 08:50:06PM +1100, Chris Angelico wrote:
On Wed, Oct 23, 2019 at 8:33 PM Steven D'Aprano <steve@pearwood.info> wrote:
The most complicated feature I can think of is whether we should allow escaping spaces or not:
names = %w[Aaron Susan Helen Fred Mary\ Beth]
names = %w[Aaron Susan Helen Fred Mary%x20Beth]
The second one? No. If you want that, use a post-processor or something.
Ouch! Sorry, that was a brain-fart, I meant \x20 like in a string.
Oh! Then I withdraw the objection, heh.
We surely will want to support the standard range of string escapes, not just ASCII identifiers, so once you support string escapes, you get \x20 for free. The words should be arbitrary sequences of Unicode characters, not just limited to identifiers.
If you have string escapes, is "\]" a literal close bracket? It isn't in a string literal, and yet people will expect to be able to escape the delimiter. I think the proposal would work fine with a restricted alphabet for the tokens, with room to potentially expand it in the future. ChrisA
On Wed, Oct 23, 2019 at 5:30 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Oct 22, 2019 at 08:53:53PM -0400, Todd wrote:
[I wrote this]
I would expect %w{ ... } to return a set, not a list:
%w[ ... ]  # list
%w{ ... }  # set
%w( ... )  # tuple
[Todd replied]
This is growing into an entire new group of constructors for a very, very limited number of operations that have been privileged for some reason.
Sure. That's what syntactic sugar is: privileging one particular thing over another. That's why, for example, we privilege the idiom:

import spam
eggs = spam.eggs

by giving it special syntax, but not

class Spam:
    ...
spam = Spam(args)
del Spam

Some things are privileged. We privilege for-loops as comprehensions, but not while-loops; we privilege getting a bunch of indexes in a sequence as a slice ``sequence[start:end]`` but not getting a bunch of items from a dict. Not everything can be syntactic sugar; but that doesn't mean nothing should be syntactic sugar.
This is getting bogged down in details. Let me explain as simply as I can why I don't think this is a good idea.

Everyone has a different set of things they want privileged with a new syntax. Everyone has different things they consider to be "annoyances" that they wish took fewer characters to do. And everyone who wants a new syntax thinks that new syntax should be the "one way" of doing that operation. If we accepted every syntax everyone wants the language would be unusable. We have to draw the line somewhere.

For any new syntax I can think of, it significantly simplified real use-cases, was more expressive in some way, or made things more consistent. This, on the other hand, does none of these. Getting a performance benefit doesn't require a new syntax. So the only benefit this has is saving a few characters once per operation, at the expense of being less flexible. And again, if we made a new syntax every time someone wanted to save a few characters the language would be unusable.

So I just don't think this reaches what is my understanding of the bar new syntax has to reach.
I would expect %w{ ... } to return a set, not a list:
%w[ ... ]  # list
%w{ ... }  # set
%w( ... )  # tuple
and I would describe them as list/set/tuple "word literals". Unlike list etc displays [spam, eggs, cheese] these would actually be true literals that can be determined entirely at compile-time.
A more convenient way to populate lists/tuples/sets full of strings at compile time seems like a win.

If I might be allowed to bikeshed: the w seems unnecessary. Why not drop it in favor of a single character like %, and use an optional r for raw strings?

%[words]     # "words".split()
%{words}     # set("words".split())
%(words)     # tuple("words".split())
%r[wo\rds]   # "wo\\rds".split()
%r{wo\rds}   # set("wo\\rds".split())
%r(wo\rds)   # tuple("wo\\rds".split())
On 10/23/19 6:07 AM, Ricky Teachey wrote:
I would expect %w{ ... } to return a set, not a list:
%w[ ... ]  # list
%w{ ... }  # set
%w( ... )  # tuple
and I would describe them as list/set/tuple "word literals". Unlike list etc displays [spam, eggs, cheese] these would actually be true literals that can be determined entirely at compile-time.
A more convenient way to populate lists/tuples/sets full of strings at compile time seems like a win.
If I might be allowed to bikeshed: the w seems unnecessary. Why not drop it in favor of a single character like %, and use an optional r for raw strings?
%[words]     # "words".split()
%{words}     # set("words".split())
%(words)     # tuple("words".split())
%r[wo\rds]   # "wo\\rds".split()
%r{wo\rds}   # set("wo\\rds".split())
%r(wo\rds)   # tuple("wo\\rds".split())
At that point, the "obvious" choice is an "s" (short for "split") string rather than a whole new construct: >>> s"one two three" ["one", "two", "three"] which could be combined with "r" like f and b strings.
On 22/10/2019 20:53, Steve Jorgensen wrote:
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers.
For me, the arrays are the most useful aspect.
%w{one two three} => ["one", "two", "three"]
This smells like Perl's quoting operators. I wasn't a big fan of them even when I was a Perlmonger. Given the choice of "glyph doing something" and "glyph doing something I understand", I'll take the latter every time. -- Rhodri James *-* Kynesim Ltd
I don't see what's wrong with `["one", "two", "three"]`. It's the most explicit, and from the compiler perspective it's probably also as optimal as it can get. Also it doesn't hurt readability. Actually it helps. With syntax highlighting the word boundaries immediately become clear.

If you're having long lists of string literals and you're annoyed by having to type `"` and `,` for every element, then it is the job of your IDE to properly support you while coding, not the job of the syntax (as long as it's clear and concise). For that reason all the advanced IDEs with all their features exist.

Without code completion, for example, you could also ask for new syntax that helps you abbreviate long variable names, because it's too much to type. So instead of writing `this_is_a_very_long_but_expressive_name` you could do `this_is...` in case there's only one name that starts with "this_is" which can be resolved from your scope. That would even shorten the code. Nevertheless I think that code completion is a good idea and that we have to use the exact same name every time.

The same applies to these "word literals". If you need a list of words, you can already create a list literal with the words inside. If that's too much typing, then you should ask your favorite IDE to implement corresponding refactoring assistance. I'm pretty sure the guys at PyCharm would consider adding something like this (e.g. if the caret is inside a string literal you can access the context menu via <alt>+<enter> and there could be something like "split words").

Steve Jorgensen wrote:
See https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_%_Notatio... for what Ruby offers. For me, the arrays are the most useful aspect. %w{one two three} => ["one", "two", "three"]
I did a search, and I don't see that this has been suggested before, but I might have missed something. I'm guessing I'm not the first person to ask whether this seems like a desirable feature to add to Python.
This is getting a little ridiculous.

If you can get by with a single literal list of words, write it once.
...if it's long enough to be annoying or becomes a maintenance burden, use the `split()` idiom.
...if that's considered a "hack" or "bad form", then run it in the shell once and copy/paste the result.
...if it might get mutated in a loop, copy it (words[:]). You'd be constructing a new one anyways.
...if it's just too long to maintain in code, just load it from a txt file once at runtime.

These are simple engineering solutions to simple engineering problems. As a bonus, they don't require implementation maintainers to redesign their tokenizers/parsers or build a brand-new preprocessor.

Python-ideas truly is a unique, wonderful, exhausting place. Back to the "plus/pipe" thread (formerly known as the "PEP 584" thread)...
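For that last option, loading the words from a text file is a couple of lines (the filename and the one-word-per-line layout here are just assumptions for illustration):

from pathlib import Path

# words.txt: one word per line, blank lines ignored
WORDS = [line.strip()
         for line in Path("words.txt").read_text(encoding="utf-8").splitlines()
         if line.strip()]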
If you can get by with a single literal list of words, write it once.
...if it's long enough to be annoying or becomes a maintenance burden, use the `split()` idiom.
...if that's considered a "hack" or "bad form", then run it in the shell once and copy/paste the result.
...if it might get mutated in a loop, copy it (words[:]). You'd be constructing a new one anyways.
...if it's just too long to maintain in code, just load it from a txt file once at runtime.
Well said, Brandt.

My personal preference is to run "...".split() in the shell and copy/paste the output, as it takes an incredibly minimal amount of time to start up the REPL for simple one liners. In my experience, folks often seem to forget that the REPL (or IDLE shell) exists outside of demo examples; it's highly useful for quick micro-scripts.

I'm not 100% opposed to the proposed functionality, but I'm against the syntax and don't consider the lack of a shortcut to be particularly detrimental in this case. IMO, anything that falls under the category of being a syntactical shortcut should be highly readable and fairly obvious as to what it's doing at a first glance. Otherwise, it adds an unnecessary cost to the learning curve of Python (which can very rapidly accumulate if it's not kept in check).

On Thu, Oct 24, 2019 at 12:39 PM Brandt Bucher <brandtbucher@gmail.com> wrote:
This is getting a little ridiculous.
If you can get by with a single literal list of words, write it once. ...if it's long enough to be annoying or becomes a maintenance burden, use the `split()` idiom. ...if that's considered a "hack" or "bad form", then run it in the shell once and copy/paste the result. ...if it might get mutated in a loop, copy it (words[:]). You'd be constructing a new one anyways. ...if it's just too long to maintain in code, just load it from a txt file once at runtime.
These are simple engineering solutions to simple engineering problems. As a bonus, they don't require implementation maintainers to redesign their tokenizers/parsers or build a brand-new preprocesser.
Python-ideas truly is a unique, wonderful, exhausting place. Back to the "plus/pipe" thread (formerly known as the "PEP 584" thread)...
participants (19)

- Andrew Barnert
- Brandt Bucher
- Chris Angelico
- Christopher Barker
- Dan Sommers
- David Mertz
- Dominik Vilsmeier
- Ethan Furman
- Greg Ewing
- Kyle Stanley
- Rhodri James
- Richard Damon
- Richard Musil
- Ricky Teachey
- Rob Cliffe
- Serhiy Storchaka
- Steve Jorgensen
- Steven D'Aprano
- Todd