allow `lambda' to be spelled λ
Hi list,

Here is my speculative language idea for Python: allow the following alternative spelling of the keyword `lambda': λ (that is, "Unicode Character 'GREEK SMALL LETTER LAMDA' (U+03BB)").

Background: I have been using the Vim "conceal" functionality with a rule which visually replaces lambda with λ when editing Python files. I find this a great improvement in readability since λ is visually less distracting while still quite distinctive. (The fact that λ is syntax-colored as a keyword also helps with this.) However, at the moment the nice syntax is lost when looking at the file through another editor or viewer. Therefore I would really like this to be an official part of the Python syntax. I know people have been clamoring for shorter lambda-syntax in the past; I think this is a nice minimal extension.

Example code:

lst.sort(key=lambda x: x.lookup_first_name())
lst.sort(key=λ x: x.lookup_first_name())

# Church numerals
zero = λ f: λ x: x
one = λ f: λ x: f(x)
two = λ f: λ x: f(f(x))

(Yes, Python is my favorite Scheme dialect. Why did you ask?)

Note that a number of other languages (Racket, Haskell) already allow this. You can judge the aesthetics of this on your own code with the following sed command:

sed 's/\<lambda\>/λ/g'

Advantages:

* The lambda keyword is quite long and distracts from the "meat" of the lambda expression. Replacing it by a single-character keyword improves readability.
* The resulting code more closely resembles mathematical notation (in particular, lambda-calculus notation), so it brings Python closer to being "executable pseudo-code".
* The alternative spelling λ/lambda is quite intuitive (at least to anybody who knows Greek letters).

Disadvantages (already noted here for your convenience):

* Introducing λ is introducing TIMTOWTDI.
* Hard to type with certain editors. But note that the old syntax is still available. Easy to fix by upgrading to Vim ;-)
* Will turn a pre-existing legal identifier λ into a keyword, so it is backward-incompatible.

Needless to say, my personal opinion is that the advantages outweigh the disadvantages. ;-)

Greetings, Stephan
On Tue, Jul 12, 2016, at 09:42, Random832 wrote:
On Tue, Jul 12, 2016, at 08:38, Stephan Houben wrote:
I know people have been clamoring for shorter lambda-syntax in the past, I think this is a nice minimal extension.
How about a Haskell-style backslash?
Nobody has any thoughts at all?
To me the backslash already has a fairly strong association with "the next character is a literal". Overloading it would feel very strange. S
On Thu, Jul 14, 2016 at 11:22 PM, SW <walker_s@hotmail.co.uk> wrote:
To me the backslash already has a fairly strong association with "the next character is a literal". Overloading it would feel very strange.
But it also has the meaning of "the next character is special", such as \n for newline or \uNNNN for a Unicode escape. However, I suspect there might be a parsing conflict:

do_stuff(stuff_with_long_name, more_stuff, what_is_next_arg, \

At that point in the parsing, are you looking at a lambda function or a line continuation? Sure, style guides would decry this (put the backslash with its function, dummy!), but the parser can't depend on style guides being followed.

-1 on using backslash for this. -0 on λ.

ChrisA
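To see the collision Chris describes in runnable form, here is a minimal sketch; do_stuff and its arguments are made up for illustration, and the hypothetical lambda spelling appears only in comments because it is not valid Python today.

def do_stuff(*args):
    # stand-in for the do_stuff above; just adds its arguments
    return sum(args)

# Valid today: the trailing backslash is an explicit line continuation,
# so the parser simply keeps reading on the next physical line.
result = do_stuff(1, 2, \
                  3)
print(result)  # 6

# If "\x: expr" were an anonymous-function spelling, a tokenizer seeing
# "key=\" at the end of a line could not tell "continue this line" from
# "a lambda whose parameter list starts on the next line" without extra lookahead.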
Ah, yes, sorry - it certainly holds that meaning to me as well. I agree with your stated views on this (and ratings): -1 on using backslash for this. -0 on λ. Thanks, S
For better or worse, except for string literals (which can be anything as long as you set a coding comment), Python is pure ASCII, which simplifies everything. The λ character is not in the first 128 characters of Unicode, so it is highly unlikely to be accepted.

* ‘Lambda’ is exactly as discouraging to type as it needs to be. A more likely to be accepted alternate keyword is ‘whyareyounotusingdef’.
* Python doesn’t attempt to look like mathematical formulae.
* The ‘lambda’ spelling is intuitive to most people who program.
* TIMTOWTDI isn’t a religious edict. Python is more pragmatic than that.
* It’s hard to type in ALL editors unless your locale is set to (ancient?) Greek.
* … What are you doing to have an identifier outside of ‘[A-Za-z_][A-Za-z0-9_]*’?
On Wed, Jul 13, 2016 at 4:36 AM, <tritium-list@sdamon.com> wrote:
For better or worse, except for string literals which can be anything as long as you set a coding comment, python is pure ascii which simplifies everything. Lambda is not in the first 128 characters of Unicode, so it is highly unlikely to be accepted.
Incorrect as of Python 3 - it's pure Unicode :) You can have identifiers that use non-ASCII characters. The core language in the default interpreter is all ASCII in order to make it easy for most people to type, but there are variant Pythons that translate the keywords into other languages (I believe there's a Chinese Python, and possibly Korean?), which are then free to use whatever character set they like. A variant Python would be welcome to translate all the operators and keywords into single-character tokens, using Unicode symbols for NOT EQUAL TO and so on - including using U+03BB in place of 'lambda'. ChrisA
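For anyone who has not tried it, a quick demonstration of Chris's point; note that λ itself is currently just an ordinary identifier, which is exactly why promoting it to a keyword would be backward-incompatible. The variable names here are only examples.

# Non-ASCII letters are legal in Python 3 identifiers...
π = 3.141592653589793
# ...including λ itself, which today is just an ordinary name.
λ = lambda x: 2 * x
print(π, λ(21))   # 3.141592653589793 42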
Chris Angelico writes:
A variant Python would be welcome to translate all the operators and keywords into single-character tokens, using Unicode symbols for NOT EQUAL TO and so on - including using U+03BB in place of 'lambda'.
Probably it would not be "welcome", except in the usual sense that "Python is open source, you can do what you want". There was extensive discussion about the issues surrounding the natural languages used by programmers in source documentation (e.g., identifier choice and comments) at the time of PEP 263. The mojibake (choice of charset) problem has largely improved since then, thanks to Unicode adoption, especially UTF-8. But the "Tower of Babel" issue has not.

Fundamentally, it's like women's clothes (they wear them to impress, i.e., communicate to, other women -- few men have the interest to understand what is impressive ;-): programming is about programmers communicating to other programmers. Maintaining the traditional spelling of keywords and operators is definitely useful for that purpose. This is not to say that individuals who want a personalized[1] language are wrong, just that it would have a net negative impact on communication in teams.

BTW, Barry long advocated use of some variant syntaxes (the one I like to remember inaccurately is "><" instead of "!="), and in fact provided an easter egg import (barry_is_flufl or something like that) that changed the syntax to suit him. I believe that module is pure Python, so people who want to customize the lexical definition of Python at the language level can do so AFAIK. You could probably even spell it "import λ等" (to take a very ugly page from the Book of GNU, mixing scripts in a single word -- the Han character means "etc.").

Footnotes:
[1] I don't have a better word. I mean something like "seasoned to taste", almost "tasteful" but not quite.
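For the record, the easter egg Stephen half-remembers is spelled barry_as_FLUFL (PEP 401), and the operator it restores is <> rather than ><. A two-line demonstration:

# PEP 401: restores the old "<>" inequality operator (and, as part of the
# joke, rejects "!=" in this module).
from __future__ import barry_as_FLUFL

print(1 <> 2)   # True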
On Wed, Jul 13, 2016 at 12:04 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Chris Angelico writes:
A variant Python would be welcome to translate all the operators and keywords into single-character tokens, using Unicode symbols for NOT EQUAL TO and so on - including using U+03BB in place of 'lambda'.
Probably it would not be "welcome", except in the usual sense that "Python is open source, you can do what you want".
... programmers communicating to other programmers. ...
This is not to say that individuals who want a personalized[1] language are wrong, just that it would have a net negative impact on communication in teams.
A fair point. But Python has a strong mathematical side (look how big the numpy/scipy/matplotlib communities are), and we've already seen how strongly they prefer "a @ b" to "a.matmul(b)". If there's support for a language variant that uses more and shorter symbols, that would be where I'd expect to find it. ChrisA not a mathematician, although I play one on the internet sometimes
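As a concrete reminder of the preference Chris mentions (PEP 465's matrix-multiplication operator), a small comparison; this assumes numpy is installed and the matrices are made up:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# The dedicated operator reads much closer to the underlying mathematics
# than the spelled-out call, which is the same argument made for λ.
print(a @ b)
print(np.matmul(a, b))   # identical result, noisier inside a long formula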
I use the vim conceal plugin myself too. It's whimsical, but I like the appearance of it. So I get the sentiment of the original poster. But in my conceal configuration, I substitute a bunch of characters visually (if the attachment works, a screenshot example of some, but not all, of them is in this message). And honestly, having my text editor make the substitution is exactly what I want.

If anyone really wanted, a very simple preprocessor (really just a few lines of sed, or a few str.replace() calls) could transform your *.py+ files into *.py by simple string substitution (a minimal sketch follows after this message). There's no need to change the inherent syntax. This would not be nearly as big a change as a superset language like Coconut (which looks interesting); it's just a 1-to-1 correspondence between strings.

Moreover, even if special characters *were* part of Python, I'd probably want my text editor to provide me shortcuts or aliases. I have no idea how to enter the Unicode GREEK LUNATE EPSILON SYMBOL directly from my keyboard. It would be much more practical for me to type '\epsilon' (à la LaTeX), or really just 'in', and let my editor alias/substitute that for me. Likewise, to get a GREEK SMALL LETTER LAMDA, a very nice editor shortcut would be 'lambda'. Same goes if someone makes a py+ preprocessor that takes the various special symbols: I'd still want mnemonics that are more easily available on my keyboard. On Tue, Jul 12, 2016 at 8:25 PM, Chris Angelico <rosuav@gmail.com> wrote:
A fair point. But Python has a strong mathematical side (look how big the numpy/scipy/matplotlib communities are), and we've already seen how strongly they prefer "a @ b" to "a.matmul(b)". If there's support for a language variant that uses more and shorter symbols, that would be where I'd expect to find it.
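A minimal sketch of the *.py+ to *.py substitution David describes above, literally "a few str.replace() calls"; the replacement table and the file convention are illustrative, not a proposal.

import sys

# One-to-one replacement table; extend to taste.
SUBSTITUTIONS = {
    "λ": "lambda",
    "≠": "!=",
    "≤": "<=",
}

def translate(source: str) -> str:
    # Naive textual substitution; a more careful tool would use the
    # tokenize module so string literals and comments are left alone.
    for fancy, plain in SUBSTITUTIONS.items():
        source = source.replace(fancy, plain)
    return source

if __name__ == "__main__":
    # Usage: python pyplus_to_py.py < program.py+ > program.py
    sys.stdout.write(translate(sys.stdin.read()))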
On Wed, Jul 13, 2016 at 11:04:19AM +0900, Stephen J. Turnbull wrote:
There was extensive discussion about the issues surrounding the natural languages used by programmers in source documentation (eg, identifier choice and comments) at the time of PEP 263. The mojibake (choice of charset) problem has largely improved since then, thanks to Unicode adoption, especially UTF-8. But the "Tower of Babel" issue has not. Fundamentally, it's like women's clothes (they wear them to impress, ie, communicate to, other women -- few men have the interest to understand what is impressive ;-): programming is about programmers communicating to other programmers.
With respect, Stephen, that's codswallop :-) It might be true that the average bogan[1] bloke or socially awkward geek (including myself) might not care about impressive clothes, but many men do dress to compete. The difference is more socio-economic: typically women dress to compete across most s-e groups, while men mostly do so only in the upper-middle and upper classes. And in the upper classes, competition tends to be more understated and subtle ("good taste"), i.e. expensive Italian suits rather than hot pants. Historically, it is usually men who dress like peacocks to impress socially, while women are comparatively restrained. The drab business suit of Anglo-American influence is no more representative of male clothing through the ages than is the Communist Chinese "Mao suit".

And as for programmers... the popularity of one-liners, the obfuscated C competition, code golf, "clever coding tricks" etc. is rarely for the purposes of communication *about code*. Communication is taking place, but it's about social status and cleverness. There's a very popular StackOverflow site dedicated to code golf, where you will see people have written their own custom languages specifically for writing terse code. Nobody expects these languages to be used by more than a handful of people. That's not their point.
Maintaining the traditional spelling of keywords and operators is definitely useful for that purpose.
Okay, let's put aside the social uses of code-golfing and equivalent, and focus on quote-unquote "real code", where programmers care more about getting the job done and keeping it maintainable rather than competing with other programmers for status, jobs, avoiding being the sacrificial goat in the next round of stack-ranked layoffs, etc.

You're right of course that traditional spelling is useful, but perhaps not as much as you think. After all, one person's traditional spelling is another person's confusing notation and a third person's excessively verbose spelling. Not too many people like Cobol-like spelling:

add 1 to the_number

over "n += 1". So I think that arguments for keeping "traditional spelling" are mostly about familiarity. If we learned lambda calculus in high school, perhaps λ would be less exotic. I think that there is a good argument to be made in favour of increasing the amount of mathematical notation used in code, but then I would, since a lot of my code is mathematical in nature; I can see that makes my code atypical.

Coming back to the specific change suggested here, λ as an alternative keyword for lambda, I have a minor and a major objection:

The minor objection is that I think that λ is too useful a one-letter symbol to waste on a comparatively rare usage, anonymous functions. In mathematical code, I would prefer to keep λ for wavelength, or for the radioactive decay constant, rather than for anonymous functions.

The major objection is that I think it's still too hard to expect the average programmer to be able to produce the λ symbol on demand. We don't all have a Greek keyboard :-)

I *don't* think that expecting programmers to learn λ is too difficult. It's no more difficult than the word "lambda", or that | means bitwise OR. Or for that matter, that * means multiplication. Yes, I've seen beginners stumped by that. (Sometimes we forget that * is not something you learn in maths class.)

So overall, I'm a -1 on this specific proposal.

[1] Non-Australians will probably recognise similar terms hoser, redneck, chav, gopnik, etc.

-- Steve
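To make Steven's minor objection concrete, here is λ doing ordinary duty as a mathematical identifier, which a λ keyword would take away; the half-life figure is the usual carbon-14 value, used purely as an illustration.

import math

λ = math.log(2) / 5730          # radioactive decay constant (per year)

def remaining_fraction(t_years):
    # Exponential decay: N(t)/N0 = exp(-λ t)
    return math.exp(-λ * t_years)

print(remaining_fraction(5730))   # ~0.5 after one half-life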
On 13 July 2016 at 05:00, Steven D'Aprano <steve@pearwood.info> wrote:
Not too many people like Cobol-like spelling:
add 1 to the_number
over "n += 1". So I think that arguments for keeping "traditional spelling" are mostly about familiarity. If we learned lambda calculus in high school, perhaps λ would be less exotic.
It's probably also relevant in this context that more "modern" languages tend to avoid the term lambda but embrace "anonymous functions" with syntax such as (x, y) -> x+y or whatever. So while "better syntax for lambda expressions" is potentially a reasonable goal, I don't think that perpetuating the concept/name "lambda" is necessary or valuable. Paul
On 13.07.2016 10:51, Paul Moore wrote:
On 13 July 2016 at 05:00, Steven D'Aprano <steve@pearwood.info> wrote:
Not too many people like Cobol-like spelling:
add 1 to the_number
over "n += 1". So I think that arguments for keeping "traditional spelling" are mostly about familiarity. If we learned lambda calculus in high school, perhaps λ would be less exotic. It's probably also relevant in this context that more "modern" languages tend to avoid the term lambda but embrace "anonymous functions" with syntax such as
(x, y) -> x+y
or whatever.
So while "better syntax for lambda expressions" is potentially a reasonable goal, I don't think that perpetuating the concept/name "lambda" is necessary or valuable.
Exactly, there's not much value in having yet another way of writing 'lambda:'. Keeping other languages in mind and the conservative stance Python usually takes, the arrow ('=>') would be the only valid alternative for a "better syntax for lambda expressions". However, IIRC, this has been debated and won't happen. Personally, I have other associations with λ. Thus, I would rather see it as a variable name in such contexts.
Steven D'Aprano writes:
And as for programmers... the popularity of one-liners, the obfuscated C competition, code golf, "clever coding tricks" etc is rarely for the purposes of communication *about code*.
Sure, but *Python* is popular because it's easy to communicate *with* and (usually) *about* Python code, and it does pretty well on "terse" for many algorithmic idioms. (Yes, there are other reasons -- reasonable performance, batteries included, etc. That doesn't make the design of the language *not* a reason for its popularity.) You seem to be understanding my statements to be much more general than they are. I'm only suggesting that this applies to Python as we know and love it, and to Pythonic tradition.
The major objection is that I think its still too hard to expect the average programmer to be able to produce the λ symbol on demand. We don't all have a Greek keyboard :-)
So what? If you run Mac OS X, Windows, or X11, you do have a keyboard capable of producing Greek. And the same chords work in any Unicode-capable editor, it's just that the Greek letters aren't printed on the keycaps. Neither are emoticons, nor the CUA gestures (bucky-X[1], bucky-C, bucky-V, and the oh-so-useful bucky-Z) but those are everywhere. Any 10-year-old can find them somehow!

To the extent that Python would consider such changes (i.e., a half-dozen or so one-character replacements for multicharacter operators or keywords), it would be very nearly as learnable to type them as to read them. The problem (if it exists, of course -- obviously, I believe it does but YMMV) is all about overloading people's ability to perceive the meaning of code without reading it token by token.

Footnotes:
[1] Bucky = Control, Alt, Meta, Command, Option, Windows, etc. keys.
On 13 July 2016 at 19:15, Stephen J. Turnbull <stephen@xemacs.org> wrote:
So what? If you run Mac OS X, Windows, or X11, you do have a keyboard capable of producing Greek. And the same chords work in any Unicode- capable editor, it's just that the Greek letters aren't printed on the keycaps. Neither are emoticons, nor the CUA gestures (bucky-X[1], bucky-C, bucky-V, and the oh-so-useful bucky-Z) but those are everywhere. Any 10-year-old can find them somehow!
Um, as someone significantly older than 10 years old, I don't know how to type a lambda character on my Windows UK keyboard... Note that memorising the Unicode code point, and doing the weird numpad "enter a numeric character code" trick that I can never remember how to do, doesn't count as a realistic option... If it were that easy to type arbitrary characters, why would Vim have the "digraph" facility to insert Unicode characters using 2-character abbreviations? Paul
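For what it's worth, Python itself can at least produce the character even when the keyboard cannot, and (if memory serves) Vim's digraph for it is Ctrl-K l *. A tiny illustration:

import unicodedata

print("\u03bb")                         # by code point
print("\N{GREEK SMALL LETTER LAMDA}")   # by official Unicode name
print(unicodedata.name("λ"))            # GREEK SMALL LETTER LAMDA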
On Wed, Jul 13, 2016 at 2:20 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Um, as someone significantly older than 10 years old, I don't know how to type a lambda character on my Windows UK keyboard...
FWIW, in IPython/Jupyter notebooks one can type \lambda followed by a tab to get the λ character. The difficulty of typing is a red herring. Once it is a part of the language, every editor targeted at a Python programmer will provide the means to type λ with fewer than 6 keystrokes (6 is the number of keystrokes needed to type "lambda" without autocompletion). The unfamiliarity is also not an issue. I have yet to meet a Python programmer who knows what the keyword "lambda" is but does not know how the namesake Greek character looks.

I am +0 on this feature. I would be +1 if λ was not already a valid identifier.

This is one of those features that can easily be ignored by someone who does not need it, but can significantly improve the experience of those who do.
On Wed, Jul 13, 2016 at 4:42 PM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Wed, Jul 13, 2016 at 2:20 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Um, as someone significantly older than 10 years old, I don't know how to type a lambda character on my Windows UK keyboard...
FWIW, in IPython/Jupyter notebooks one can type \lambda followed by a tab to get the λ character. The difficulty of typing is a red herring. Once it is a part of the language, every editor targeted at a Python programmer will provide the means to type λ with fewer than 6 keystrokes (6 is the number of keystrokes needed to type "lambda" without autocompletion). The unfamiliarity is also not an issue. I have yet to meet a Python programmer who knows what the keyword "lambda" is but does not know how the namesake Greek character looks.
I am +0 on this feature. I would be +1 if λ was not already a valid identifier.
This is one of those features that can easily be ignored by someone who does not need it, but can significantly improve the experience of those who do.
I am -1 on this feature. Sorry to be blunt. Are we going to add omega, delta, psilon and the entire Greek alphabet? There should be one and only one way to write code in Python as far as a valid identifier is concerned. Is there an existing exception? I am not saying the experiences of others do not matter, but we should take a step back and ask whether this actually makes sense. Also, how exactly does someone who doesn't enter Unicode characters beyond the ASCII alphanumerics on a daily basis type this character? How many users can we retain and even convert with this approach? Is it really worth the effort? Thanks. John
On Jul 13, 2016, at 4:12 PM, John Wong <gokoproject@gmail.com> wrote:
Sorry to be blunt. Are we going to add omega, delta, psilon and the entire Greek alphabet?
Breaking news: the entire Greek alphabet is already available for use in Python. If someone wants to write code that looks like a series of missing character boxes on your screen she already can. PS: There is no letter called "psilon" in the Greek alphabet or anywhere else in Unicode.
On 07/13/2016 06:28 PM, Alexander Belopolsky wrote:
On Jul 13, 2016, at 4:12 PM, John Wong wrote:
Sorry to be blunt. Are we going to add omega, delta, psilon and the entire Greek alphabet?
Breaking news: the entire Greek alphabet is already available for use in Python.
Indeed. We can all use any letters in unicode for our identifiers. Does Python currently use any one-letter keywords? So why start now? -- ~Ethan~
On Jul 13, 2016, at 8:57 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Does Python currently use any one-letter keywords?
No.
So why start now?
There is no slippery slope or a hidden agenda in this proposal. The lambda keyword is unique in the Python language. It is the only keyword that does not have an obvious meaning as an English word or an abbreviation of such. There is no other keyword that literally is a name of a letter. (Luckily, iota and rho are spelled range and len in Python and those are not keywords anyways.) Arguably, it would be more Pythonic to spell an anonymous function creation keyword as "function" or "fun" ("func" does not smell right), but Python ended up with an English name for a Greek letter. That was before Unicode became ubiquitous. I don't see much harm in allowing a proper spelling for it now.
On Wed, Jul 13, 2016 at 03:42:01PM -0500, Alexander Belopolsky wrote:
On Wed, Jul 13, 2016 at 2:20 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Um, as someone significantly older than 10 years old, I don't know how to type a lambda character on my Windows UK keyboard...
FWIW, in IPython/Jupyter notebooks one can type \lambda followed by a tab to get the λ character. The difficulty of typing is a red herring. Once it is a part of the language, every editor targeted at a Python programmer will provide the means to type λ
Great for those using an editor targeted at Python programmers, but most editors are more general than that. Which means that programmers will find themselves split into two camps: those who can easily type λ, and those that cannot.

In the 1980s and 90s, I was a Macintosh user, and one nice feature of the Macs at the time was the ease of typing non-ASCII characters. (Of course there were a lot fewer back then: MacRoman is an 8-bit extension to ASCII, compared to Unicode with its thousands of code points.) Consequently I've used an Apple-specific language that included operators like ≠ ≤ ≥ and it is *really nice*. But Apple has the advantage of controlling the entire platform and they could ensure that these characters could be input from any application on any machine using exactly the same key sequence. (By memory, it was option-= to get ≠.) We don't have that advantage, and frankly I think you are underestimating the *practical* difficulties for input.

I recently discovered (by accident!) the Linux compose key. So now I know how to enter µ at the keyboard: COMPOSE mu does the job. So maybe COMPOSE lambda works? Nope. How about COMPOSE l or shift-l or ll or la or yy (it's an upside down y, right, and COMPOSE ee gives ə)? No, none of these things work on my system. They may work on your system: since discovering COMPOSE, I keep coming across people who state "oh, it's easy to type such-and-such a character, just type COMPOSE key-sequence, it's standard and will work on EVERY LINUX SYSTEM EVERYWHERE". Not a chance. The key bindings for COMPOSE are anything but standard. And COMPOSE is *really* hard to use well: it gives no feedback if you make a mistake except to silently ignore your keypresses (or insert the wrong character). So invariably, every time I want to enter a non-ASCII character, it takes me out of "thinking about code" into "thinking about how to enter characters", sometimes for minutes at a time as I hunt for the character in "Character Map" or google for it on the Internet.

It may be reasonable to argue that code is read more than it is written:

- suppose that reading λ has a *tiny* benefit of 1% over "lambda" (for those who have learned what it means);
- but typing it is (let's say) 50 times harder than typing "lambda";
- but we read code 50 times as often as we type it;
- so the total benefit (50*1.01 - 50) is positive.

Invent your own numbers, and you'll come up with your own results. I don't think there's any *objective* way to decide this question. And that's why I don't think that Python should take this step: let other languages experiment with non-ASCII keywords first, or let people experiment with translators that transform ≠ into != and λ into lambda.

-- Steve
On Jul 13, 2016, at 9:44 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Which means that programmers will find themselves split into two camps: those who can easily type λ, and those that cannot.
We already have two camps: those who don't mind using "lambda" and those who would only use "def." I would expect that those who will benefit most are people who routinely write expressions that involve a lambda that returns a lambda that returns a lambda. There is a niche for such programming style and using λ instead of lambda will improve the readability of such programs for those who can understand them in the current form. For the "def" camp, the possibility of a non-ascii spelling will serve as yet another argument to avoid using anonymous functions.
Alexander Belopolsky <alexander.belopolsky@gmail.com> writes:
We already have two camps: those who don't mind using "lambda" and those who would only use "def."
I don't know anyone in the latter camp, do you? I am in the camp that loves ‘lambda’ for some narrowly-specified purposes *and* thinks ‘def’ is generally a better tool. -- Ben Finney
On Thursday, July 14, 2016 at 9:51:16 AM UTC+5:30, Ben Finney wrote:
Alexander Belopolsky <alexander....@gmail.com> writes:
We already have two camps: those who don't mind using "lambda" and those who would only use "def."
I don't know anyone in the latter camp, do you?
I am in the camp that loves ‘lambda’ for some narrowly-specified purposes *and* thinks ‘def’ is generally a better tool.
I suspect the two major camps are those who consider CS to be a branch of math and those who don't. Those who don't typically have a strong resistance to the facts of history: http://blog.languager.org/2015/03/cs-history-0.html
On Jul 13, 2016, at 9:44 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I think you are underestimating the *practical* difficulties for input.
I appreciate those difficulties (I am typing this on an iPhone), but I think they are irrelevant. I can imagine 3 scenarios:

1. (The 99% case) You will never see λ in the code and never write it yourself. You can be happily unaware of this feature.
2. You see λ occasionally, but don't like it. You continue using spelled out "lambda" (or just use "def") in the code that you write.
3. You work on a project where local coding style mandates that lambda is spelled λ. In this case, there will be plenty of places in the code base to copy and paste λ from. (In the worst case you copy and paste it from the coding style manual.) More likely, however, the project that requires λ would have a precommit hook that translates lambda to λ in all new code and you can continue using the 6-character keyword in your input.
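A rough sketch of the pre-commit translation imagined in scenario 3, using the tokenize module so that "lambda" inside strings and comments is left alone; this of course assumes a hypothetical Python in which λ is accepted as a keyword, and the function name is made up.

import io
import tokenize

def to_greek(source: str) -> str:
    out = []
    for tok_type, tok_str, *_ in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keywords arrive as NAME tokens; strings and comments have other types.
        if tok_type == tokenize.NAME and tok_str == "lambda":
            tok_str = "λ"
        out.append((tok_type, tok_str))
    # untokenize's spacing is rough, but the token stream is what matters here.
    return tokenize.untokenize(out)

print(to_greek("key = lambda x: x + 1  # 'lambda' here is untouched\n"))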
3. You work on a project where local coding style mandates that lambda is spelled λ. In this case, there will be plenty of places in the code base to *copy and paste λ from*. (In the worst case you copy and paste it from the coding style manual.) More likely, however, the project that requires λ would have a precommit hook that translates lambda to λ in all new code and you can continue using the 6-character keyword in your input.
That this is your solution makes the proposal a -100, and if you're going to implement a precommit hook to convert lambda to λ you might as well implement a custom codec that supports λ (like pyxl does for html/xml tags). Python isn't and shouldn't be APL.
On Wed, Jul 13, 2016 at 7:44 PM, Steven D'Aprano <steve@pearwood.info> wrote:
- suppose that reading λ has a *tiny* benefit of 1% over "lambda" (for those who have learned what it means); - but typing it is (lets say) 50 times harder than typing "lambda"; - but we read code 50 times as often as we type it; - so the total benefit (50*1.01 - 50) is positive.
I actually *do* think λ is a little bit more readable. And I have no idea how to type it directly on my El Capitan system with the ABC Extended keyboard. But I still get 100% of the benefit in readability simply by using vim's conceal feature. If I used a different editor I'd have to hope for a similar feature (or program it myself), but this is purely a display question.

Similarly, I think syntax highlighting makes my code much more readable, but I don't want colors for keywords built into the language. That is, and should remain, a matter of tooling not core language (I don't want https://en.wikipedia.org/wiki/ColorForth for Python).

FWIW, my conceal configuration is at the link below. I've customized a bunch of special stuff besides lambda, take it or leave it: http://gnosis.cx/bin/.vim/after/syntax/python.vim
On 14.07.2016 08:39, David Mertz wrote:
On Wed, Jul 13, 2016 at 7:44 PM, Steven D'Aprano <steve@pearwood.info <mailto:steve@pearwood.info>> wrote:
- suppose that reading λ has a *tiny* benefit of 1% over "lambda" (for those who have learned what it means); - but typing it is (lets say) 50 times harder than typing "lambda"; - but we read code 50 times as often as we type it; - so the total benefit (50*1.01 - 50) is positive.
I actually *do* think λ is a little bit more readable. And I have no idea how to type it directly on my El Capitan system with the ABC Extended keyboard. But I still get 100% of the benefit in readability simply by using vim's conceal feature. If I used a different editor I'd have to hope for a similar feature (or program it myself), but this is purely a display question. Similarly, I think syntax highlighting makes my code much more readable, but I don't want colors for keywords built into the language. That is, and should remain, a matter of tooling not core language (I don't want https://en.wikipedia.org/wiki/ColorForth for Python).
Very good point. That now is basically the core argument against it at least for me. So, -100 on the proposal from me. :)
FWIW, my conceal configuration is at link I give in a moment. I've customized a bunch of special stuff besides lambda, take it or leave it:
Nice thing. This could help those using lambda a lot (whatever reason they might have to do so). I will redirect it to somebody relying on vim heavily for Python development. Thanks. :)
On Thu, Jul 14, 2016 at 4:19 PM, Sven R. Kunze <srkunze@mail.de> wrote:
On 14.07.2016 08:39, David Mertz wrote:
.....
That is, and should remain, a matter of tooling not core language (I don't want https://en.wikipedia.org/wiki/ColorForth for Python).
Very good point. That now is basically the core argument against it at least for me. So, -100 on the proposal from me. :)
+1. This is the editor's job, not the CPython interpreter's nor the core grammar's. On Wed, Jul 13, 2016 at 9:28 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Jul 13, 2016, at 4:12 PM, John Wong <gokoproject@gmail.com> wrote:
Sorry to be blunt. Are we going to add omega, delta, psilon and the entire Greek alphabet?
Breaking news: the entire Greek alphabet is already available for use in Python. If someone wants to write code that looks like a series of missing character boxes on your screen she already can.
You misunderstood my point. I was referring to identifiers, which is what the proposal is asking about. Of course Unicode is available; people always argue about Unicode, and who doesn't know that Python supports it? Why should I write pi in two English characters instead of typing π? Python is so popular among the science community, so shouldn't we add that as well? Excerpt from the question on http://programmers.stackexchange.com/questions/16010/is-it-bad-to-use-unicod... :

t = (µw-µl)/c   # those are used in
e = ε/c         # multiple places.
σw_new = (σw**2 * (1 - (σw**2)/(c**2)*Wwin(t, e)) + γ**2)**.5

If we were to vote on popularity, we'd be comparing the number of functional programmers vs. scientists. Not saying functional programmers don't matter (I already stressed this in my previous comment), but this is more like an editor plugin. I personally would love to see such a plugin in Vim. Thanks. John
On 14 July 2016 at 23:13, John Wong <gokoproject@gmail.com> wrote:
Why should I write pi in two English characters instead of typing π? Python is so popular among the science community, so shouldn't we add that as well? Excerpt from the question on http://programmers.stackexchange.com/questions/16010/is-it-bad-to-use-unicod...:
t = (µw-µl)/c   # those are used in
e = ε/c         # multiple places.
σw_new = (σw**2 * (1 - (σw**2)/(c**2)*Wwin(t, e)) + γ**2)**.5
I'm not sure what you're saying here. You do realise that the above is perfectly valid Python 3? The SO question you quote is referring to the fact that identifiers are restricted to (Unicode) *letters* and that symbol characters can't be used as variable names. All of which is tangential to the question here which is about using Unicode in a *keyword*. Paul
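To illustrate the distinction Paul draws (letters of any script are accepted in identifiers, symbol characters are not), a small sketch; the variable names are arbitrary and the rejected line is left commented out because it is a SyntaxError.

# Greek (and other) letters are in the identifier-friendly categories:
Δx = 0.01
σ = 2.5

# ∂y = 0.02   # rejected: ∂ is a math *symbol* (category Sm), not a letter
print(Δx, σ)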
On Fri, Jul 15, 2016 at 12:27 AM, Paul Moore <p.f.moore@gmail.com> wrote:
On 14 July 2016 at 23:13, John Wong <gokoproject@gmail.com> wrote:
Why should I write pi in two English characters instead of typing π? Python is so popular among the science community, so shouldn't we add that as well? Excerpt from the question on
http://programmers.stackexchange.com/questions/16010/is-it-bad-to-use-unicod... :
t = (µw-µl)/c   # those are used in
e = ε/c         # multiple places.
σw_new = (σw**2 * (1 - (σw**2)/(c**2)*Wwin(t, e)) + γ**2)**.5
I'm not sure what you're saying here. You do realise that the above is perfectly valid Python 3? The SO question you quote is referring to the fact that identifiers are restricted to (Unicode) *letters* and that symbol characters can't be used as variable names.
All of which is tangential to the question here which is about using Unicode in a *keyword*.
I would personally feel bad about using non-ASCII or even non-English variable names in code. Heck, I feel so bad about non-ASCII in code that I even misspell the à in my last name (Rodolà) and type a' instead. Extending that to a keyword sounds even worse. When Python 3 was cooking I remember there were debates on whether to remove "lambda". It stayed, and I'm glad it did, but IMO that should tell us it's not important enough to deserve the breakage of a rule which has never been broken (non-ASCII for a keyword). -- Giampaolo - http://grodola.blogspot.com
On 15 July 2016 at 09:18, Giampaolo Rodola' <g.rodola@gmail.com> wrote:
On Fri, Jul 15, 2016 at 12:27 AM, Paul Moore <p.f.moore@gmail.com> wrote:
All of which is tangential to the question here which is about using Unicode in a *keyword*.
I would personally feel bad about using non-ASCII or even non-english variable names in code. Heck, I feel so bad about non-ASCII in code that I even mispell the à in my last name (Rodolà) and type a' instead. Extending that to a keyword sounds even worse.
Unicode-as-identifier makes a lot of sense in situations where you have a data-driven API (like a pandas dataframe or collections.namedtuple) and the data you're working with contains Unicode characters. Hence my choice of example in http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-progra... - it's easy to imagine cases where the named tuple attributes are coming from a data source like headers in a CSV file, and in situations like that, folks shouldn't be forced into awkward workarounds just because their data contains non-ASCII characters.
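A sketch of the data-driven case Nick describes, with made-up CSV data whose header happens to contain non-ASCII letters; the field names flow straight from the data into the namedtuple.

import csv
import io
from collections import namedtuple

data = io.StringIO("prénom,größe\nAmélie,1.62\nJürgen,1.80\n")
reader = csv.reader(data)

# Field names come straight from the header row, so restricting identifiers
# to ASCII would force an awkward renaming step here.
Row = namedtuple("Row", next(reader))
for row in map(Row._make, reader):
    print(row.prénom, row.größe)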
When Python 3 was cooking I remember there were debates on whether removing "lambda". It stayed, and I'm glad it did, but IMO that should tell it's not important enough to deserve the breakage of a rule which has never been broken (non-ASCII for a keyword).
This I largely agree with, though. The *one* argument for improvement I see potentially working is the one I advanced back in March when I suggested that adding support for Java's lambda syntax might be worth doing: https://mail.python.org/pipermail/python-ideas/2016-March/038649.html However, any proposals along those lines need to be couched in terms of how they will advance the Python ecosystem as a whole, rather than "I like using lambda expressions in my code, but I don't like the 'lambda' keyword", as we have a couple of decades worth of evidence informing us that the latter isn't sufficient justification for change. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Friday, July 15, 2016 at 2:31:54 PM UTC+5:30, Nick Coghlan wrote:
However, any proposals along those lines need to be couched in terms
of how they will advance the Python ecosystem as a whole, rather than "I like using lambda expressions in my code, but I don't like the 'lambda' keyword", as we have a couple of decades worth of evidence informing us that the latter isn't sufficient justification for change.
As to the importance of lambdas, on a more conceptual level, most people understand that λ-calculus is theoretically important. A currently running discussion may indicate that this is also true at the pragmatic/software-engineering level: http://degoes.net/articles/destroy-all-ifs

At the other end of the spectrum, on the notational/lexical question… On Wednesday, July 13, 2016 at 10:13:39 PM UTC+5:30, David Mertz wrote: I use the vim conceal plugin myself too. It's whimsical, but I like
the appearance of it. So I get the sentiment of the original poster. But in my conceal configuration, I substitute a bunch of characters visually (if the attachment works, and screenshot example of some, but not all will be in this message). And honestly, having my text editor make the substitution is exactly what I want.
which I find very pretty! More in the same direction: http://blog.languager.org/2014/04/unicoded-python.html

Not, of course, to be taken too literally, but rather as a sign that the post-ASCII world is heading in that direction anyway. As for what Nick Coghlan wrote:
Unicode-as-identifier makes a lot of sense in situations
Do consider:
>>> Α = 1
>>> A = 2
>>> Α + 1 == A
True
Can (IMHO) go all the way to https://en.wikipedia.org/wiki/IDN_homograph_attack Discussion on python list at https://mail.python.org/pipermail/python-list/2016-April/706544.html
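A quick way to confirm that the two capital letters in the snippet above really are different characters:

import unicodedata

print(unicodedata.name("Α"))   # GREEK CAPITAL LETTER ALPHA
print(unicodedata.name("A"))   # LATIN CAPITAL LETTER A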
On 18 July 2016 at 13:41, Rustom Mody <rustompmody@gmail.com> wrote:
Do consider:
Α = 1 A = 2 Α + 1 == A True
Can (IMHO) go all the way to https://en.wikipedia.org/wiki/IDN_homograph_attack
Yes, we know - that dramatic increase in the attack surface is why PyPI is still ASCII only, even though full Unicode support is theoretically possible.

It's not a major concern once an attacker already has you running arbitrary code on your system though, as the main problem there is that they're *running arbitrary code on your system*. That means the usability gains easily outweigh the increased obfuscation potential, as worrying about confusable attacks at that point is like worrying about a dripping tap upstairs when the Brisbane River is already flowing through the ground floor of your house :)

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tuesday, July 19, 2016 at 10:20:29 AM UTC+5:30, Nick Coghlan wrote:
On 18 July 2016 at 13:41, Rustom Mody <rusto...@gmail.com> wrote:
Do consider:
Α = 1 A = 2 Α + 1 == A True
Can (IMHO) go all the way to https://en.wikipedia.org/wiki/IDN_homograph_attack
Yes, we know - that dramatic increase in the attack surface is why PyPI is still ASCII only, even though full Unicode support is theoretically possible.
It's not a major concern once an attacker already has you running arbitrary code on your system though, as the main problem there is that they're *running arbitrary code on your system*. That means the usability gains easily outweigh the increased obfuscation potential, as worrying about confusable attacks at that point is like worrying about a dripping tap upstairs when the Brisbane River is already flowing through the ground floor of your house :)
Cheers,
There was this question on the python list a few days ago:

Subject: SyntaxError: Non-ASCII character

Chris Angelico pointed out the offending line:

wf = wave.open(“test.wav”, “rb”)

(should be wf = wave.open("test.wav", "rb") instead)

Since he also said:
The solution may be as simple as running "python3 script.py" rather than "python script.py".
I pointed out that the python2 error was more helpful (to my eyes) than Python 3's.

Python 3:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ariston/foo.py", line 31
    wf = wave.open(“test.wav”, “rb”)
                                   ^
SyntaxError: invalid character in identifier

Python 2:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "foo.py", line 31
SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

IOW:
1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier), and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like Python 2 does.
2. Combine that with: matrix mult (@) is OK to emulate Perl, but going outside ASCII is not.

This makes it seem (to me) that python's unicode support is somewhat wrongheaded.
One solution would be to restrict identifiers to only Unicode characters in appropriate classes. The open quotation mark is in the code class for punctuation, so it doesn't make sense to have it be part of an identifier. http://www.fileformat.info/info/unicode/category/index.htm
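The category data Neil refers to is available from the standard library, so the check he suggests is cheap; a couple of illustrative lookups:

import unicodedata

print(unicodedata.category("“"))   # 'Pi' -- initial-quote punctuation, not a letter
print(unicodedata.category("λ"))   # 'Ll' -- lowercase letter, fine in identifiers
print("“test".isidentifier())      # False
print("λtest".isidentifier())      # True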
On Tuesday, July 19, 2016 at 12:39:04 PM UTC+5:30, Neil Girdhar wrote:
One solution would be to restrict identifiers to only Unicode characters in appropriate classes. The open quotation mark is in the code class for punctuation, so it doesn't make sense to have it be part of an identifier.
Python (3) is doing that all right as far as I can see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers The point is that when a character doesn't fall into those classifications, the error it raises suggests that the lexer is not really Unicode-aware.
Sounds like a bug in the lexer? Or maybe a feature request.
On Tuesday, July 19, 2016 at 12:39:04 PM UTC+5:30, Neil Girdhar wrote:
One solution would be to restrict identifiers to only Unicode characters in appropriate classes. The open quotation mark is in the code class for punctuation, so it doesn't make sense to have it be part of an identifier.
Python (3) is doing that alright as far as I can see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
The point is that when it doesn’t fall in the classification(s) the error it raises suggests that the lexer is not really unicode-aware
On Tuesday, July 19, 2016 at 1:29:35 AM UTC-4, Rustom Mody wrote:
On Tuesday, July 19, 2016 at 10:20:29 AM UTC+5:30, Nick Coghlan wrote:
On 18 July 2016 at 13:41, Rustom Mody <rusto...@gmail.com> wrote:
Do consider:
>> Α = 1 >> A = 2 >> Α + 1 == A True >>
Can (IMHO) go all the way to https://en.wikipedia.org/wiki/IDN_homograph_attack
Yes, we know - that dramatic increase in the attack surface is why PyPI is still ASCII only, even though full Unicode support is theoretically possible.
It's not a major concern once an attacker already has you running arbitrary code on your system though, as the main problem there is that they're *running arbitrary code on your system*. , That means the usability gains easily outweigh the increased obfuscation potential, as worrying about confusable attacks at that point is like worrying about a dripping tap upstairs when the Brisbane River is already flowing through the ground floor of your house :)
Cheers,
There was this question on the python list a few days ago: Subject: SyntaxError: Non-ASCII character
Chris Angelico pointed out the offending line: wf = wave.open(“test.wav”, “rb”) (should be wf = wave.open("test.wav", "rb") instead)
Since he also said:
The solution may be as simple as running "python3 script.py" rather than "python script.py".
I pointed out that the python2 error was more helpful (to my eyes) than python3's
Python3
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ariston/foo.py", line 31 wf = wave.open(“test.wav”, “rb”) ^ SyntaxError: invalid character in identifier
Python2
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "foo.py", line 31 SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
IOW
1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
combine that with
2. matrix mult (@) Ok to emulate perl but not to go outside ASCII
makes it seem (to me) python's unicode support is somewhat wrongheaded.
On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
There was this question on the python list a few days ago: Subject: SyntaxError: Non-ASCII character [...] I pointed out that the python2 error was more helpful (to my eyes) than python3s
And I pointed out how I thought the Python 3 error message could be improved, but the Python 2 error message was not very good.
Python3
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ariston/foo.py", line 31 wf = wave.open(“test.wav”, “rb”) ^ SyntaxError: invalid character in identifier
It would be much more helpful if the caret lined up with the offending character. Better still, if the offending character was actually stated:

wf = wave.open(“test.wav”, “rb”)
               ^
SyntaxError: invalid character '“' in identifier
Python2
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "foo.py", line 31 SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
As I pointed out earlier, this is less helpful. The line itself is not shown (although the line number is given), nor is the offending character. (Python 2 can't show the character because it doesn't know what it is -- it only knows the byte value, not the encoding.) But in the person's text editor, chances are they will see what looks to them like a perfectly reasonable character, and have no idea which is the byte \xe2.
IOW 1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you are right that the lexer considers all otherwise-unexpected characters to be part of an identifier, why is that a problem?

I agree that it is mildly misleading to say

invalid character '“' in identifier

when “ is not part of an identifier:

py> '“test'.isidentifier()
False

but I don't think you can jump from that to your conclusion that Python's unicode support is somewhat "wrongheaded". Surely a much simpler, less inflammatory response would be to say that this one specific error message could be improved?

But... is it REALLY so bad? What if we wrote it like this instead:

py> result = my§function(arg) File "<stdin>", line 1 result = my§function(arg) ^ SyntaxError: invalid character in identifier

Isn't it more reasonable to consider that "my§function" looks like it is intended as an identifier, but it happens to have an illegal character in it?
combine that with 2. matrix mult (@) Ok to emulate perl but not to go outside ASCII
How does @ emulate Perl? As for your second part, about not going outside of ASCII, yes, that is official policy for Python operators, keywords and builtins.
makes it seem (to me) python's unicode support is somewhat wrongheaded.
-- Steve
On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
There was this question on the python list a few days ago: Subject: SyntaxError: Non-ASCII character [...] I pointed out that the python2 error was more helpful (to my eyes) than python3s
And I pointed out how I thought the Python 3 error message could be improved, but the Python 2 error message was not very good.
Python3
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ariston/foo.py", line 31 wf = wave.open(“test.wav”, “rb”) ^ SyntaxError: invalid character in identifier
It would be much more helpful if the caret lined up with the offending character. Better still, if the offending character was actually stated:
wf = wave.open(“test.wav”, “rb”)
               ^
SyntaxError: invalid character '“' in identifier
Python2
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "foo.py", line 31 SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
As I pointed out earlier, this is less helpful. The line itself is not shown (although the line number is given), nor is the offending character. (Python 2 can't show the character because it doesn't know what it is -- it only knows the byte value, not the encoding.) But in the person's text editor, chances are they will see what looks to them like a perfectly reasonable character, and have no idea which is the byte \xe2.
IOW 1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you are right that the lexer considers all otherwise-unexpected characters to be part of an identifier, why is that a problem?
It's a problem because those characters could never be part of an identifier. So it seems like a bug.
I agree that it is mildly misleading to say
invalid character '“' in identifier
when “ is not part of an identifier:
py> '“test'.isidentifier() False
but I don't think you can jump from that to your conclusion that Python's unicode support is somewhat "wrongheaded". Surely a much simpler, less inflammatory response would be to say that this one specific error message could be improved?
But... is it REALLY so bad? What if we wrote it like this instead:
py> result = my§function(arg) File "<stdin>", line 1 result = my§function(arg) ^ SyntaxError: invalid character in identifier
Isn't it more reasonable to consider that "my§function" looks like it is intended as an identifier, but it happens to have an illegal character in it?
combine that with 2. matrix mult (@) Ok to emulate perl but not to go outside ASCII
How does @ emulate Perl?
As for your second part, about not going outside of ASCII, yes, that is official policy for Python operators, keywords and builtins.
makes it seem (to me) python's unicode support is somewhat wrongheaded.
-- Steve
On Tuesday, July 19, 2016 at 5:06:17 PM UTC+5:30, Neil Girdhar wrote:
On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano wrote:
On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
IOW 1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you are right that the lexer considers all otherwise-unexpected characters to be part of an identifier, why is that a problem?
It's a problem because those characters could never be part of an identifier. So it seems like a bug.
An armchair-design solution would say: We should give the most appropriate answer for every possible unicode character category. This would need to take all the Unicode character-categories and Python lexical-categories and 'cross-product' them — a humongous task to little advantage.

A more practical solution would be to take the best of the python2 and python3 current approaches: "Invalid character XX in line YY" and just reveal nothing about what lexical category — like identifier — python thinks the char is coming in. The XX is like python2 and the YY like python3. If it can do better than '\xe2' — ie a codepoint — that’s a bonus but not strictly necessary
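As a rough illustration of what that message could look like, here is a sketch layered on top of compile() and the existing SyntaxError rather than an actual lexer change; the helper name and the heuristic for picking out the offending character are mine, purely for illustration:

def invalid_char_message(source):
    # Heuristic sketch: report the first character that could not appear
    # anywhere in an identifier, without naming any lexical category.
    try:
        compile(source, "<test>", "exec")
        return None
    except SyntaxError as err:
        line = source.splitlines()[err.lineno - 1]
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127 and not ("a" + ch).isidentifier():
                return "Invalid character {!r} (U+{:04X}) in line {}, column {}".format(
                    ch, ord(ch), err.lineno, col)
        return str(err)

print(invalid_char_message('wf = wave.open(“test.wav”, “rb”)\n'))
# Invalid character '“' (U+201C) in line 1, column 16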
On Tue, Jul 19, 2016 at 8:18 AM Rustom Mody <rustompmody@gmail.com> wrote:
On Tuesday, July 19, 2016 at 5:06:17 PM UTC+5:30, Neil Girdhar wrote:
On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano wrote:
On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
IOW 1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you are right that the lexer considers all otherwise-unexpected characters to be part of an identifier, why is that a problem?
It's a problem because those characters could never be part of an identifier. So it seems like a bug.
An armchair-design solution would say: We should give the most appropriate answer for every possible unicode character category This would need to take all the Unicode character-categories and Python lexical-categories and 'cross-product' them — a humongous task to little advantage
I don't see why this is a "humongous task". Anyway, your solution boils down to the simplest fix in the lexer which is to block some characters from matching any category, does it not?
A more practical solution would be to take the best of the python2 and python3 current approaches: "Invalid character XX in line YY" and just reveal nothing about what lexical category — like identifier — python thinks the char is coming in.
The XX is like python2 and the YY like python3 If it can do better than '\xe2' — ie a codepoint — that’s a bonus but not strictly necessary
On Tuesday, July 19, 2016 at 7:41:38 PM UTC+5:30, Neil Girdhar wrote:
On Tue, Jul 19, 2016 at 8:18 AM Rustom Mody wrote:
On Tuesday, July 19, 2016 at 5:06:17 PM UTC+5:30, Neil Girdhar wrote:
On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano wrote:
On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
IOW 1. The lexer is internally (evidently from the error message) so ASCII-oriented that any “unicode-junk” just defaults out to identifiers (presumably comments are dealt with earlier) and then if that lexing action fails it mistakenly pinpoints a wrong *identifier* rather than just an impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you are right that the lexer considers all otherwise-unexpected characters to be part of an identifier, why is that a problem?
It's a problem because those characters could never be part of an identifier. So it seems like a bug.
An armchair-design solution would say: We should give the most appropriate answer for every possible unicode character category This would need to take all the Unicode character-categories and Python lexical-categories and 'cross-product' them — a humongous task to little advantage
I don't see why this is a "humongous task". Anyway, your solution boils down to the simplest fix in the lexer which is to block some characters from matching any category, does it not?
Block? Not sure what you mean… Nothing should change (in the simplest solution at least) apart from better error messages.

My suggested solution involved this: Currently the lexer — basically an automaton — reveals which state it's in when it throws an error involving "identifier". Suggested change:

if in_ident_state:
    if current_char is allowable as ident_char:
        continue as before
    elif current_char is ASCII:
        Usual error
    else:
        throw error eliding the "in_ident state"
else:
    as is...

BTW after last post I tried some things and found other unsatisfactory (to me) behavior in this area; to wit:
x = 0o19 File "<stdin>", line 1 x = 0o19 ^ SyntaxError: invalid syntax
Of course the 9 cannot come in an octal constant but "Syntax Error"?? Seems a little over general.

My preferred fix: make a LexicalError sub exception to SyntaxError. Rest should follow for both.

Disclaimer: I am a teacher and having a LexicalError category makes it nice to explain some core concepts. However I understand there are obviously other more pressing priorities than to make python superlative as a CS-teaching language :-)
On Tue, Jul 19, 2016 at 07:40:42AM -0700, Rustom Mody wrote:
My suggested solution involved this: Currently the lexer — basically an automaton — reveals which state its in when it throws error involving "identifier" Suggested change:
if in_ident_state:
    if current_char is allowable as ident_char:
        continue as before
    elif current_char is ASCII:
        Usual error
    else:
        throw error eliding the "in_ident state"
else:
    as is...
I'm sorry, you've lost me. Is this pseudo-code (1) of the current CPython lexer, (2) what you imagine the current CPython lexer does, or (3) what you think it should do? Because you call it a "change", but you're only showing one state, so it's not clear if it's the beginning or ending state. Basically I guess what I'm saying is that if you are suggesting a concrete change to the lexer, you should be more precise about what needs to actually change.
BTW after last post I tried some things and found other unsatisfactory (to me) behavior in this area; to wit:
x = 0o19 File "<stdin>", line 1 x = 0o19 ^ SyntaxError: invalid syntax
Of course the 9 cannot come in an octal constant but "Syntax Error"?? Seems a little over general
My preferred fix: make a LexicalError sub exception to SyntaxError
What's the difference between a LexicalError and a SyntaxError? Under what circumstances is it important to distinguish between them?

It would be nice to have a more descriptive error message, but why should I care whether the invalid syntax "0o19" is caught by a lexer or a parser or the byte-code generator or the peephole optimizer or something else? Really all I need to care about is:

- it is invalid syntax;
- why it is invalid syntax (9 is not a legal octal digit);
- and preferably, that it is caught at compile-time rather than run-time.

-- Steve
On 20 July 2016 at 00:40, Rustom Mody <rustompmody@gmail.com> wrote:
Disclaimer: I am a teacher and having a LexicalError category makes it nice to explain some core concepts However I understand there are obviously other more pressing priorities than to make python superlative as a CS-teaching language :-)
Given the motives of some of the volunteers in the community, "I am a teacher, the current error confuses my students, and I'm willing to discuss proposed alternative error messages with them to see if they're improvements" can actually be a good way to get people interested in helping out :) The reason that can help is that the main problem with "improving" error messages, is that it can be really hard to tell whether the improvements are actually improvements or not (in some cases it's obvious, but in others it's hard to decide when you've reached a point of "good enough", so you throw up your hands and say "Eh, it's at least distinctive enough that people will be able to find it on Stack Overflow"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, Jul 20, 2016 at 1:03 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The reason that can help is that the main problem with "improving" error messages, is that it can be really hard to tell whether the improvements are actually improvements or not (in some cases it's obvious, but in others it's hard to decide when you've reached a point of "good enough", so you throw up your hands and say "Eh, it's at least distinctive enough that people will be able to find it on Stack Overflow").
Plus, there are all sorts of errors that look very different to humans, but identical to the parser. Steven showed us an example where an invalid character looked like it belonged in the identifier, yet to the parser it's just "this Unicode category isn't valid here", same as the bad-quote one. In a *very* few situations, a single error is common enough to be worth special-casing (eg print without parentheses, in recent Pythons), but otherwise, Stack Overflow or python-list will do a far better job of diagnosis than the lexer ever can. Ultimately, syntax errors represent an error in translation from a human's concept to the text that's given to the interpreter, and trying to reconstruct the concept from the errant text is better done by a human than a computer. Obviously it's great when the computer can read your mind, but it's never going to be perfect. ChrisA
On 2016-07-19 16:13, Chris Angelico wrote:
On Wed, Jul 20, 2016 at 1:03 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The reason that can help is that the main problem with "improving" error messages, is that it can be really hard to tell whether the improvements are actually improvements or not (in some cases it's obvious, but in others it's hard to decide when you've reached a point of "good enough", so you throw up your hands and say "Eh, it's at least distinctive enough that people will be able to find it on Stack Overflow").
Plus, there are all sorts of errors that look very different to humans, but identical to the parser. Steven showed us an example where an invalid character looked like it belonged in the identifier, yet to the parser it's just "this Unicode category isn't valid here", same as the bad-quote one. In a *very* few situations, a single error is common enough to be worth special-casing (eg print without parentheses, in recent Pythons), but otherwise, Stack Overflow or python-list will do a far better job of diagnosis than the lexer ever can. Ultimately, syntax errors represent an error in translation from a human's concept to the text that's given to the interpreter, and trying to reconstruct the concept from the errant text is better done by a human than a computer. Obviously it's great when the computer can read your mind, but it's never going to be perfect.
Unicode has a couple of properties called "ID_Start" and "ID_Continue". The codepoint '“' doesn't match either of them, which is a good hint that Python shouldn't really be saying "invalid character in identifier" (it's the first character, but it can't be part of an identifier).
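As far as I know there is no direct ID_Start/ID_Continue lookup in the stdlib, but str.isidentifier() can serve as a rough stand-in for checking both positions (a small sketch; Python actually uses the XID_* variants plus NFKC normalization):

>>> ch = '“'
>>> ch.isidentifier()          # could it start an identifier?
False
>>> ('a' + ch).isidentifier()  # could it continue one?
False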
Nick Coghlan writes:
The reason that can help is that the main problem with "improving" error messages, is that it can be really hard to tell whether the improvements are actually improvements or not
Personally, I think the real issue here is that the curly quote (and things like mathematical PRIME character) are easily confused with Python syntax and it all looks like grit on Tim's monitor. I tried substituting an emoticon and the DOUBLE INTEGRAL, and it was quite obvious what was wrong from the Python 3 error message.<wink/>

However, in this case, as far as I can tell from the error messages induced by playing with ASCII, Python 3.5 thinks that all non-identifier ASCII characters are syntactic (so for example it says that

with open($file.txt") as f:

is "invalid syntax"). But for non-ASCII characters (I guess including the Latin 1 set?) they are either letters, numerals, or just plain not valid in a Python program AIUI (outside of strings and comments, of course).

I would think the lexer could just treat each invalid character as an invalid_token, which is always invalid in Python syntax, and the error would be a SyntaxError with the message formatted something like

"invalid character {} = U+{:04X}".format(ch, ord(ch))

This should avoid the strange placement of the position indicator, too.

If someday we decide to use a non-ASCII character for a syntactic purpose, that's a big enough compatibility break in itself that changing the invalid character set (and thus the definition of invalid_token) is insignificant.

I'm pretty sure this is what a couple of earlier posters have in mind, too.
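For the curly-quote example that started the thread, that format would produce something like this (a quick check, with ch standing for the offending character):

>>> ch = '“'
>>> "invalid character {} = U+{:04X}".format(ch, ord(ch))
'invalid character “ = U+201C'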
1. Using SyntaxError for lexical errors sounds as strange as saying a misspell/typo is a syntax mistake in a natural language. A new "LexicalError" or "TokenizerError" for that makes sense. Perhaps both this new exception and SyntaxError should inherit from a new CompileError class. But the SyntaxError is already covering cases alike with the TabError (an IndentationError), which is a lexical analysis error, not a parser one [1]. To avoid such changes while keeping the name, at least the SyntaxError docstring should be "Compile-time error." instead of "Invalid Syntax.", and the documentation should be explicit that it isn't only about parsing/syntax/grammar but also about lexical analysis errors.

2. About those lexical error messages, the caret is worse than the lack of it when it's not aligned, but unless I'm missing something, one can't guarantee that the terminal is printing the error message with the right encoding. Including the row and column numbers in the message would be helpful.

3. There are people who like and use unicode chars in identifiers. Usually I don't like to translate comments/identifiers to another language, but I did so myself, using variable names with accents in Portuguese for a talk [2], mostly to give it a try. Surprisingly, few people noticed that until I said. The same can be said about Sympy scripts, where symbols like Greek letters would be meaningful (e.g. μ for the mean, σ for the standard deviation and Σ for the covariance matrix), so I'd argue it's quite natural.

4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough for a distinction, and when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.

5. New keywords, no matter whether they fit into the 7-bit ASCII or requires Unicode, unavoidably breaks backwards compatibility at least to some degree. That happened with the "nonlocal" keyword in Python 3, for example.

6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having Unicode keywords is merely contingent on Python 2 behavior that emphasized ASCII-only code (besides comments and strings).

7. The discussion isn't about lambda or anti-lambda bias, it's about keyword naming and readability. Who gains/loses with that resource? It won't hurt those who never uses lambda and never uses Unicode identifiers. Perhaps Sympy users would feel harmed by that, as well as other scientific packages users, but looking for the "λ" char in GitHub I found no one using it alone within Python code. The online Python books written in Greek that I found were using only English identifiers.

8. I don't know if any consensus can emerge in this matter about lambdas, but there's another subject that can be discussed together: macros. What OP wants is exactly a "#define λ lambda", which would be only in the code that uses/needs such symbol with that meaning. A minimal lexical macro that just apply a single keyword token replacement by a identifier-like token is enough for him. I don't know a nice way to do that, something like "from __replace__ import lambda_to_λ" or even "def λ is lambda" would avoid new keywords, but I also don't know how desired this resource is (perhaps to translate the language keywords to another language?).
7. I really don't like the editor "magic", it would be better to create a packaging/setup.py translation script than that (something like 2to3). It's not about coloring/highlighting, nor about editors/IDEs features, it's about seeing the object/file itself, and colors never change that AFAIK. Also, most code I read isn't using my editor, sometimes it comes from cat/diff (terminal stdout output), vim/gedit/pluma (editor), GitHub/BitBucket (web), blogs/forums/e-mails, gitk, Spyder (IDE), etc.. That kind of "view" replacement would compromise some code alignment (e.g. multiline strings/comments) and line length, besides being a problem to look for code with tools like find + grep/sed/awk (which I use all the time). Still worse are the git hooks to perform the replacement before/after a commit: how should one test a code that uses that? It somehow feels out of control.

[1] https://docs.python.org/3/reference/lexical_analysis.html
[2] http://www.slideshare.net/djsbellini/20140416-garoa-hc-strategy

2016-07-20 13:44 GMT-03:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
Nick Coghlan writes:
The reason that can help is that the main problem with "improving" error messages, is that it can be really hard to tell whether the improvements are actually improvements or not
Personally, I think the real issue here is that the curly quote (and things like mathematical PRIME character) are easily confused with Python syntax and it all looks like grit on Tim's monitor. I tried substituting an emoticon and the DOUBLE INTEGRAL, and it was quite obvious what was wrong from the Python 3 error message.<wink/>
However, in this case, as far as I can tell from the error messages induced by playing with ASCII, Python 3.5 thinks that all non- identifier ASCII characters are syntactic (so for example it says that
with open($file.txt") as f:
is "invalid syntax"). But for non-ASCII characters (I guess including the Latin 1 set?) they are either letters, numerals, or just plain not valid in a Python program AIUI (outside of strings and comments, of course).
I would think the lexer could just treat each invalid character as an invalid_token, which is always invalid in Python syntax, and the error would be a SyntaxError with the message formatted something like
"invalid character {} = U+{:04X}".format(ch, ord(ch))
This should avoid the strange placement of the position indicator, too.
If someday we decide to use an non-ASCII character for a syntactic purpose, that's a big enough compatibility break in itself that changing the invalid character set (and thus the definition of invalid_token) is insignificant.
I'm pretty sure this is what a couple of earlier posters have in mind, too.
-- Danilo J. S. Bellini --------------- "*It is not our business to set up prohibitions, but to arrive at conventions.*" (R. Carnap)
On 7/20/16, Danilo J. S. Bellini <danilo.bellini@gmail.com> wrote:
4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough for a distinction, and when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.
I am not sure what do you like to say, so for sure some info: PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
From this point of view all sigmas are same:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
2016-07-21 1:53 GMT-03:00 Pavol Lisy <pavol.lisy@gmail.com>:
On 7/20/16, Danilo J. S. Bellini <danilo.bellini@gmail.com> wrote:
4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough for a distinction, and when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.
I am not sure what do you like to say, so for sure some info:
PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
From this point of view all sigmas are same:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
In this item I just said that most programmers would probably keep the same character in a source code file due to copying and pasting, and that even when it doesn't happen (the copy-and-paste action), visual differences like italic/bold/serif are enough for one to notice (when using another input method). At first, I was thinking on a code with one of those symbols as a variable name (any of them), but PEP3131 challenges that. Actually, any conversion to a normal form means that one should never use unicode identifiers outside the chosen normal form. It would be better to raise an error instead of converting. If there isn't any lint tool already complaining about that, I strongly believe that's something that should be done. When mixing strings and identifier names, that's not so predictable:
obj = type("SomeClass", (object,), {c: i for i, c in enumerate("Σ𝚺𝛴𝜮𝝨𝞢")})() obj.𝞢 == getattr(obj, "𝞢") False obj.Σ == getattr(obj, "Σ") True dir(obj) [..., 'Σ', '𝚺', '𝛴', '𝜮', '𝝨', '𝞢']
-- Danilo J. S. Bellini --------------- "*It is not our business to set up prohibitions, but to arrive at conventions.*" (R. Carnap)
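A minimal sketch of the kind of lint check Danilo mentions could be built on the tokenize module, which (unlike the compiler) still sees the raw spellings before NFKC normalization; the function name and output format here are made up for illustration:

import io
import tokenize
import unicodedata

def nfkc_collisions(source):
    # Group identifier spellings by their NFKC normal form and report any
    # normal form that is spelled more than one way in the source text.
    spellings = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            norm = unicodedata.normalize("NFKC", tok.string)
            spellings.setdefault(norm, set()).add(tok.string)
    return {norm: forms for norm, forms in spellings.items() if len(forms) > 1}

print(nfkc_collisions("Σ = 1\n𝚺 = Σ + 1\n"))
# {'Σ': {'Σ', '𝚺'}} (set order may vary) -- both spellings name the same variable

Note that this only complements a confusables check: Cyrillic А and Latin A do not normalize to the same thing, so a collision report like this would not catch them.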
On Thursday, July 21, 2016 at 11:45:11 AM UTC+5:30, Danilo J. S. Bellini wrote:
2016-07-21 1:53 GMT-03:00 Pavol Lisy <pavol...@gmail.com <javascript:>>:
On 7/20/16, Danilo J. S. Bellini <danilo....@gmail.com <javascript:>> wrote:
4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough for a distinction, and when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.
I am not sure what do you like to say, so for sure some info:
PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
From this point of view all sigmas are same:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
In this item I just said that most programmers would probably keep the same character in a source code file due to copying and pasting, and that even when it doesn't happen (the copy-and-paste action), visual differences like italic/bold/serif are enough for one to notice (when using another input method).
At first, I was thinking on a code with one of those symbols as a variable name (any of them), but PEP3131 challenges that. Actually, any conversion to a normal form means that one should never use unicode identifiers outside the chosen normal form. It would be better to raise an error instead of converting.
Yes, agreed. I said “Nice!” for
>>> Σ = 1
>>> 𝚺 = Σ + 1
>>> 𝛴
2
in comparison to:
А = 1 A = A + 1
because the A's look more indistinguishable than the sigmas and are internally more distinct.

If the choice is to simply disallow the confusables that’s probably the best choice.

IOW
1. Disallow co-existence of confusables (in identifiers)
2. Identify confusables to a normal form — like case-insensitive comparison and like NKFC
3. Leave the confusables to confuse

My choice 1 better than 2 better than 3
On Thu, Jul 21, 2016 at 4:26 PM, Rustom Mody <rustompmody@gmail.com> wrote:
IOW 1. Disallow co-existence of confusables (in identifiers) 2. Identify confusables to a normal form — like case-insensitive comparison and like NKFC 3. Leave the confusables to confuse
My choice 1 better than 2 better than 3
So should we disable the lowercase 'l', the uppercase 'I', and the digit '1', because they can be confused? What about the confusability of "m" and "rn"? O and 0 are similar in some fonts. And case insensitivity brings its own problems - is "ss" equivalent to "ß", and is "ẞ" equivalent to either? Turkish distinguishes between "i", which upper-cases to "İ", and "ı", which upper-cases to "I". We already have interminable debates about letter similarities across scripts. I'm sure everyone agrees that Cyrillic "и" is not the same letter as Latin "i", but we have "AАΑ" in three different scripts. Should they be considered equivalent? I think not, because in any non-trivial context, you'll know whether the program's been written in Greek, a Slavic language, or something using the Latin script. But maybe you disagree. Okay; are "BВΒ" all to be considered equivalent too? What about "СC"? "XХΧᚷ"? They're visually similar, but they're not equivalent in any other way. And if you're going to say things should be considered equivalent solely on the basis of visuals, you get into a minefield - should U+200B ZERO WIDTH SPACE be completely ignored, allowing "AB" to be equivalent to "A\u200bB" as an identifier? This debate should probably continue on python-list (if anywhere). I doubt Python is going to change its normalization rules any time soon, and if it does, it'll need a very solid reason (and probably a PEP with all the pros and cons). ChrisA
On Thursday, July 21, 2016 at 12:51:27 PM UTC+5:30, Chris Angelico wrote:
On Thu, Jul 21, 2016 at 4:26 PM, Rustom Mody <rusto...@gmail.com <javascript:>> wrote:
IOW 1. Disallow co-existence of confusables (in identifiers) 2. Identify confusables to a normal form — like case-insensitive comparison and like NKFC 3. Leave the confusables to confuse
My choice 1 better than 2 better than 3
So should we disable the lowercase 'l', the uppercase 'I', and the digit '1', because they can be confused? What about the confusability of "m" and "rn"? O and 0 are similar in some fonts. And case insensitivity brings its own problems - is "ss" equivalent to "ß", and is "ẞ" equivalent to either? Turkish distinguishes between "i", which upper-cases to "İ", and "ı", which upper-cases to "I".
We already have interminable debates about letter similarities across scripts. I'm sure everyone agrees that Cyrillic "и" is not the same letter as Latin "i", but we have "AАΑ" in three different scripts. Should they be considered equivalent? I think not, because in any non-trivial context, you'll know whether the program's been written in Greek, a Slavic language, or something using the Latin script. But maybe you disagree. Okay; are "BВΒ" all to be considered equivalent too? What about "СC"? "XХΧᚷ"? They're visually similar, but they're not equivalent in any other way. And if you're going to say things should be considered equivalent solely on the basis of visuals, you get into a minefield - should U+200B ZERO WIDTH SPACE be completely ignored, allowing "AB" to be equivalent to "A\u200bB" as an identifier?
I said 1 better than 2 better than 3. Maybe you also want to add:

Special cases aren't special enough to break the rules.
Although practicality beats purity.

followed by

Errors should never pass silently.

IOW setting out 1 better than 2 better than 3 does not necessarily imply it's completely achievable
On Thu, Jul 21, 2016 at 5:47 PM, Rustom Mody <rustompmody@gmail.com> wrote:
On Thu, Jul 21, 2016 at 4:26 PM, Rustom Mody <rusto...@gmail.com> wrote:
IOW 1. Disallow co-existence of confusables (in identifiers) 2. Identify confusables to a normal form — like case-insensitive comparison and like NKFC 3. Leave the confusables to confuse
My choice 1 better than 2 better than 3
So should we disable the lowercase 'l', the uppercase 'I', and the digit '1', because they can be confused? What about the confusability of "m" and "rn"? O and 0 are similar in some fonts. And case insensitivity brings its own problems - is "ss" equivalent to "ß", and is "ẞ" equivalent to either? Turkish distinguishes between "i", which upper-cases to "İ", and "ı", which upper-cases to "I".
We already have interminable debates about letter similarities across scripts. I'm sure everyone agrees that Cyrillic "и" is not the same letter as Latin "i", but we have "AАΑ" in three different scripts. Should they be considered equivalent? I think not, because in any non-trivial context, you'll know whether the program's been written in Greek, a Slavic language, or something using the Latin script. But maybe you disagree. Okay; are "BВΒ" all to be considered equivalent too? What about "СC"? "XХΧᚷ"? They're visually similar, but they're not equivalent in any other way. And if you're going to say things should be considered equivalent solely on the basis of visuals, you get into a minefield - should U+200B ZERO WIDTH SPACE be completely ignored, allowing "AB" to be equivalent to "A\u200bB" as an identifier?
I said 1 better than 2 better than 3 Maybe you also want to add:
Special cases aren't special enough to break the rules. Although practicality beats purity.
followed by
Errors should never pass silently.
IOW setting out 1 better than 2 better than 3 does not necessarily imply its completely achievable
No; I'm not saying that. I'm completely disagreeing with #1's value. I don't think the language interpreter should concern itself with visually-confusing identifiers. Unicode normalization is about *equivalent characters*, not confusability, and I think that's as far as Python should go. ChrisA
On 21 July 2016 at 08:54, Chris Angelico <rosuav@gmail.com> wrote:
No; I'm not saying that. I'm completely disagreeing with #1's value. I don't think the language interpreter should concern itself with visually-confusing identifiers. Unicode normalization is about *equivalent characters*, not confusability, and I think that's as far as Python should go.
+1. There are plenty of ways of writing programs that don't do what the reader expects. (Non-malicious) people writing Python code shouldn't be using visually ambiguous identifiers. People running code they don't trust should check it (and yes, "are there any non-ASCII/confusable characters used in identifiers" is one check they should make, among many). Avoiding common mistakes is a good thing. But that's about as far as we should go.

On that note, though, "smart quotes" do find their way into code, usually via cut and paste from documents in tools like MS Word that "helpfully" change straight quotes to smart ones. So it *may* be worth special casing a check for smart quotes in identifiers, and reporting something like "was this meant to be a string, but you accidentally used smart quotes"? On the other hand, people who routinely copy code samples from sources that mangle the quotes are probably used to errors of this nature, and know what went wrong even if the error is unclear. After all, I don't know *any* language that explicitly checks for this specific error.

Personally, I don't think that the effort required is justified by the minimal benefit. But if someone were to be inclined to make that effort and produce a patch, an argument along the above lines (catching a common cut and paste error) might be sufficient to persuade someone to commit it.

Paul
On Wed, Jul 20, 2016 at 11:26:58PM -0700, Rustom Mody wrote:
А = 1 A = A + 1
because the A's look more indistinguishable than the sigmas and are internally more distinct If the choice is to simply disallow the confusables that’s probably the best choice
IOW 1. Disallow co-existence of confusables (in identifiers)
That would require disallowing 1 l and I, as well as O and 0. Or are you, after telling us off for taking an ASCII-centric perspective, going to exempt ASCII confusables? In a dynamic language like Python, how do you prohibit these confusables? Every time Python does a name binding operation, is it supposed to search the entire namespace for potential confusables? That's going to be awful expensive. Confusables are a real problem in URLs, because they can be used for phishing attacks. While even the most tech-savvy user is vulnerable, it is especially the *least* savvy users who are at risk, which makes it all the more important to protect against confusables in URLs. But in programming code? Your demonstration with the Latin A and the Greek alpha Α or Cyrillic А is just a party trick. In a world where most developers do something like: pip install randompackage python -m randompackage without ever once looking at the source code, I think we have bigger problems. Or rather, even the bigger problems are not that big. If you're worried about confusables, there are alternatives other than banning them: your editor or linter might highlight them. Or rather than syntax highlighting, perhaps editors should use *semantic highlighting* and colour-code variables: https://medium.com/@evnbr/coding-in-color-3a6db2743a1e in which case your A and A will be highlighted in completely different colours, completely ruining the trick. (Aside: this may also help with the "oops I misspelled my variable and the compiler didn't complain" problem. If "self.dashes" is green and "self.dahses" is blue, you're more likely to notice the typo.) -- Steve
On 7/21/16, Danilo J. S. Bellini <danilo.bellini@gmail.com> wrote:
2016-07-21 1:53 GMT-03:00 Pavol Lisy <pavol.lisy@gmail.com>:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
In this item I just said that most programmers would probably keep the same character in a source code file due to copying and pasting, and that even when it doesn't happen (the copy-and-paste action), visual differences like italic/bold/serif are enough for one to notice (when using another input method).
At first, I was thinking on a code with one of those symbols as a variable name (any of them), but PEP3131 challenges that. Actually, any conversion to a normal form means that one should never use unicode identifiers outside the chosen normal form. It would be better to raise an error instead of converting. If there isn't any lint tool already complaining about that, I strongly believe that's something that should be done. When mixing strings and identifier names, that's not so predictable:
obj = type("SomeClass", (object,), {c: i for i, c in enumerate("Σ𝚺𝛴𝜮𝝨𝞢")})() obj.𝞢 == getattr(obj, "𝞢") False obj.Σ == getattr(obj, "Σ") True dir(obj) [..., 'Σ', '𝚺', '𝛴', '𝜮', '𝝨', '𝞢']
[getattr(obj, i) for i in dir(obj) if i in "Σ𝚺𝛴𝜮𝝨𝞢"]
# [0, 1, 2, 3, 4, 5]
but:
[obj.Σ, obj.𝚺, obj.𝛴, obj.𝜮, obj.𝝨, obj.𝞢, ]
# [0, 0, 0, 0, 0, 0]

So you could mix any of them while editing identifiers. (but you could not mix them while writing parameters in getattr, setattr and type)

But getattr, setattr and type are other beasts, because they can use "non identifiers", non letter characters too:

setattr(obj,'+', 7)
dir(obj)
# ['+', ...]
# but obj.+ is syntax error
setattr(obj,u"\udcb4", 7)
dir(obj)
# [..., '\udcb4' ,...]
obj = type("SomeClass", (object,), {c: i for i, c in enumerate("+-*/")})()

Maybe there is still some Babel curse here and some sort of normalize_dir, normalize_getattr, normalize_setattr, normalize_type could help? I am not sure. They probably make things more complicated than simpler.
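If someone did want the dynamic APIs to behave like source-code attribute access, a tiny wrapper along the lines named above (normalize_getattr and friends) is easy to sketch (purely hypothetical helpers, not anything that exists today):

import unicodedata

def normalize_getattr(obj, name, *default):
    # Apply the same NFKC normalization to the attribute name that the
    # compiler applies to identifiers written literally in source code.
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)

def normalize_setattr(obj, name, value):
    setattr(obj, unicodedata.normalize("NFKC", name), value)

With these, normalize_getattr(obj, "𝞢") returns the same attribute as obj.𝞢 does, rather than the separately stored '𝞢' key.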
On Thursday, July 21, 2016 at 10:24:42 AM UTC+5:30, Pavol Lisy wrote:
On 7/20/16, Danilo J. S. Bellini <danilo....@gmail.com <javascript:>> wrote:
4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough for a distinction, and when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.
I am not sure what do you like to say, so for sure some info:
PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
From this point of view all sigmas are same:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
Nice!
>>> Σ = 1
>>> 𝚺 = Σ + 1
>>> 𝛴
2
But not enough
>>> А = 1
>>> A = A + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'A' is not defined
Moral: The coarser the equivalence-relation the better (within reasonable limits!)

NFKC-equality is coarser than literal-codepoint equality. ∴ Better
But not coarse enough. After all identifiers are meant to identify!
Danilo J. S. Bellini writes:
1. Using SyntaxError for lexical errors sounds as strange as saying a misspell/typo is a syntax mistake in a natural language.
Well, I find that many typos are discovered even though they look like (and often enough are) real words, with unacceptable semantics (sometimes even the same part of speech). So I don't find that analogy at all compelling -- human recognition of typos is far more complex than computer recognition of parse errors. And the Python lexer is very simple, even among translators. It creates tokens for operators which are more or less self-delimiting, indentation, strings, and failing that sequences of characters delimited by spaces, newlines, and operators. Token recognition is now complete. For tokens of as-yet unknown type, it then checks whether the token is a keyword, if not, is it a number. If not, in a syntactically correct program, what's left is an identifier (and I suppose that's why this error message says "identifier", and why it points to the end of the token, not the "bad" character). It then checks the putative identifier and discovers that the token isn't well-formed as an identifier. I think it's a very good idea to keep this tokenization process simple. So in my proposal, it's intentionally not a lexical error, but rather a new kind of self-delimiting token (with no syntactic role in correct programs). A lexical error means that the translator failed to construct (including identifying the syntactic role) a token. That's very bad. Theoretically speaking, that means all bets are off, who knows what the rest of the program might mean? Pragmatically, you can use heuristics to generate error messages and reset the lexer to an "appropriate" state, but as Nick points out, those heuristics are unreliable and may do more harm than good, and it's not clear what the appropriate reset state is. Making an invalid_token (perhaps a better name for current purposes would be invalid_character_token) means that there are no lexical errors (except for UnicodeErrors, but they are "below" the level of the language definition). This is consistent with current Python practice for pure ASCII programs:
a$b File "<stdin>", line 1 a$b ^ SyntaxError: invalid syntax
Note that the caret is in the right place, so '$' is being treated as an operator. (The same happens with '?', the other non-identifier non-control ASCII character without specified semantics.) The advantage is that the tokenized program has much more structure, and much more restricted valid structure that it can match (correct positioning of the caret is an immediate benefit, see below), than an untokenized string (remember, it's already known to contain errors). Of course you could implicitly do the same thing at the lexical level, but "explicit is better than implicit". Since we're trying to reason about invalid programs, the motivation is heuristic either way, but an explicit definition of invalid_token means that the processing by the translator is easier to understand, and it would restrict the ways that handling of this error could change in the future. I consider that restriction to be a good thing in this context, YMMV.
2. About those lexical error messages, the caret is worse than the lack of it when it's not aligned, but unless I'm missing something, one can't guarantee that the terminal is printing the error message with the right encoding.
But it will print the character in the erroneous line and that character in the error message the same way, which should be enough (certainly will be enough in the "curly quotes" example). To identify the exact character that Python is concerned with (regardless of whether the glyphs in the error message are what the user sees in her editor) the Unicode scalar (or even the Unicode name, but that requires importing the Unicode character database which might be undesirable) is included.
Including the row and column numbers in the message would be helpful.
The line number is already there, the current tokenization process will set the column number to the place where the caret is. My proposal fixes this automatically without requiring Python to do more analysis than "end of token", which it already knows.
6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having Unicode keywords is merely contingent on Python 2 behavior that emphasized ASCII-only code (besides comments and strings).
It's more than that. For better or worse, English is the natural language source for Python keywords (even "elif" is a contraction, and feels natural to this native speaker), and I can think of no variant of English where (plausible) candidate keywords can't be spelled with ASCII. "lambda" itself is the only plausible exception as far as I know, and even there "lambda calculus" is perfectly good English now.
7. I really don't like the editor "magic", it would be better to create a packaging/setup.py translation script than that (something like 2to3).
2to3 can be used for this purpose, it's quite flexible about the rulesets that can be defined and specified. But note that that implies that adding this capability to the stdlib would fork the language within the CPython implemention, just as Python 3 is a fork from Python 2. That sounds like a bad idea to me -- some people have always complained that porting to Python 3 is almost like learning a new language, many people are already complaining that Python 3 is getting bigger than they like, and it would impose a burden on other implementations.
Still worse are the git hooks to perform the replacement before/after a commit: how should one test a code that uses that? It somehow feels out of control.
Exactly. All of this discussion about providing an alias for "lambda" seems out of control, and as a 20-year veteran of Emacs development (where there is no way to make a clean distinction between language and stdlib, apparently nobody has ever heard of TOOWTDI, and 3-line hacks are regularly committed to the core code), it gives me a terrifying feeling of deja vu. Improving the message for invalid identifiers of this particular kind, OTOH, is a straightforward extension of the existing mechanism.
On Wed, Jul 20, 2016 at 06:16:10PM -0300, Danilo J. S. Bellini wrote:
1. Using SyntaxError for lexical errors sounds as strange as saying a misspell/typo is a syntax mistake in a natural language.
Why? Regardless of whether the error is found by the tokeniser, the lexer, the parser, or something else, it is still a *syntax error*. Why would the programmer need to know, or care, what part of the compiler/interpreter detects the error? Also consider that not all Python interpreters will divide up the task of interpreting code exactly the same way. Tokenisers, lexers and parsers are very closely related and not necessarily distinct. Should the *exact same typo* generate TokenError in one Python, LexerError in another, and ParserError in a third? What is the advantage of that?
2. About those lexical error messages, the caret is worse than the lack of it when it's not aligned, but unless I'm missing something, one can't guarantee that the terminal is printing the error message with the right encoding. Including the row and column numbers in the message would be helpful.
It would be nice for the caret to point to the illegal character, but it's not *wrong* to point past it to the end of the token that contains the illegal character.
4. Unicode have more than one codepoint for some symbols that look alike, for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. Ther
Not really. Look at their names:

GREEK CAPITAL LETTER SIGMA
MATHEMATICAL BOLD CAPITAL SIGMA
MATHEMATICAL ITALIC CAPITAL SIGMA
MATHEMATICAL BOLD ITALIC CAPITAL SIGMA
MATHEMATICAL SANS-SERIF BOLD CAPITAL SIGMA
MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL SIGMA

Personally, I don't understand why the Unicode Consortium has included all these variants. But whatever the reason, the names hint strongly that they have specialised purposes, and shouldn't be used when you want the letter Σ. But, if you do, Python will normalise them all to Σ, so there's no real harm done, except to the readability of your code. [...]
when editing a code with an Unicode char like that, most people would probably copy and paste the symbol instead of typing it, leading to a consistent use of the same symbol.
You are assuming that the programmer's font includes glyphs for all of six of those code points. More likely, the programmer will see Σ for the first code point, and the other five will display as a pair of "missing glyph" boxes. (That's exactly what I see in my mail client, and in the Python interpreter.) Why a pair of boxes? Because they are code points in the Supplementary Multilingual Planes, and require *two* 16-bit code units in UTF-16. So naive Unicode software with poor support for the SMPs will display two boxes, one for each surrogate code point. Even if the code points display correctly, with distinct glyphs, your comment that most people will be forced to copy and paste the symbol is precisely why I am reluctant to see Python introduce non-ASCII keywords or operators. It's a pity, because I think that non-ASCII operators at least can make a much richer language (although I wouldn't want to see anything as extreme as APL). Perhaps I will change my mind in a few more years, as the popularity of emoji encourage more applications to have better support for non-ASCII and the SMPs. [...]
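(A quick way to see the "two 16-bit code units" point from Python itself, using one of the mathematical sigmas; just an illustrative check:)

>>> ord('𝚺') > 0xFFFF                  # outside the BMP
True
>>> len('𝚺'.encode('utf-16-be')) // 2  # number of 16-bit code units
2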
6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having Unicode keywords is merely contingent on Python 2 behavior that emphasized ASCII-only code (besides comments and strings).
No, it is a *policy decision*. It is not because Python 2 didn't support them. Python 2 didn't support non-ASCII identifiers either, but Python 3 intentionally broke with that.
7. The discussion isn't about lambda or anti-lambda bias, it's about keyword naming and readability. Who gains/loses with that resource? It won't hurt those who never uses lambda and never uses Unicode identifiers.
It will hurt those who have to read code with a mystery λ that they don't know what it means and they have no idea how to search for it. At least "python lambda" is easy to search for.

It will hurt those who want to use λ as an identifier. I include myself in that category. I don't want λ to be reserved as a keyword.

I look at it like this: using λ as a keyword makes as much sense as making f a keyword so that we can save a few characters by writing:

f myfunction(arg, x, y): pass

instead of def. I use f as an identifier in many places, e.g.:

for f in list_of_functions: ...

or in functional code:

compose(f, g)

Yes, I can *work around it* by naming things f_ instead of f, but that's ugly. Even though it saves a few keystrokes, I wouldn't want f to be reserved as a keyword, and the same goes for λ as lambda.
8. I don't know if any consensus can emerge in this matter about lambdas, but there's another subject that can be discussed together: macros.
I'm pretty sure that Guido has ruled "Over My Dead Body" on anything resembling macros in Python. However, we can experiment with adding keywords and macro-like facilities without Guido's permission. For example:

http://www.staringispolite.com/likepython/

It's a joke, of course, but the technology is real. Imagine, if you will, that you could declare a "dialect" at the start of Python modules, just after the optional coding cookie:

# -*- coding: utf-8 -*-
# -*- dialect math -*-

which would tell importlib to run the code through some sort of source/AST transformation before importing it. That will allow us to localise the keywords, introduce new operators, and all the other things Guido hates *wink* and still be able to treat the code as normal Python.

A bad idea? Probably an awful one. But it's worth experimenting with: it will be fun, and it *just might* turn out to be a good idea.

For the record, in the 1980s and 1990s, Apple used a similar idea for two of their scripting languages, HyperTalk and AppleScript, allowing users to localise keywords. HyperTalk is now defunct, and AppleScript has dropped that feature, which suggests that it is a bad idea. Or maybe it was just ahead of its time.

-- Steve
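(As a rough illustration of the kind of source transformation such a hypothetical "dialect" hook could perform, here is a minimal sketch, mine rather than part of any proposal, that rewrites λ tokens into the lambda keyword using only the standard tokenize module; wiring it into importlib would additionally need a custom loader, which is omitted here. The name translate_dialect is invented.)

import io
import tokenize

def translate_dialect(source):
    # λ is a legal NAME token in Python 3, so we can rewrite it at the token level.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    pieces = [(tok.type, "lambda" if tok.string == "λ" else tok.string)
              for tok in tokens]
    return tokenize.untokenize(pieces)

code = "square = λ x: x * x\nprint(square(7))\n"
exec(compile(translate_dialect(code), "<dialect>", "exec"))  # prints 49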
On Fri, Jul 22, 2016 at 12:25 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Jul 20, 2016 at 06:16:10PM -0300, Danilo J. S. Bellini wrote:
2. About those lexical error messages, the caret is worse than the lack of it when it's not aligned, but unless I'm missing something, one can't guarantee that the terminal is printing the error message with the right encoding. Including the row and column numbers in the message would be helpful.
It would be nice for the caret to point to the illegal character, but it's not *wrong* to point past it to the end of the token that contains the illegal character.
And it's currently being explored here: http://bugs.python.org/issue27582

If you like the idea of the caret pointing somewhere else, join the discussion.

ChrisA
On Jul 21, 2016 7:26 AM, "Steven D'Aprano" <steve@pearwood.info> wrote:
You are assuming that the programmer's font includes glyphs for all six of those code points. More likely, the programmer will see Σ for the first code point, and each of the other five will display as a pair of "missing glyph" boxes. (That's exactly what I see in my mail client, and in the Python interpreter.)
FWIW, on my OS X laptop, with whatever particular fonts I have installed there, using a particular webmail service in the particular browser I use, I see all six glyphs. If I were to copy-paste into a text editor, all bets would be off, and would depend on the editor and its settings. Same for interactive shells run in particular terminal apps. Viewing right now on my Android tablet in the Gmail app, I see a bunch of missing-glyph markers. But quite likely I could install fonts or change settings on this device to render them.
On 7/21/2016 10:25 AM, Steven D'Aprano wrote:
Imagine, if you will, that you could declare a "dialect" at the start of Python modules, just after the optional coding cookie:
# -*- coding: utf-8 -*-
# -*- dialect math -*-
which would tell importlib to run the code through some sort of source/AST transformation before importing it. That will allow us to localise the keywords, introduce new operators, and all the other things Guido hates *wink* and still be able to treat the code as normal Python.
Or one could write a 'unipy' extension to an IDE like IDLE that would translate an entire editor buffer either way. It would take less time than has been expended pushing for a change that will not happen in the near future.
A bad idea? Probably an awful one. But it's worth experimenting with: it will be fun, and it *just might* turn out to be a good idea.

For the record, in the 1980s and 1990s, Apple used a similar idea for two of their scripting languages, HyperTalk and AppleScript, allowing users to localise keywords. HyperTalk is now defunct, and AppleScript has dropped that feature, which suggests that it is a bad idea. Or maybe it was just ahead of its time.
-- Terry Jan Reedy
On 19 July 2016 at 22:05, Rustom Mody <rustompmody@gmail.com> wrote:
A more practical solution would be to take the best of the current Python 2 and Python 3 approaches: report "Invalid character XX in line YY" and just reveal nothing about which lexical category (like identifier) Python thinks the char belongs to.
There's historically been relatively little work put into designing the error messages coming out of the lexer, so if it's a task you're interested in stepping up and taking on, you could probably find someone willing to review the patches. But if you perceive "Volunteers used their time as efficiently as possible whilst fully Unicode enabling the CPython compilation toolchain, since it was a dependency that needed to be addressed in order to permit other more interesting changes, rather than an inherently rewarding activity in its own right" as "wrongheaded", you may want to spend some time considering the differences between community-driven and customer-driven development. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Tuesday, July 19, 2016 at 8:26:41 PM UTC+5:30, Nick Coghlan wrote:
But if you perceive "Volunteers used their time as efficiently as
possible whilst fully Unicode enabling the CPython compilation toolchain, since it was a dependency that needed to be addressed in order to permit other more interesting changes, rather than an inherently rewarding activity in its own right" as "wrongheaded", you may want to spend some time considering the differences between community-driven and customer-driven development.
Hi Nick

Sorry if I caused offense. I've been using Python since around 2001 and it's been a strikingly pleasant relationship. There have been surprisingly few times when Python let me down in a class (the only exception I remember in all these years: https://mail.python.org/pipermail/python-list/2011-July/609369.html ), which is generally a better record than most other languages. So I remain grateful to Guido and the devs for this pleasing creation.

My “wrongheaded” was intended to be quite narrow and technical:

- The embargo on non-ASCII everywhere in the language except identifiers (strings and comments obviously don't count as being "in" the language)
- The opening of identifiers to large swathes of Unicode, which, as you say, hugely widens the surface area of attack

This contradiction was solely what I was pointing out.
Completely matches my opinion. Thanks.

On 12.07.2016 20:36, tritium-list@sdamon.com wrote:
For better or worse, except for string literals (which can be anything as long as you set a coding comment), Python is pure ASCII, which simplifies everything. The λ character is not in the first 128 code points of Unicode, so it is highly unlikely to be accepted.
* ‘Lambda’ is exactly as discouraging to type as it needs to be. A more likely to be accepted alternate keyword is ‘whyareyounotusingdef’.
* Python doesn’t attempt to look like mathematical formulae.
* The ‘lambda’ spelling is intuitive to most people who program.
* TIMTOWTDI isn’t a religious edict. Python is more pragmatic than that.
* It’s hard to type in ALL editors unless your locale is set to (ancient?) Greek.
* … What are you doing to have an identifier outside of ‘[A-Za-z_][A-Za-z0-9_]*’?
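(For context: Python 3 does accept identifiers outside that ASCII pattern, per PEP 3131, which is why reserving λ as a keyword would be backward-incompatible, as noted elsewhere in the thread. A two-line check:)

λ = 3          # a legal Python 3 identifier today
print(λ + 1)   # 4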
Racket has this. I write lambdas there and I don't use (or know people who use) the symbol anyway. It really does not save a lot. Python is more of a hammer than a beautiful mathematical tool, so the examples don't seem too convincing. Plus, sticking to 80 columns and optimising for shortness does not make so much sense nowadays.
Stephan Houben <stephanh42@gmail.com> writes:
Here is my speculative language idea for Python:
Thank you for raising it.
Allow the following alternative spelling of the keyword `lambda': λ […]
Therefore I would really like this to be an official part of the Python syntax.
I know people have been clamoring for shorter lambda-syntax in the past, I think this is a nice minimal extension.
The question is not whether it would be nice, nor how many people have clamoured for it. That you feel this supports the proposal isn't a good sign :-) The question is: What significant improvements to the language are made by this proposal, to counteract the significant cost of *any* such change to the language syntax?
Advantages:
* The lambda keyword is quite long and distracts from the "meat" of the lambda expression. Replacing it by a single-character keyword improves readability.
I disagree on this point. Making lambda easier is an attractive nuisance; the ‘def’ statement is superior (for Python) in most situations, and so lambda expressions should not be easier than that.
* The resulting code resembles more closely mathematical notation (in particular, lambda-calculus notation), so it brings Python closer to being "executable pseudo-code".
How is that an advantage?
* The alternative spelling λ/lambda is quite intuitive (at least to anybody who knows Greek letters.)
I reject this use of “intuitive”; no one knows intuitively what lambda is, what λ is, what the correspondence between them is, or what they mean in various contexts. All of that needs to be learned, specifically. So if this is an advantage, it needs to be expressed somehow other than “intuitive”. Maybe you mean “familiar”, and avoid that term because it makes for a weaker argument?
Disadvantages:
I agree with your assessment of the disadvantages, and reiterate the inherent disadvantage that any language change brings significant cost to the Python core developers and the whole Python community. That's why most such suggestions must clear a significant hurdle by demonstrating a benefit. -- \ “Prediction is very difficult, especially of the future.” | `\ —Niels Bohr | _o__) | Ben Finney
Stephan: Have you met the Coconut project already?

https://pypi.python.org/pypi/coconut
http://coconut.readthedocs.io/en/master/DOCS.html

They create a superset of Python with the aim of allowing smaller functional programs to be written, and have a fully featured set of functional operators and such. It is a pure-Python project that pre-compiles "coconut" program files to .py at compile time. They have a shorter syntax for lambda already - maybe that could be of use to you - and maybe you can get them to accept your suggestion - it certainly would fit there.

"""
Lambdas

Coconut provides the simple, clean -> operator as an alternative to Python’s lambda statements. The operator has the same precedence as the old statement.

Rationale

In Python, lambdas are ugly and bulky, requiring the entire word lambda to be written out every time one is constructed. This is fine if in-line functions are very rarely needed, but in functional programming in-line functions are an essential tool.

Example:

dubsums = map((x, y) -> 2*(x+y), range(0, 10), range(10, 20))
"""
On 7/13/16, Ben Finney <ben+python@benfinney.id.au> wrote:
Stephan Houben <stephanh42@gmail.com> writes:
* The resulting code resembles more closely mathematical notation (in particular, lambda-calculus notation), so it brings Python closer to being "executable pseudo-code".
How is that an advantage?
It could help promote Python.
Doesn't this kind of violate Python's "one way to do it"?

(Also, sorry for the top post; I'm on mobile right now...)

--
Ryan
[ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong.
http://kirbyfan64.github.io/
On 7/12/16, Stephan Houben <stephanh42@gmail.com> wrote:
Hi list,
Here is my speculative language idea for Python:
Allow the following alternative spelling of the keyword `lambda':
λ
(That is "Unicode Character 'GREEK SMALL LETTER LAMDA' (U+03BB).")
Background:
I have been using the Vim "conceal" functionality with a rule which visually replaces lambda with λ when editing Python files. I find this a great improvement in readability since λ is visually less distracting while still quite distinctive. (The fact that λ is syntax-colored as a keyword also helps with this.)
However, at the moment the nice syntax is lost when looking at the file through another editor or viewer. Therefore I would really like this to be an official part of the Python syntax.
1. What is the future of coding? I feel it is not only the language that translates your ideas into reality. Artificial intelligence in (future) editors (and also vim conceal) is probably the right way to enhance your coding power (with lambdas too).

2. If we would like to enhance Python syntax with Unicode characters, then I think it is good to see the larger context. There is an ocean of possibilities for how to do it (probably good possibilities too). For example, Unicode could help us add new operators. But it also brings a lot of questions (how do I write Knuth's arrow in my editor?) and difficulties (how do we let classes implement special methods for these (*) new operators? how do we make, for example, a triple arrow possible?). I propose that we be prepared before opening Pandora's box. :)

(*) All of them? And probably also all those added by future enhancements of Unicode?

3. Questions around "only one possibility for how to write it" could probably be answered with this:

a < b
a.__lt__(b)
On Wed, Jul 13, 2016 at 3:43 PM, Pavol Lisy <pavol.lisy@gmail.com> wrote:
3. Questions around "only one possibility for how to write it" could probably be answered with this:

a < b
a.__lt__(b)
Those aren't the same, though. One is the interface, the other is the implementation. Dunder methods are for defining, not for calling. Also:

rosuav@sikorsky:~$ python3
Python 3.6.0a2+ (default:4ef2404d343e, Jul 11 2016, 12:37:20)
[GCC 5.3.1 20160528] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class B:
...     def __gt__(self, other):
...         print("Am I greater than %s?" % other)
...         return False
...
>>> a = 5
>>> b = B()
>>> a < b
Am I greater than 5?
False
Operators can have multiple implementations (in this case, the interpreter found that "int < B" didn't have an implementation, so it switched it to b > a and re-evaluated). ChrisA
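(A one-line follow-up, assuming the b instance from the transcript above, that makes the fallback explicit: int's own __lt__ gives up by returning NotImplemented, at which point Python tries the reflected B.__gt__.)

print((5).__lt__(b))   # NotImplemented -- hence the fallback to type(b).__gt__(b, 5)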
Pavol Lisy <pavol.lisy@gmail.com> writes:
Questions around "only one possibility for how to write it" could probably be answered with this:

a < b
a.__lt__(b)
The maxim is not “only one way”. That is a common misconception, but it is easily dispelled: read the Zen of Python (by ‘import this’ in the interactive prompt).

Rather, the maxim is “There should be one obvious way to do it”, with a parenthetical “and preferably only one”.

So the emphasis is on the way being *obvious*, and all other ways being non-obvious. This leads, of course, to choosing the best way to also be the one obvious way to do it.

Your example above supports this: the comparison ‘a < b’ is the one obvious way to compare whether ‘a’ is less than ‘b’.

-- \ “It is forbidden to steal hotel towels. Please if you are not | `\ person to do such is please not to read notice.” —hotel, | _o__) Kowloon, Hong Kong | Ben Finney
On 7/13/16, Ben Finney <ben+python@benfinney.id.au> wrote:
Pavol Lisy <pavol.lisy@gmail.com> writes:
Questions around "only one possibility for how to write it" could probably be answered with this:

a < b
a.__lt__(b)
The maxim is not “only one way”. That is a common misconception, but it is easily dispelled: read the Zen of Python (by ‘import this’ in the interactive prompt).
Rather, the maxim is “There should be one obvious way to do it”, with a parenthetical “and preferably only one”.
So the emphasis is on the way being *obvious*, and all other ways being non-obvious. This leads, of course, to choosing the best way to also be the one obvious way to do it.
Your example above supports this: the comparison ‘a < b’ is the one obvious way to compare whether ‘a’ is less than ‘b’.
-- \ “It is forbidden to steal hotel towels. Please if you are not | `\ person to do such is please not to read notice.” —hotel, | _o__) Kowloon, Hong Kong | Ben Finney
I don't support this lambda proposal (at this moment - but somebody could probably convince me). But if we do accept it, couldn't the Unicode version then become the obvious one?
On Tue, Jul 12, 2016 at 11:56 PM, Pavol Lisy <pavol.lisy@gmail.com> wrote:
I don't support this lambda proposal (at this moment - but somebody could probably convince me).
I don't either, but I'm glad it was brought up regardless...
But if we do accept it, couldn't the Unicode version then become the obvious one?
I certainly hope not. As a user with no lambda key on my keyboard, the only way that I know to input one is to do a google search for "unicode greek letter lambda" and copy/paste one of those characters into my editor :-).

FWIW, I think that the cons here far outweigh the pros. As a former physicist, when I see a lambda character I immediately think of a whole host of things (wavelength!) and none of them is "anonymous function". Perhaps someone who works more with lambda calculus (or with a more rigorous comp-sci background) would disagree -- but my point is that this notation would possibly only serve a small community. It could also break code for another small group of users who are using λ as a variable name to be clever (which is another practice that I wouldn't support...), and finally, I think it might just be confusing for other people (1-character non-ascii keywords? If my editor's syntax definition wasn't up-to-date, I'd definitely expect a `SyntaxError` from that).

All of that aside, it seems like the pros that the original poster mentioned could be gained by writing a plugin for your editor that makes the swap on save and load. Apparently this already exists for some editors -- why risk breaking existing code to add a syntax that can be handled by an editor extension?
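(A minimal sketch of the save/load swap such an editor plugin could perform; the conceal/reveal names are invented for illustration, and this naive regex version would also rewrite matches inside strings and comments, which a real plugin should avoid, e.g. by going through the tokenizer.)

import re

def conceal(source):
    # Display form: show the keyword as λ.
    return re.sub(r"\blambda\b", "λ", source)

def reveal(source):
    # Saved form: turn λ back into the real keyword.
    return re.sub(r"\bλ\b", "lambda", source)

print(conceal("f = lambda x: x + 1"))  # f = λ x: x + 1
print(reveal("f = λ x: x + 1"))        # f = lambda x: x + 1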
-- Matt Gilson // SOFTWARE ENGINEER // matt@getpattern.com
On Tue, Jul 12, 2016 at 7:38 AM, Stephan Houben <stephanh42@gmail.com> wrote:
Hi list,
Here is my speculative language idea for Python:
Allow the following alternative spelling of the keyword `lambda':
λ
(That is "Unicode Character 'GREEK SMALL LETTER LAMDA' (U+03BB).")
Just to be a small data point, I have written code that uses λ as a variable name (as someone mentioned elsewhere in the thread, Jupyter Notebook makes typing Greek characters easy). Because this would break code that I have written, and I suspect it would break other code as well, I am -1 on the proposal. How selfish of me! Cody
participants (30)
- Alan Cristhian
- Alexander Belopolsky
- Ben Finney
- Bernardo Sulzbach
- Chris Angelico
- Cody Piersall
- Danilo J. S. Bellini
- David Mertz
- Ethan Furman
- Giampaolo Rodola'
- Joao S. O. Bueno
- John Wong
- João Santos
- Matt Gilson
- MRAB
- Neil Girdhar
- Nick Coghlan
- Paul Moore
- Pavol Lisy
- Random832
- Rustom Mody
- Ryan Gonzalez
- Stephan Houben
- Stephen J. Turnbull
- Stephen J. Turnbull
- Steven D'Aprano
- Sven R. Kunze
- SW
- Terry Reedy
- tritium-list@sdamon.com