Allow additional separator character in variables

Python allows the underscore character as a separator in variable names. This is better than nothing, but it still does not make the code look much better.

**Proposal**: allow an additional separator, namely the hyphen character.

**Benefits**: this should provide a significant readability improvement. Compared to most syntax change proposals that I remember, this one seems to have really tangible benefits, so all in all I see it as a significant language improvement. Besides its direct benefit as a good-looking separator, it also gives an opportunity to reduce "camelCase" and similar ugly inclusions in code. So one can easily compose human-readable variable names, e.g.:

my-variable
figure-shape---width

etc.

**Problem**: currently the hyphen character is used as the minus operator! The problem is as old as most programming languages, and is inherited from the initial poverty of character sets. Therefore I don't see a single optimal solution to this. Possible solutions:

Solution 1: Use another similar-looking character from Unicode, for example U+02D7 (MODIFIER LETTER MINUS SIGN). At the same time, IMO, it is necessary to allow a proper minus character for the minus operator, namely U+2212 MINUS SIGN. This will allow proper typography of source code. Main benefit of this approach: no breakage of the current code base, since the new characters are additional to the existing ones.

Solution 2 (radical): Disallow the hyphen as the minus operator, and use only U+2212 MINUS SIGN. I.e. "a-b" would be a variable and "a − b" a minus operation. Advantage: an opportunity to correct the improper character usage once and for all. Disadvantage: breakage of the current code base, plus forced use of UTF-8 storage (consider e.g. editors without Unicode support). Thus such a solution will most probably cause a howl reaching to the sky among users, even though many modern editors allow Unicode and custom operator styling, for example to distinguish a dash from a hyphen in a monospaced editor.

So that is my proposal, and as usual I urge a constructive conversation (i.e. proposing to write one's own language/translator is not constructive conversation).

Cheers,
Mikhail

In summary, this proposal seems to be: Give two visually indistinguishable characters different meanings to improve readability. I'm not sure, but something about that sentence doesn't seem quite right. -- Greg

On Sun, Nov 19, 2017 at 1:01 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Since you can already avoid camelCase by using snake_case, I'm not sure how much you really gain by adding the hyphen.
While I agree with "my-variable", I don't like the triple hyphen. What's the benefit?
Both of these create extremely confusing situations, where two nearly-identical symbols have completely different meanings. Solution 2 is a massive backward-compatibility break. You're not just disallowing something that's been legal since the language was introduced - you're giving it a completely different meaning. That's basically a non-starter right there. Solution 1 is at least reasonably plausible, in that you're taking something that's currently a SyntaxError and giving it a valid meaning. There is no code that could be broken by that (AFAIK). However, there's still the problem that you're introducing a marginal benefit and a significant confusion potential; plus, you'd be adding a special case to the Unicode identifier rules, which is not something to be done lightly. How much benefit do you REALLY get from using hyphens rather than underscores? ChrisA
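As a quick check of the point that Solution 1 only touches text that is currently rejected, the sketch below tries to compile source containing the two proposed characters. The helper name `compiles` is made up for illustration, and the result reflects current CPython 3 behaviour rather than any guarantee about future versions.

```python
def compiles(source):
    """Return True if `source` is accepted by the current Python compiler."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


MODIFIER_MINUS = "\u02d7"  # MODIFIER LETTER MINUS SIGN, proposed as a name separator
MINUS_SIGN = "\u2212"      # MINUS SIGN, proposed as the subtraction operator

print(compiles("my_variable = 1"))                      # today's spelling
print(compiles(f"my{MODIFIER_MINUS}variable = 1"))      # proposed separator
print(compiles(f"a = 1; b = 2; c = a {MINUS_SIGN} b"))  # proposed operator
```

Only the first line is accepted today, which is why giving the other two a meaning would not change the behaviour of existing programs.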

On 19 November 2017 at 12:01, Mikhail V <mikhailwas@gmail.com> wrote:
Regardless of any potential readability merits, backwards compatibility requirements combined with the use of the hyphen character as a binary operator prohibit such a change:

    >>> my = variable = 1
    >>> my-variable
    0

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 19 November 2017 at 12:32, Nick Coghlan <ncoghlan@gmail.com> wrote:
Ah, sorry - I now see you addressed the basic version of that. The alternative of "Use a character that computers can distinguish, but humans can't" isn't an improvement, since it means introducing the exact kind of ambiguity that Python seeks to avoid by using indentation for block delimiters (rather than having the computer read braces, and humans read indentation). The difficulty of reliably distinguishing backticks from regular single quotes is also the main reason they're generally discounted from reintroduction for any other use case after their usage as an alternative to the repr builtin was dropped in Python 3.0, and it's why Python 3 prohibits mixing tabs and spaces for indentation by default.

For anyone tempted to suggest "What about multiple hyphens indicating continuation of the variable name?", that's still a compatibility problem due to the unary minus operator:

    >>> my--variable
    2
    >>> my---variable
    0

Would hyphens in variable names improve readability sometimes? Potentially, but not enough to live with making binary subtraction expressions ambiguous (hence the consistency amongst almost all current text based programming languages on this point).

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
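To see why runs of hyphens are already meaningful, one can ask the parser how it reads them. This is only an illustrative sketch using the standard `ast` module; the `namespace` dictionary simply mirrors the `my = variable = 1` assignment from the earlier message.

```python
import ast

namespace = {"my": 1, "variable": 1}

for text in ("my-variable", "my--variable", "my---variable"):
    tree = ast.parse(text, mode="eval")
    # The parser sees one binary subtraction; extra hyphens become unary minus.
    print(text, "->", ast.dump(tree.body))
    print("   value:", eval(compile(tree, "<example>", "eval"), namespace))
```

Each expression is a `BinOp` with a `Sub` operator, and every additional hyphen wraps the right operand in one more `UnaryOp`, so none of these spellings is free to be reused as a name.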

On Sun, Nov 19, 2017 at 3:42 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That seems to be another showcase of the misfortune that Python uses the hyphen for the minus operator. I know it is not the language designer's fault, because basic ASCII simply did not include a minus character. But do you realise that the **current** problem you are addressing is that font designers forgot to make the minus character (in monospaced fonts) distinct from the hyphen character? Well, what can I say, I just think it should be a reason to make a collective complaint to font providers, rather than to silently accept this and adapt the language design to someone's sloppy font design. As an aid for monospace die-hards, to minimise the confusion one could publish a style guide that recommends enclosing the minus operator (currently the hyphen char) in spaces, like a - b, and probably disallow the newly proposed hyphen character at the beginning of identifiers. That would still leave potential for confusion, because you can't force everyone to follow style guides, but one should struggle to break out of this cycle anyway.
Would hyphens in variable names improve readability sometimes?
For reading code, indeed, always and very much. Of course not in case I would be forced to use monospaced font with a similar minus and hyphen. But in that case I am already accepting the level of readability of 12th century, so this would not make things much worse, and I would simply put spaces around the minus operator and try to highlight it with some strong color. Mikhail

Python does not use U+2010 HYPHEN for the minus operator, it uses the U+002D (-) HYPHEN-MINUS. In some monospace fonts, there is a subtle difference between U+002D, U+2013 EN DASH, and U+2014 EM DASH, but it's usually hard to tell them *all* apart. If you want to make a proposal, I'd suggest that you limit it to allowing the U+2010 HYPHEN to be used for names. U+002D simply cannot be changed because it would break billions of lines of code. On Sat, Nov 18, 2017 at 10:44 PM, Mikhail V <mikhailwas@gmail.com> wrote:
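For readers who want to check which character is which, the standard `unicodedata` module reports the official names of the code points mentioned in this message. A minimal sketch:

```python
import unicodedata

# Code points discussed above: hyphen-minus, hyphen, minus sign, en dash, em dash.
for char in "\u002d\u2010\u2212\u2013\u2014":
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")
```

Only the first of these, U+002D HYPHEN-MINUS, is the operator Python accepts today.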

On 19/11/2017 05:01, Nick Timkovich wrote:
How about allowing ¬ (ASCII 172, U+00ac, NOT sign) in variable names, as in my¬variable? It has the advantages that:
- it is visually distinguishable even in mono-spaced fonts (personally I use mono-spaced all of the time when programming but I know that I am a dinosaur),
- it is actually on many keyboards as a single character (I don't know of any which actually produce different characters for minus on the numeric keypad and hyphen elsewhere), so it can be typed with a single key press,
- it is generally unused AFAIK other than in papers about logic,
- it is currently unused in the Python language.

This might upset some who would like to use it to replace the unary not operator but I suspect that it would be far fewer people than the potential breakages discussed so far.

-- Steve (Gadget) Barnes
Any opinions in this message are my personal opinions and do not reflect those of my employer.

There is an unfortunate ambiguity in using a character that means "not" as a word separator:

    nuke.do¬launch()

"But... I called the method which explicitly did *not* launch the nuke!"

Stephan

On 11/19/17 1:33 AM, Steve Barnes wrote:
How about allowing ¬, (ASCII 172, U+00ac, NOT sign), in variable names as in my¬variable - it has the advantages that:
There is NO such character in ASCII. ASCII is a 7 bit character set, and no ASCII code has a value bigger than 127. There are a number of Extended ASCII character sets (hundreds if not thousands). One common one is ISO 8859-1 also called ISO LATIN-1 which has this character at this location, but Extended ASCII is NOT ASCII (Note, it is even produced by a very different standards body). The character also occurs here in ANSI Extended ASCII, but again, this is NOT ASCII. -- Richard Damon
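The distinction can be checked directly with Python's codecs. This is just a small sketch; `NOT_SIGN` is a name chosen here for illustration.

```python
NOT_SIGN = "\u00ac"  # the NOT SIGN character discussed above

print(ord(NOT_SIGN))               # code point 172, outside the 7-bit range 0..127
print(NOT_SIGN.encode("latin-1"))  # a single byte in ISO 8859-1

try:
    NOT_SIGN.encode("ascii")
except UnicodeEncodeError as error:
    print("not ASCII:", error.reason)
```

The character encodes to one byte in ISO 8859-1 but cannot be encoded as ASCII at all.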

On Sat, Nov 18, 2017 at 8:44 PM, Mikhail V <mikhailwas@gmail.com> wrote:
It is not a misfortune or even true that Python uses hyphen for minus. The name of the character used in Python is HYPHEN-MINUS. http://unicode.org/cldr/utility/character.jsp?a=002D It is both a hyphen and a minus. And it served double-duty even in ASCII. A language that requires using characters not present on standard keyboards is unlikely to be successful. Or we would all be programming in APL. And it's not as if no one ever thought of this before. Maybe you've heard of COBOL?
Would hyphens in variable names improve readability sometimes?
For reading code, indeed, always and very much.
No it wouldn't. Your personal preference is hardly authoritative. I am extremely skeptical that a legitimate usability study would find that record-count is better than record_count. There are studies showing that monospace fonts are harder to read than proportionally spaced ones, e.g., http://journals.sagepub.com/doi/pdf/10.1177/001872088302500303. Yet many programmers use monospace fonts because the advantages -- in our opinions -- outweigh the disadvantages. And the reality is that only my opinion matters when I'm choosing the fonts to display my code in, not yours.

You-know-what-really-would-increase-readability? Allowing-the-use-of-spaces-in-variable-names. As-you-can-see-from-this-example-hyphens-between-words-decreases-readability.

And because spaces between words is mostly not valid syntax currently, this change would be easier to introduce than breaking every single program out there by re-purposing hyphen-minus. But I'm not seriously proposing this because I think the modest benefits are outweighed by the many problems it would introduce.

--- Bruce

On Nov 18 2017, Bruce Leban <bruce-lcXLltxty2U@public.gmane.org> wrote:
Luckily, there is a compromise: use backticks to quote identifiers:

    `test mode` = True
    if `test mode`:
        `display message`("just a test")

I'm not seriously suggesting that, but I still wonder what people think about it. I sort of like it, actually. The `(" part is pretty ugly (which is why I included it in the example), but there's no syntax that can completely avoid ugly corner cases. I think in most cases the context would also make it easy to distinguish single quotes and backticks even when they're typographically similar.

Cheers, -Nikolaus
-- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

Chris A wrote:
Both of these create extremely confusing situations, where two nearly-identical symbols have completely different meanings.
In reality, hyphen and Minus sign are not even closely similar - Minus is ca. twice as wide, however the citizens of the Monospaced Kingdom may disagree ; ) Though I think its population will dramatically decrease in one or two decades.
Solution 2 is a massive backward-compatibility break.
Yep, although eliminating improper usage is always a good thing in the long run (and it needs fewer new characters). But I do realise that it is a non-starter.
... a marginal benefit ... How much benefit do you REALLY get from using hyphens rather than underscores?
IMO it's far higher than marginal, at least compared to most syntax proposals I remember. One of the hardest and most important tasks a programmer faces is making readable variable names. Underscores are still one of the MOST ugly things I observe currently in Python syntax. This means that if this is fixed, there will be only "small warts" left (such as single quotes). For me, one "cheap" solution against underscores is to use syntax highlighting which grays them out, but if they end up looking like spaces, it becomes a bit confusing, e.g. in a function with many arguments. Also, unfortunately, not many editors allow easy (if any) highlighting customisation at that level. One possible solution is to use a custom font that has a hyphen in place of the underscore, but this is not a proper solution, because, well, the character standard is still there, regardless of whether I like it or not. And one should still have an alternative, i.e. *not only one* separator, for example to denote something "special". It can also add some semantic emphasis, e.g.: my-variable_global

Mikhail

On 19 November 2017 at 13:22, Mikhail V <mikhailwas@gmail.com> wrote:
Changing the way editors display underscore-using variable names still seems like a more productive direction to explore than changing the text encoding read by the compiler. The current source code structure is well-defined and unambiguous, so there's no clear benefit to change things at that level, and significant downsides in terms of complexity, forwards and backwards compatibility concerns, and high barriers to pervasive adoption.

By contrast, if the argument for using a different Unicode character is "Editors will reliably display Unicode hyphen characters differently from the way they display minus signs (or vice-versa)", then we can just as easily say "If users are finding the way that text editors display snake_cased_names to be consistently hard to read, then text editors should change the way that they display snake_cased_names (or at least make it easy for users to opt-in to displaying them differently)".

For example, they could decide to replace underscores in variable names for display purposes with hyphens plus the underscore combining diacritic, or the combining macron below:
- https://en.wikipedia.org/wiki/Underline#Unicode
- https://en.wikipedia.org/wiki/Macron_below

Then when the cursor was placed inside the variable name, they could revert to displaying those characters as regular underscores. This kind of editor level modification would also extend itself well to underscores in numeric literals, as there the appropriate pseudo-separator shown when the literal wasn't being edited would be locale dependent.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
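The display-only substitution described in this message could be sketched roughly as follows. This is not taken from any real editor: `render_name`, `display_form` and `DISPLAY_SEPARATOR` are hypothetical names, and the choice of hyphen plus COMBINING MACRON BELOW is just one of the renderings mentioned above. The standard `tokenize` module is used so that strings and comments are left alone.

```python
import io
import tokenize

# Hyphen followed by COMBINING MACRON BELOW (an assumed rendering choice).
DISPLAY_SEPARATOR = "-\u0331"


def render_name(name):
    """Swap interior underscores for the display separator, keeping edge underscores."""
    core = name.strip("_")
    if "_" not in core:
        return name
    head = name[: len(name) - len(name.lstrip("_"))]
    tail = name[len(name.rstrip("_")):]
    return head + core.replace("_", DISPLAY_SEPARATOR) + tail


def display_form(source):
    """Return `source` as an editor might show it; the stored text is untouched."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    names = [tok for tok in tokens if tok.type == tokenize.NAME]
    # Work right to left so that earlier column offsets stay valid.
    for tok in reversed(names):
        row, start = tok.start
        end = tok.end[1]
        line = lines[row - 1]
        lines[row - 1] = line[:start] + render_name(tok.string) + line[end:]
    return "".join(lines)


print(display_form("snake_cased_name = 'keep_me'  # and_me\n"))
```

Because only NAME tokens are rewritten, the string literal and the comment in the sample keep their underscores, and names such as `__init__` are shown unchanged.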

On Sun, Nov 19, 2017 at 5:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Indeed that would be a solution. *Would* be. But I don't know of any editor that does that (and they should not in this case, see below). My view on the pros and cons of this solution:

Pros: other languages also have the same issue, so if editor maintainers agreed to compromise and introduce a dynamic substitution feature, that would give users the possibility to face-lift other syntaxes as well.

Cons: this feature would make sense only if the substitution happens in those parts where it should, namely it should not touch anything in string literals or comment blocks. So the lexer should 'know' where to substitute and where not to, and this is not the same as just passing the internal memory representation through a translation table.

My opinion about this, however, is based on other principles. Imagine that you are the language designer and I am responsible for the typesetting component of some editor, and we have such a dialogue:

you: "hey Mikhail, we use hyphen for the minus operator, now can you please patch the renderer so that our users see the minus instead of the hyphen, and please make sure users can also toggle it in real time to see what actual char is there, and also make the substitution only in the places where the hyphen is used as the operator."

me: "well, I understand your complaint, but my renderer already supports Unicode, and I do my best to support typography practices, namely to render hyphen as *hyphen*, which has been well established for centuries in typography, and is defined as a dash of 50% of the width of the letter "o" and is aligned to lowercase. The same goes for the Minus glyph, which is defined as ca. 110% of the "o" width and is aligned to the digits and caps. So you as the language designer should be interested in delivering best practices to the users, and the hyphen is far more important for the lexical structure of the written language than the minus operator. Why wouldn't you just try to solve the issue in a "fair" way?"
By the fair way I mean the way which tends to bring the correct usage of characters back, instead of trying to hide the problem with some patch. Now I can't say what is the least problematic way for Python, but if I were responsible for that, I would base the solution on these principles:

1. The future versions of the syntax, ideally, must allow ONLY minus U+2212 for the minus operator, and allow hyphens U+002D in identifiers. Since that is impossible at the moment, I must think out the least painful transition.

2. I want users to be able to use the underscore as well. The underscore is derived from mechanical typewriters: to make underlined text one pushed the carriage back and typed underscores to make the line under the text. In digital print it does not make much sense any more, and as a separator it looks ugly, but it is still not so hopeless. Currently the underscore lies below the font baseline, but if one moves it closer to the baseline, then it can be used as a fairly adequate additional separator, so a user would get more ways to denote lexical identifiers.

3. I don't want to break backward compatibility, but I am still oriented towards compliance with typography practices and standards for character codes. I also want users who are interested in better UX to get the benefits out of the box, without forcing them to tweak their text editors or write their own translators.

What to do? One option IMO would be to introduce a header in the sources, e.g.:

# opt-in: hyphen-minus

which would tell the parser to toggle the "new" rules, namely that U+2212 would be parsed as the minus operator and hyphens as part of identifiers. Then users who are aware of the benefits, and remember monospaced fonts only as an unpleasant incident from their youth, can enjoy the beauty of source code without any tweaks, and the only thing they need to do is to bind a key to input the U+2212 sign. The users who do not want it just leave this out.
Further, I'd add a command-line utility that can translate directly to the "old" syntax, in case one wants to export a project in the old syntax. That way one could avoid the backward compatibility issue. That is just one option that comes to my mind. Another thing which might be important in this regard: say you want to publish a book about Python. With such syntax you could directly import the code into DTP software without making any corrections, so it looks almost like normal English text, with no worries about strange-looking minus operators.

Mikhail
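A translator of the kind described here, from the opt-in syntax back to today's Python, might look roughly like the sketch below. It is an assumption-laden illustration, not the author's tool: the names `translate`, `TOKEN` and `OPT_IN` are hypothetical, hyphens inside names are mapped to underscores, and the regular expression only handles plain string literals and comments (it does not look inside f-string replacement fields).

```python
import re

OPT_IN = "# opt-in: hyphen-minus"  # header text taken from the message above
MINUS_SIGN = "\u2212"              # MINUS SIGN, the proposed subtraction operator

STRING = "|".join([
    r"'''.*?'''",            # triple-quoted, single quotes
    r'""".*?"""',            # triple-quoted, double quotes
    r"'(?:\\.|[^'\\\n])*'",  # single-quoted
    r'"(?:\\.|[^"\\\n])*"',  # double-quoted
])

# Alternatives are tried left to right, so strings and comments are consumed
# (and passed through unchanged) before names and operators are looked at.
TOKEN = re.compile(
    rf"(?P<string>{STRING})"
    r"|(?P<comment>#[^\n]*)"
    r"|(?P<name>(?<!\w)[^\W\d]\w*(?:-[^\W\d]\w*)+)"
    rf"|(?P<minus>{MINUS_SIGN})",
    re.DOTALL,
)


def _fix(match):
    if match.lastgroup == "name":
        return match.group().replace("-", "_")  # hyphenated name -> snake_case
    if match.lastgroup == "minus":
        return "-"                              # MINUS SIGN -> hyphen-minus
    return match.group()                        # strings and comments unchanged


def translate(source):
    """Translate opt-in source to current syntax; other files pass through."""
    first_line = source.split("\n", 1)[0].strip()
    if first_line != OPT_IN:
        return source
    return TOKEN.sub(_fix, source)


new_style = (
    "# opt-in: hyphen-minus\n"
    "figure-width = total-width \u2212 margin  # keep a-b here\n"
    "label = 'left-right'\n"
)
print(translate(new_style))
```

Matching strings and comments first is one simple way to address the concern raised earlier that substitution must not touch string literals or comment blocks; a production tool would more likely need a real lexer.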

On Mon, Nov 20, 2017 at 11:01 AM, Mikhail V <mikhailwas@gmail.com> wrote:
The least painful transition is to devise an entirely new language, one that is built around whatever rules you like. That way, there's no backward compatibility problem - you pick a new file extension, a new executable name, etc, etc, and nobody gets confused. Of course, since actually building a cross-platform language interpreter is a ton of work, and getting an ecosystem of libraries is even more work, you'll want to make your language compile to Python, but *in your source code* you can use whatever symbols you want. Since you want U+2212 for subtraction, you probably want to use a few other non-ASCII operators too. U+2044 FRACTION SLASH presents itself as a viable way to create a fractions.Fraction literal. Instead of * and @ for multiplication, you could have U+00D7 and... uhh, I'm not a mathematician, but I'm sure there's an appropriate character. For the most part, you'd have code that is trivially transformable to and from Python. Start by writing the "my language to Python" translator (it can throw away comments and stuff, the Python code should be considered "object code" rather than "source code"), and then look into the reverse transformation for the benefit of people trying to learn your language. As long as you don't actually call your language "Python", you're free to do what you like without worrying about compatibility etc. ChrisA

On 2017-11-20 00:20, Chris Angelico wrote:
If we must use U+2212 (MINUS SIGN) for the minus sign, then it's only right that we must also use U+2010 (HYPHEN) for the hyphen. U+002D (HYPHEN-MINUS) can be left alone, its meaning depending on the programming language, as at present.

Bruce Leban wrote:
It is not a misfortune or even true that Python uses hyphen for minus. The name of the character used in Python is HYPHEN-MINUS.
This is pure demagogy: name it HYPHEN-MINUS-TINYDASH if you like, but what aspect of reality does it change apart from its name? "Hyphen-minus" would make sense for mechanical typewriters. So it is a hyphen, a character used for centuries before typewriters even appeared, and used as such now in 99 percent of media. Just take some Python sources and count the number of underscores and minus operators. This will give you an idea of how important separators are compared to the minus operator. Don't forget to also include cases where variables are written without any separator but should have one.
I am extremely skeptical that a legitimate usability study would find that record-count is better than record_count.
Oh come on, you probably also want a study of emoticons as separators?

On Sun, Nov 19, 2017 at 5:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
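The counting exercise suggested in this message could be sketched with the standard `tokenize` module. The function name `count_separators` is made up here, and the sketch does not tell binary subtraction from unary minus, nor does it detect names that "should" have had a separator.

```python
import io
import tokenize


def count_separators(source):
    """Count underscores inside names and '-' operators in Python source text."""
    underscores = minus_operators = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Leading and trailing underscores are not word separators.
            underscores += tok.string.strip("_").count("_")
        elif tok.type == tokenize.OP and tok.string == "-":
            minus_operators += 1  # binary and unary minus are not told apart
    return underscores, minus_operators


sample = "total_width = left_margin - right_margin\noffset = -total_width\n"
print(count_separators(sample))  # (4, 2)
```

Running it over a real code base would give the raw frequencies, though frequency alone does not settle which character deserves which meaning.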

On Mon, Nov 20, 2017 at 11:18 AM, Mikhail V <mikhailwas@gmail.com> wrote:
If you want to. But a simple a-b test (or is that an a_b test?) of hyphens and underscores would be sufficient. For anecdotal evidence, I prefer to write git branch names with hyphens, eg "git checkout rosuav/process-check-run". It's not about the typing (tab completion means I don't have to type either form), it's about the way it looks. So there definitely is _some_ advantage here. I just don't think it's significant, not worth the hassle of changing things around. And this is still ASCII-only. ChrisA

On Sun, Nov 19, 2017 at 4:18 PM, Mikhail V <mikhailwas@gmail.com> wrote:
You've gone from making a bad suggestion to trolling. While you may *think* this character is a hyphen, you're simply wrong. When ASCII was created it was a 7-bit character set limited to 94 printable characters plus space and 33 control characters. The designers explicitly added a single double-duty character for both hyphen and minus, just as they added a single character for single and double quotes rather than left and right quotes. Were they mimicking the typewriter? Maybe they were following the example of Hollerith code which only had uppercase. It doesn't matter. It's not that they were unaware of the different uses or the existence of typographic quotes. Just as monospace fonts were not created because people didn't know about variable width fonts. It is what it is. And pretending other people are idiots is inappropriate. You can use accent grave as a left quote and apostrophe as a right quote if you want to, but if you insist that Python is living in the dark ages because it doesn't do things *your way* then you're just being rude. ... render
False. There is no standard going back centuries defining the widths of the different kinds of dashes. For that matter, there is no standard *today* for what letters and symbols look like. See Doug Hofstadter's great paper on this https://web.stanford.edu/group/SHR/4-2/text/hofstadter.html or the Unicode consortium list of emoji https://unicode.org/emoji/charts/full-emoji-list.html for great examples of the non-standard nature of typography. Heck almost all vendors put cheese on the *hamburger* emoji when obviously it only belongs on the cheeseburger emoji. And Google puts the cheese below the meat which is clearly wrong as the international standard for cheeseburgers puts the cheese on the top.
Just take some Python sources and count the amount of underscores and minus operators. This will give you an image of how important separators are compared to minus operator.
A non sequitur. Count the number of instances of the letter Z in English vs. the letter E, which tells you that Z is unimportant. So let's get rid of it. Of course that may piss off the Polish people since it's the 9th most frequent letter (4.9%) in Polish. This makes a great story -- see "Meihem In Ce Klasrum" http://www.tau.ac.il/~pauzner/funs/simpler.html -- but not a great reality.

That said, no one has argued that a word separator in names is a bad idea and we have two choices: capitalizingEachWord and underscores_between_words. These work well enough that the idea of breaking every single Python program that uses subtraction just because one person believes we are being antediluvian -- without any evidence -- is just not going to happen. (Ooh. See what I did there. I typed two hyphen-minus characters to get an "em dash" and you probably didn't even notice that I was breaking centuries of tradition that the only proper way to write an em dash is with a single piece of metal type.)

If you want to make serious contributions to Python or any other project you need to understand why this is a bad idea.
Yes, if someone insisted that emoticons were superior to underscores as separators and implied I was an idiot for not agreeing with that. --- Bruce

Mikhail V writes:
No, the idea is *not* bad, it's just not for Python. As has been true for every one of your ideas for language tweaks that I can recall.

There are *millions* of Python programmers by now. There are more lines of Python being written and read in a day than you could write or read in your lifetime. It's just not practical to *change* the meaning of valid lexical constructs this way, and the rules you want could easily have edge cases that confuse a lot of people. We have a lot of experience with such edge cases, both in Python ("else" clauses on loops, and Python 3 itself, come to mind) and out. We don't like them, as a rule, and introduce them only when they allow a better expression of something that is quite awkward without them, and preferably only when they express new semantics (ie, something previously impossible).

If it were just one idea, I'd say "suck it up, Mikhail, and get with the programming language". But your ideas are consistently superficially plausible, taking a few seconds of reflection to see that, yes, they could be done, but they are not going to be accepted in mainline Python. The problem with them is that you propose them for Python, not the specific ideas themselves.

The solution is as proposed earlier: create your own language. It shouldn't be excessively hard to write a preprocessor for "mvlang" targeting Python. It has historical precedent: that's how Stroustrup originally implemented C++. It allows smooth interchange of programs with people who know Python, no matter how much you add or change. If, having elaborated all your ideas into this new language, you find yourself unwilling to write in Python, then it's time to publish your language, because other people may feel the same level of attraction to it. But ... it *will* be a different language, not Python.
Regards, Steve (not speaking for any other Steves, Stevens, or Stephens)

On Tue, Nov 21, 2017 at 2:51 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Not every, but many, yes. And there is a plethora of proposals that are less plausible and not for Python. Anyway, I'd better stick to python-list for such topics. BTW, as per Serhiy Storchaka's note:

my·variable myᝍvariable myㅡvariable myⵧvariable myㄧvariable myㆍvariable

Is this a good idea *for Python*? I mean, this is not the Python that I knew. I don't know how it is possible. It looks like the result of some unlucky nuclear experiment. Maybe it will not cause any confusion, or less than a hyphen and a minus.
Not much interested in *my own language*. I have already made a simple translator for hyphens and minuses, and I enjoy it. If the new language thing happened and gained popularity, it would be the worst scenario: competing syntaxes, CO2 emissions, community splitting, etc. I don't endorse such ideas.

Mikhail

21.11.17 05:16, Mikhail V wrote:
Yes, it causes less confusion than changing the meaning of a minus. And yes, it can cause confusion if misused. As can using the following variables:

мyvariable mуvariable myvаriable myvarіable myvariаble myvariaЬle myvariab1e myvariablе

But the name моязмінна doesn't cause any confusion if used in an appropriate context (for example in a lesson for young Ukrainian children). I believe the above dot- and hyphen-like characters don't cause confusion if used as letters in an appropriate language context.
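Look-alike names of this kind are distinct identifiers as far as Python is concerned, and the standard `unicodedata` module can reveal which characters are involved. A small sketch, with `non_ascii_letters` as a made-up helper name and one Cyrillic letter written as an explicit escape so the difference is visible in the source:

```python
import unicodedata


def non_ascii_letters(name):
    """List (character, Unicode name) pairs for every non-ASCII character in `name`."""
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in name if not ch.isascii()]


latin = "myvariable"
mixed = "m\u0443variable"  # second letter is CYRILLIC SMALL LETTER U

print(latin == mixed)        # False: two different identifiers
print(mixed.isidentifier())  # True: still a legal Python name
print(non_ascii_letters(mixed))
```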

Mikhail V writes:
Given that 5 of 6 show up with the glyph for U+FFFD REPLACEMENT CHARACTER in my client, I'd say not (but then, I can always fix my mail client so don't mind me ;-).
It depends on how familiar people and tools are with Unicode. For example, after almost clicking on something from "Apple.co.jp" where the "A" is from the Cyrillic block, my mail program now highlights confusables (there's a list at Unicode.org) and also places where languages I don't use are present in the message preview. It really helps in detecting spam! But most people neither have that knowledge, nor the source code to their mail clients and browsers, to help them with those distinctions. Personally, I think that Python probably should ban non-ASCII non-letter characters in identifiers and whitespace, and maybe add them later in response to requests from native speakers of the relevant languages. I don't know how easy that would be to do, though, since I think the rule is already that identifiers must be composed only of letters, numbers, and ASCII "_". Since Serhiy's examples are valid, we'd have to rule them out explicitly, rather than by reference to the Unicode database. Yuck.
Think of it as evolution in action. So, languages evolve whether you do it yourself or not. It's not a question of you endorsing the idea. Eventually somebody will write a better language than Python. Why not you? The problem, if it's similar to an existing one, is the work it creates for the author of the new language. The worst possible case is something like Python 3. IIUC, Guido's opinion now is that looking back, Python 3 was the right thing to do at the time but he's never gonna do that again, too much work on explaining "why Python 3". The question would be, is it right for *you* to do it for a language with your favorite features? I don't say you *should*, just that you *could*. Regards, Steve

2017-11-21 12:55 GMT+01:00 Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp>:
That would be quite a backward-incompatible change since such identifiers have been legal since Python 3.0.
See: https://www.python.org/dev/peps/pep-3131/ The identifier syntax is <XID_Start> <XID_Continue>*. ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore. ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.
Since Serhiy's examples are valid, we'd have to rule them out explicitly, rather than by reference to the Unicode database. Yuck.
If we take this thinking to its logical extreme we should ban ASCII 1 and l since they can be confused. Also 0 and O. Realistically, this is extremely unlikely to be an issue in practice. If you have people making such malignant code changes with checkin permission, you have bigger problems... Anyway, you can have your linter enforce ASCII or whatever character subset you deem safe. Stephan
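A linter rule of the kind suggested above might look roughly like this. It is only a minimal sketch using the standard library; the function name `non_ascii_identifiers` is made up for illustration and is not taken from any existing linter.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Yield (line, column, name) for every identifier with non-ASCII characters."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; comments and strings
        # are separate token types, so their contents are never reported.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            yield tok.start[0], tok.start[1], tok.string


sample = "count = 1\nмоязмінна = 2  # a comment with ї is ignored\n"
for line, col, name in non_ascii_identifiers(sample):
    print(f"{line}:{col}: non-ASCII identifier {name!r}")
```

Because the check runs on tokens rather than raw text, non-ASCII characters in comments and string literals stay allowed, which matches the "identifiers only" policy being discussed.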

Serhiy Storchaka wrote:
Yes, it causes less confusion than changing the meaning of a minus.
If those chars are not used at all, then yes :) And I don't recall that I was exactly proposing to change the meaning of minus.
A single word written in a local language should not. But it's a perfect way to make the whole code look like a mess. I think it is a very interesting experience to use Cyrillic letters, since many are identical to Latin. So it would not be programming lessons in the first place, but rather constant changing of keyboard layout, and then trying to find unexplainable errors. Mikhail

Hi all, If anybody is still worried about this, here is a 29-line proof-of-concept code checker which warns if your source file contains identifiers which are different but look the same. https://gist.github.com/stephanh42/61eceadc2890cf1b53ada5e48ef98ad1 Stephan 2017-11-21 19:19 GMT+01:00 Mikhail V <mikhailwas@gmail.com>:
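For readers who cannot open the gist, the general idea of such a checker can be sketched as below. This is not the code from the gist: the `LOOKALIKES` table is a tiny hand-picked sample (a real tool would more likely use the confusables data published by Unicode), and the names `skeleton` and `lookalike_groups` are hypothetical.

```python
import io
import tokenize
import unicodedata
from collections import defaultdict

# Illustrative and deliberately incomplete: a few Cyrillic letters that
# resemble Latin ones in many fonts.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
})


def skeleton(name):
    """Reduce a name to a rough 'what it looks like' form."""
    return unicodedata.normalize("NFKC", name).translate(LOOKALIKES)


def lookalike_groups(source):
    """Return groups of distinct identifiers that share the same skeleton."""
    groups = defaultdict(set)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            groups[skeleton(tok.string)].add(tok.string)
    return [sorted(names) for names in groups.values() if len(names) > 1]


sample = "myvariable = 1\nmyv\u0430riable = 2\nother = 3\n"
for group in lookalike_groups(sample):
    print("these identifiers look alike:", [ascii(n) for n in group])
```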

Mikhail V writes:
A single word written in a local language should not. But it's a perfect way to make the whole code look like a mess.
Alex Martelli wrote a couple of interesting posts about his experiences with multilingual comments back in the discussion of PEP 263. One of them involved a team from Israel, I think, or maybe South Africa. If you Google site:mail.python.org for Alex and those countries, the thread would probably come up.

On Thu, Nov 23, 2017 at 02:24:16PM +0900, Stephen J. Turnbull wrote:
Either my google-fu is failing or your memory failed you. I've spent an hour and a half googling, and I'm getting nothing relevant. It doesn't help that Alex was a very prolific poster back in the day. Hell, I can't even find where PEP 263 was discussed, apart from a brief discussion here: https://mail.python.org/pipermail/python-dev/2002-July/026449.html which Alex didn't take part in. (And that was a year after the PEP was first created.) https://www.python.org/dev/peps/pep-0263/ -- Steve

On 2017-11-24 00:49, Steven D'Aprano wrote:
I think your search might have been hindered by your ability to spell. :-) Multibyte Character Surport for Python http://grokbase.com/t/python/python-list/0258fms6xa/multibyte-character-surp...

Just for the record, there is also another hyphen, called "soft hyphen", U+00AD. Main difference is that in some software it is an 'interpreted' symbol, and thus may simply disappear from the screen in such software, so it cannot be surely defined as a printable character. OTOH the benefit is that it is 100% present in any font, afaik. I have found a good technical summary about this character: http://jkorpela.fi/shy.html

On 21 November 2017 at 21:55, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
We're not going to start second-guessing the Unicode Consortium on this point - human languages are complicated, and we don't have any special insight on this point that they don't. https://www.python.org/dev/peps/pep-3131/#specification-of-language-changes delegated this aspect of the language to them by way of the XID_Start and the XID_Continue categories, and we're not going to change that. Any hybrid Python 2/3 application or library is necessarily restricted to ASCII-only identifiers, since that's all that Python 2 supports. We've also explicitly retained the ASCII-only restriction for PyPI distribution names (see https://www.python.org/dev/peps/pep-0508/#names), but that doesn't restrict the names used for import packages, only the names used to publish and install those components. If we ever decide to lift that restriction, it will likely be by way of https://en.wikipedia.org/wiki/Punycode, similar to the way internationalized domain names work, as well as the way multi-phase extension module initialization locates init functions for extension modules with non-ASCII names. Beyond that, I'll note that these questions were all raised in the original PEP: https://www.python.org/dev/peps/pep-3131/#open-issues The reference interpreter really isn't the place to experiment with answering them - rather, they're more a question for opt-in code analysis, since that makes it possible for folks to choose settings that are right *for them* (e.g. by defining a set of "permitted scripts" [1], specifying the Unicode characters that should be allowed in identifiers beyond the core set of "Latin" code points allowed by ASCII) Cheers, Nick. [1] https://en.wikipedia.org/wiki/Script_(Unicode) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
Agreed. Python, however, is NOT a (natural) human language, and the Unicode Consortium definition of conformance does NOT prohibit subsetting appropriate to the purpose. We DO know more than the Unicode Consortium about Python. For example, I suspect that your catholic appetite for XID in identifiers does not apply to syntactic keywords or names of builtins.
I agree about experimentation. I'm not in a hurry, since I've only seen IDEOGRAPHIC SPACE and full-width ASCII break Python programs once or twice in the ten years I've been teaching my students to use it. Steve

On 23 November 2017 at 16:34, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
We already have a stricter ASCII-only naming policy for the standard library: https://www.python.org/dev/peps/pep-3131/#policy-specification That's different from placing additional constraints on end-user code, though, as that's where the line between "programming language" and "natural language" gets blurry (variable, function, attribute, and method names are often nouns and verbs in the author's language, and this is also the case for data-derived APIs like pandas column names) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, Nov 23, 2017 at 4:46 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Well, then there is some bitter irony in this, so it allows pretty much everything, but does not allow me to beautify code with hyphens. I can fully understand the wish to use non-latin scripts in strings or comments. As for identifiers, IMO, apart from latin letters and underscore, the first unicode candidate I would add is U+2010. And probably the LAST one I would add. Mikhail

On Fri, Nov 24, 2017 at 1:10 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Fortunately for the world, you're not the one who decided which characters were permitted in Python identifiers. The ability to use non-English words for function/variable names is of huge value; the ability to use a hyphen is of some value, but not nearly as much. Can this thread move to python-list? Or, better, to python-rants-about-unicode-list, to which I don't subscribe? ChrisA

On Thu, Nov 23, 2017 at 02:29:43PM +0000, Carl Smith wrote:
Can't we just tell everyone to speak US English, and go back to ASCII? It would be a less painful migration.
I trust you're joking, but it makes me twitchy to see people saying that even in jest, because I've come across folks who *actually do* think that way. -- Steve

Mikhail V wrote:
In reality, hyphen and Minus sign are not even closely similar - Minus is ca. twice as wide,
If you are using a font that distinguishes them that clearly, and if the human reader is sufficiently typographically aware to notice the distinction. Both of those are big ifs. And what about handwritten code? I don't know anyone who handwrites hyphens and minuses with precision measured in points. -- Greg

I guess for reference:
exec('a\N{MIDDLE DOT} = 0')
exec('\N{BUHID LETTER RA} = 1')
exec('\N{HANGUL LETTER EU} = 2')
exec('\N{TIFINAGH LETTER YO} = 3')
exec('\N{BOPOMOFO LETTER I} = 4')
exec('\N{HANGUL LETTER ARAEA} = 5')
On Sun, Nov 19, 2017 at 1:38 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:

You think that's bad? https://github.com/reinderien/mimic/blob/master/README.md Abandon all hope ye who use Unicode. On 19 Nov 2017 12:06, "Antoine Pitrou" <solipsis@pitrou.net> wrote:

On 19/11/2017 11:20, Stephan Houben wrote:
You think that's bad? https://github.com/reinderien/mimic/blob/master/README.md
Abandon all hope ye who use Unicode.
On 19 Nov 2017 12:06, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
You can get exactly the same effect by typing command line examples into MS-Word or Outlook and letting people copy and paste to the command line, as both programs "helpfully" and erratically convert double quotes to open/close curly double quotes and minus into hyphen, etc. I have had a large number of cases where this has happened, and I even start many of my utilities that are deployed to others with a "fix parameter issues" function that undoes such substitutions in all of sys.argv, just to cut down on the number of "it doesn't work" complaints. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.
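A function of the kind described above could be sketched like this. It is a guess at the general shape, not the actual utility code; the name `fix_pasted_args` and the exact set of characters mapped are assumptions and would need adjusting to whatever substitutions show up in practice.

```python
import sys

# Typographic characters that word processors tend to substitute,
# mapped back to their plain ASCII counterparts (assumed set).
_ASCII_FIXES = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2010": "-",  # HYPHEN
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
})


def fix_pasted_args(argv):
    """Return a copy of argv with typographic quotes and dashes made ASCII."""
    return [arg.translate(_ASCII_FIXES) for arg in argv]


# Typical use at the top of a script, before any argument parsing:
#     sys.argv[1:] = fix_pasted_args(sys.argv[1:])
print(fix_pasted_args(["\u2013v", "\u201chello world\u201d"]))
```

Mapping every dash to a single "-" is a simplification: a pasted em dash may originally have been "--", so a real tool might need a smarter rule for long options.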

In summary, this proposal seems to be: Give two visually indistinguishable characters different meanings to improve readability. I'm not sure, but something about that sentence doesn't seem quite right. -- Greg

On Sun, Nov 19, 2017 at 1:01 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Since you can already avoid camelCase by using snake_case, I'm not sure how much you really gain by adding the hyphen.
While I agree with "my-variable", I don't like the triple hyphen. What's the benefit?
Both of these create extremely confusing situations, where two nearly-identical symbols have completely different meanings. Solution 2 is a massive backward-compatibility break. You're not just disallowing something that's been legal since the language was introduced - you're giving it a completely different meaning. That's basically a non-starter right there. Solution 1 is at least reasonably plausible, in that you're taking something that's currently a SyntaxError and giving it a valid meaning. There is no code that could be broken by that (AFAIK). However, there's still the problem that you're introducing a marginal benefit and a significant confusion potential; plus, you'd be adding a special case to the Unicode identifier rules, which is not something to be done lightly. How much benefit do you REALLY get from using hyphens rather than underscores? ChrisA

On 19 November 2017 at 12:01, Mikhail V <mikhailwas@gmail.com> wrote:
Regardless of any potential readability merits, backwards compatibility requirements combined with the use of the hyphen character as a binary operator prohibit such a change:
>>> my = variable = 1
>>> my-variable
0
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 19 November 2017 at 12:32, Nick Coghlan <ncoghlan@gmail.com> wrote:
Ah, sorry - I now see you addressed the basic version of that. The alternative of "Use a character that computers can distinguish, but humans can't" isn't an improvement, since it means introducing the exact kind of ambiguity that Python seeks to avoid by using indentation for block delimiters (rather than having the computer read braces, and humans read indentation). The difficulty of reliably distinguishing backticks from regular single quotes is also the main reason they're generally discounted from reintroduction for any other use case after their usage as an alternative to the repr builtin was dropped in Python 3.0, and it's why Python 3 prohibits mixing tabs and spaces for indentation by default. For anyone tempted to suggest "What about multiple hyphens indicating continuation of the variable name?", that's still a compatibility problem due to the unary minus operator:
>>> my--variable
2
>>> my---variable
0
Would hyphens in variable names improve readability sometimes? Potentially, but not enough to live with making binary subtraction expressions ambiguous (hence the consistency amongst almost all current text based programming languages on this point). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
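The parse that produces those results can be inspected with the `ast` module. The snippet below is a small sketch; the helper name `negation_depth` is made up for illustration.

```python
import ast


def negation_depth(expr):
    """Count the unary minuses wrapped around the right operand of a subtraction."""
    node = ast.parse(expr, mode="eval").body
    # Today every one of these spellings is a binary subtraction at the top level.
    assert isinstance(node, ast.BinOp) and isinstance(node.op, ast.Sub)
    depth, right = 0, node.right
    while isinstance(right, ast.UnaryOp) and isinstance(right.op, ast.USub):
        depth, right = depth + 1, right.operand
    return depth


my = variable = 1
for expr in ("my-variable", "my--variable", "my---variable"):
    print(f"{expr}: right operand negated {negation_depth(expr)} time(s), "
          f"value {eval(expr)}")
```

Each extra hyphen is read as one more unary minus on `variable`, which is why all three spellings are already valid expressions with a defined meaning.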

On Sun, Nov 19, 2017 at 3:42 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That seems to be another showcase of the misfortune that Python uses hyphen for the minus operator. I know it is not the language designer's fault, because basic ASCII simply did not include a minus character. But do you realise that the **current** problem you are addressing is that font designers forgot to make the minus character (in monospaced fonts) distinct from the hyphen character? Well, what can I say, I just think it should be a reason to make a collective complaint to font providers, rather than silently accepting this and adapting the language design to someone's sloppy font design. As an aid for monospace die-hards, to minimise the confusion one could publish a style guide that recommends enclosing the minus operator (currently the hyphen char) in spaces, like a - b, and probably disallow the newly proposed hyphen character at the beginning of identifiers. That would still leave potential for confusion because you can't force everyone to follow style guides, but one should struggle to break from this cycle anyway.
Would hyphens in variable names improve readability sometimes?
For reading code, indeed, always and very much. Of course not in case I would be forced to use monospaced font with a similar minus and hyphen. But in that case I am already accepting the level of readability of 12th century, so this would not make things much worse, and I would simply put spaces around the minus operator and try to highlight it with some strong color. Mikhail

Python does not use U+2010 HYPHEN for the minus operator, it uses the U+002D (-) HYPHEN-MINUS. In some monospace fonts, there is a subtle difference between U+002D, U+2013 EN DASH, and U+2014 EM DASH, but it's usually hard to tell them *all* apart. If you want to make a proposal, I'd suggest that you limit it to allowing the U+2010 HYPHEN to be used for names. U+002D simply cannot be changed because it would break billions of lines of code. On Sat, Nov 18, 2017 at 10:44 PM, Mikhail V <mikhailwas@gmail.com> wrote:
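The characters mentioned above can be told apart programmatically even when a font renders them alike. A small sketch using `unicodedata` (the helper name `describe` is made up for illustration):

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category, usable inside a name?)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ("my" + ch + "variable").isidentifier(),
    )


# HYPHEN-MINUS, HYPHEN, EN DASH, EM DASH, MINUS SIGN
for ch in "\u002d\u2010\u2013\u2014\u2212":
    print(*describe(ch))
```

None of these characters is currently accepted inside an identifier, so allowing U+2010 in names would be a new special case rather than something the existing identifier rules already cover.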

On 19/11/2017 05:01, Nick Timkovich wrote:
How about allowing ¬, (ASCII 172, U+00ac, NOT sign), in variable names as in my¬variable - it has the advantages that:
- it is visually distinguishable even in mono-spaced fonts, (personally I use mono-spaced all of the time when programming but I know that I am a dinosaur),
- it is actually on many keyboards as a single character, (I don't know of any which actually produce different characters for minus on the numeric keypad and hyphen elsewhere), so can be typed as a single key press,
- it is generally unused AFAIK other than in papers about logic,
- it is currently unused in the Python language.
This might upset some who would like to use it to replace the unary not operator but I suspect that it would be far fewer people than the potential breakages discussed so far.
-- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.

There is an unfortunate ambiguity in using a character that means "not" as a word separator: nuke.do¬launch() "But... I called the method which explicitly did *not* launch the nuke!" Stephan

On 11/19/17 1:33 AM, Steve Barnes wrote:
How about allowing ¬, (ASCII 172, U+00ac, NOT sign), in variable names as in my¬variable - it has the advantages that:
There is NO such character in ASCII. ASCII is a 7 bit character set, and no ASCII code has a value bigger than 127. There are a number of Extended ASCII character sets (hundreds if not thousands). One common one is ISO 8859-1 also called ISO LATIN-1 which has this character at this location, but Extended ASCII is NOT ASCII (Note, it is even produced by a very different standards body). The character also occurs here in ANSI Extended ASCII, but again, this is NOT ASCII. -- Richard Damon
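The point is easy to check from Python itself; a quick sketch:

```python
not_sign = "\u00ac"  # NOT SIGN

print(ord(not_sign))               # 172, i.e. above the 7-bit ASCII limit of 127
print(not_sign.isascii())          # False
print(not_sign.encode("latin-1"))  # a single byte in ISO 8859-1

try:
    not_sign.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode as ASCII:", exc.reason)
```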

On Sat, Nov 18, 2017 at 8:44 PM, Mikhail V <mikhailwas@gmail.com> wrote:
It is not a misfortune or even true that Python uses hyphen for minus. The name of the character used in Python is HYPHEN-MINUS. http://unicode.org/cldr/utility/character.jsp?a=002D It is both a hyphen and a minus. And it served double-duty even in ASCII. A language that requires using characters not present on standard keyboards is unlikely to be successful. Or we would all be programming in APL. And it's not as if no one ever thought of this before. Maybe you've heard of COBOL?
Would hyphens in variable names improve readability sometimes?
For reading code, indeed, always and very much.
No it wouldn't. Your personal preference is hardly authoritative. I am extremely skeptical that a legitimate usability study would find that record-count is better than record_count. There are studies that monospace fonts are harder to read than proportionally spaced, e.g., http://journals.sagepub.com/doi/pdf/10.1177/001872088302500303. Yet many programmers use monospace fonts because the advantages -- in our opinions -- outweigh the disadvantages. And the reality is that only my opinion matters when I'm choosing the fonts to display my code in, not yours. You-know-what-really-would-increase-readability? Allowing-the-use-of-spaces-in-variable-names. As-you-can-see-from-this-example-hyphens-between-words-decreases-readability. And because spaces between words is mostly not valid syntax currently, this change would be easier to introduce than breaking every single program out there by re-purposing hyphen-minus. But I'm not seriously proposing this because I think the modest benefits are outweighed by the many problems it would introduce. --- Bruce

On Nov 18 2017, Bruce Leban <bruce-lcXLltxty2U@public.gmane.org> wrote:
Luckily, there is a compromise: use backticks to quote identifiers: `test mode` = True if `test mode`: `display message`("just a test") I'm not seriously suggesting that, but I still wonder what people think about it. I sort of like it, actually. The `(" part is pretty ugly (which is why I included it in the example), but there's no syntax that can completely avoid ugly corner cases. I think in most cases the context would also make it easy to distinguish single quotes and backticks even when they're typographically similar. Cheers, -Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

Chris A wrote:
Both of these create extremely confusing situations, where two nearly-identical symbols have completely different meanings.
In reality, hyphen and Minus sign are not even closely similar - Minus is ca. twice as wide, however the citizens of the Monospaced Kingdom may disagree ; ) Though I think its population will dramatically decrease in one or two decades.
Solution 2 is a massive backward-compatibility break.
Yep, although elimination of improper usage is always a good thing in the longer term (and it means fewer new additional chars). But I do realise that it is a non-starter.
... a marginal benefit ... How much benefit do you REALLY get from using hyphens rather than underscores?
IMO it's far higher than marginal, at least compared to most syntax proposals I remember. One of the hardest and most important tasks a programmer faces is making readable variable names. Underscores are still one of the MOST ugly things I observe currently in Python syntax. This means that if this is fixed, there will be only "small warts" left (such as e.g. single quotes). For me, one "cheap" solution against underscores is to use syntax highlighting which grays them out, but if those become like spaces, then it becomes a bit confusing, e.g. in a function with many arguments. Also, unfortunately, not many editors allow easy (if any) highlighting customisation on that level. One possible solution is to use a custom font that has a hyphen instead of the underscore, but this is not a proper solution, because, well, the character standard is still there, regardless of whether I like it or not. And one should still have an alternative, i.e. *not only one* separator, for example to denote something "special". Also it can enrich some semantical emphasis, e.g.: my-variable_global Mikhail

On 19 November 2017 at 13:22, Mikhail V <mikhailwas@gmail.com> wrote:
Changing the way editors display underscore-using variable names still seems like a more productive direction to explore than changing the text encoding read by the compiler. The current source code structure is well-defined and unambiguous, so there's no clear benefit to change things at that level, and significant downsides in terms of complexity, forwards and backwards compatibility concerns, and high barriers to pervasive adoption. By contrast, if the argument for using a different Unicode character is "Editors will reliably display Unicode hyphen characters differently from the way they display minus signs (or vice-versa)", then we can just as easily say "If users are finding the way that text editors display snake_cased_names to be consistently hard to read, then text editors should change the way that they display snake_cased_names (or at least make it easy for users to opt-in to displaying them differently)". For example, they could decide to replace underscores in variable names for display purposes with hyphens plus the underscore combining diacritic, or the combining macron below: - https://en.wikipedia.org/wiki/Underline#Unicode - https://en.wikipedia.org/wiki/Macron_below Then when the cursor was placed inside the variable name, they could revert to displaying those characters as regular underscores. This kind of editor level modification would also extend itself well to underscores in numeric literals, as there the appropriate pseudo-separator shown when the literal wasn't being edited would be locale dependent. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
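To make the editor-side idea concrete, here is a rough sketch of a display-only transformation. It uses `tokenize` so that only identifiers are touched, and it simply swaps inner underscores for U+2010 HYPHEN rather than the combining characters mentioned above. The function name `display_form` is hypothetical, and no existing editor is claimed to work this way.

```python
import io
import tokenize

HYPHEN = "\u2010"  # shown on screen only; the file on disk keeps its underscores


def display_form(source):
    """Return source text with inner underscores of identifiers shown as hyphens."""
    lines = io.StringIO(source).readlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.NAME:
            continue  # strings, comments and operators are left alone
        name = tok.string
        core = name.strip("_")  # keep leading/trailing underscores (_x, __init__)
        if "_" not in core:
            continue
        lead = len(name) - len(name.lstrip("_"))
        shown = name[:lead] + core.replace("_", HYPHEN) + name[lead + len(core):]
        row, col = tok.start
        # Replacement has the same length, so later column offsets stay valid.
        lines[row - 1] = lines[row - 1][:col] + shown + lines[row - 1][col + len(name):]
    return "".join(lines)


print(display_form('record_count = 1  # keep_this\ns = "snake_case"\n'))
```

Working at the token level is what keeps string literals and comments unchanged, which is the part a plain character translation table cannot do.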

On Sun, Nov 19, 2017 at 5:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Indeed that would be a solution. *Would* be. But I don't know of any editor that does that afaik (and they should not in this case, see below). My view on pros&cons for this solution:
Pros: other languages also have the same issue, so if editor maintainers would agree to compromise and introduce a feature of dynamic substitution, that would give users the possibility to face-lift other syntaxes as well.
Cons: this feature would make sense only if the substitution happens in those parts where it should, namely it should not touch anything in string literals or comment blocks. So the lexer should 'know' where to substitute or not, and it is not the same as just passing the internal memory representation through a translation table.
My opinion about this however is based on other principles. Imagine that you are the language designer and I am responsible for the typesetting component of some editor, and we have such a dialogue:
you: "hey Mikhail, we use hyphen for minus operator, now can you please patch the renderer so that our users see the minus instead of hyphen, and please make sure users can also toggle it in real time to see what actual char is there and also make the substitution only in the places where hyphen is used as the operator."
me: "well, I understand your complaint, but my renderer already supports Unicode, and I do my best to support typography practices, namely render hyphen as *hyphen*, which is well established for centuries in typography, and defined as a dash of 50% width of the letter "o" and is aligned to lowercase. As well as the Minus glyph which is defined as ca. 110% of "o" width and is aligned to the digits&caps. So you as the language designer should be interested to deliver best practices to the users, and hyphen is way more important for the lexical structure of the written language, than the minus operator. Why would not you just try to solve the issue in a "fair" way?"
By the fair way I understand the way which tends to bring the correct usage of characters back, instead of trying to hide the problem with some patch. Now I can't say what is the least problematic way for Python, but if I were responsible for that, I would base the solution on these principles:
1. The future versions of syntax, ideally, must allow ONLY minus U+2212 for the minus operator, and allow hyphens U+002D in identifiers. Since that is impossible at the moment, I must think out the least painful transition.
2. I want users to be able to use underscore as well. Underscore is derived from the mechanical type-writers - to make an underlined text one pushed the carriage back and typed the underscore to make the line under the text. Currently in digital print it does not make much sense and as a separator looks ugly, but still it is not so hopeless. Currently the underscore lies below the font baseline but if one makes it closer to the baseline, then it can be used as a fairly adequate additional separator, so a user would get more ways to denote lexical identifiers.
3. I don't want to break backward compatibility but still I am oriented on compliance with typography practices and standards for charcodes. Also I want users who are interested in better UX to get the benefits out-of-the-box, without forcing them to tweak their text editors or write their own translators.
What to do? One option IMO would be to introduce a header in the sources, e.g.: # opt-in: hyphen-minus Which would tell the parser to toggle the "new" rules, namely U+2212 would be parsed as minus operator and hyphens as part of identifiers. Then users who are aware of the benefits and remember monospaced fonts only as an unpleasant incident from their youth can enjoy the beauty of source code without any tweaks, and the only thing they need to do is to bind a key to input the U+2212 sign. The users who do not want it just leave this out.
Further, I'd add a command-line util that can directly translate to the "old" syntax, in case one wants to export a project in old syntax. So one could avoid the backward compatibility issue. That is just one option that comes to my mind. Another thing which might be important in this regard: Say you want to publish a book about Python. With such syntax you could directly import the code into a DTP software, and you don't need to make any corrections, so it looks almost as a normal English text, and no worries about strange looking minus operators. Mikhail
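A translator back to the "old" syntax, as described above, could be sketched roughly as follows. This is not the author's tool; it is a naive regex-based illustration with a hypothetical function name `to_classic`. It assumes identifiers are ASCII, that strings are simple single-line literals, and that a hyphen-minus joining two names is always a separator.

```python
import re

_TOKEN = re.compile(
    r"""
      (?P<skip>  \#[^\n]*                   # comment, left untouched
               | "(?:\\.|[^"\\\n])*"        # double-quoted string, left untouched
               | '(?:\\.|[^'\\\n])*' )      # single-quoted string, left untouched
    | (?P<name>  (?<!\w)[A-Za-z_]\w*(?:-+[A-Za-z_]\w*)+ )   # hyphenated identifier
    | (?P<minus> \u2212 )                   # MINUS SIGN used as the operator
    """,
    re.VERBOSE,
)


def to_classic(source):
    """Translate the proposed 'new' spelling into today's Python spelling."""
    def fix(match):
        if match.lastgroup == "name":
            return match.group().replace("-", "_")  # my-variable -> my_variable
        if match.lastgroup == "minus":
            return "-"                              # U+2212 -> hyphen-minus
        return match.group()                        # strings and comments as-is
    return _TOKEN.sub(fix, source)


print(to_classic("figure-shape = a-b \u2212 1  # x-y stays\ns = 'p-q'\n"))
```

Triple-quoted strings, f-strings and names that already mix both separators are not handled properly here, so this only shows the shape of the idea.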

On Mon, Nov 20, 2017 at 11:01 AM, Mikhail V <mikhailwas@gmail.com> wrote:
The least painful transition is to devise an entirely new language, one that is built around whatever rules you like. That way, there's no backward compatibility problem - you pick a new file extension, a new executable name, etc, etc, and nobody gets confused. Of course, since actually building a cross-platform language interpreter is a ton of work, and getting an ecosystem of libraries is even more work, you'll want to make your language compile to Python, but *in your source code* you can use whatever symbols you want. Since you want U+2212 for subtraction, you probably want to use a few other non-ASCII operators too. U+2044 FRACTION SLASH presents itself as a viable way to create a fractions.Fraction literal. Instead of * and @ for multiplication, you could have U+00D7 and... uhh, I'm not a mathematician, but I'm sure there's an appropriate character. For the most part, you'd have code that is trivially transformable to and from Python. Start by writing the "my language to Python" translator (it can throw away comments and stuff, the Python code should be considered "object code" rather than "source code"), and then look into the reverse transformation for the benefit of people trying to learn your language. As long as you don't actually call your language "Python", you're free to do what you like without worrying about compatibility etc. ChrisA

On 2017-11-20 00:20, Chris Angelico wrote:
If we must use U+2212 (MINUS SIGN) for the minus sign, then it's only right that we must also use U+2010 (HYPHEN) for the hyphen. U+002D (HYPHEN-MINUS) can be left alone, its meaning depending on the programming language, as at present.

Bruce Leban wrote:
It is not a misfortune or even true that Python uses hyphen for minus. The name of the character used in Python is HYPHEN-MINUS.
This is pure demagogy, name it HYPHEN-MINUS-TINYDASH if you like, but what aspect of reality does it change apart from its name? "Hyphen-minus" would make sense for mechanical type-writers. So it is a hyphen, a character used for centuries before typewriters even appeared, and used as such now in 99 percent of media. Just take some Python sources and count the number of underscores and minus operators. This will give you an image of how important separators are compared to the minus operator. Don't forget also to include cases where variables are written without any separator, but should have one.
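The count suggested above is easy to run on real files. A minimal sketch with `tokenize` follows; the function name `count_separators` is made up, and it counts unary minus and `-=` together with binary minus, since the tokenizer alone cannot tell unary from binary.

```python
import io
import tokenize


def count_separators(source):
    """Return (underscores used as word separators, minus operators)."""
    underscores = minuses = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Ignore leading/trailing underscores such as _private or __init__.
            underscores += tok.string.strip("_").count("_")
        elif tok.type == tokenize.OP and tok.string in ("-", "-="):
            minuses += 1
    return underscores, minuses


sample = "record_count = total_size - used\nx -= 1\n__init__ = -x\n"
print(count_separators(sample))
```

To survey a code base, one could read each `.py` file and sum the pairs; names written without any separator would need a separate, more subjective count.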
I am extremely skeptical that a legitimate usability study would find that record-count is better than record_count.
Oh come on, you probably also want a study for emoticons as separators? On Sun, Nov 19, 2017 at 5:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On Mon, Nov 20, 2017 at 11:18 AM, Mikhail V <mikhailwas@gmail.com> wrote:
If you want to. But a simple a-b test (or is that an a_b test?) of hyphens and underscores would be sufficient. For anecdotal evidence, I prefer to write git branch names with hyphens, eg "git checkout rosuav/process-check-run". It's not about the typing (tab completion means I don't have to type either form), it's about the way it looks. So there definitely is _some_ advantage here. I just don't think it's significant, not worth the hassle of changing things around. And this is still ASCII-only. ChrisA

On Sun, Nov 19, 2017 at 4:18 PM, Mikhail V <mikhailwas@gmail.com> wrote:
You've gone from making a bad suggestion to trolling. While you may *think* this character is a hyphen, you're simply wrong. When ASCII was created it was a 7-bit character set limited to 94 printable characters plus space and 33 control characters. The designers explicitly added a single double-duty character for both hyphen and minus, just as they added a single character for single and double quotes rather than left and right quotes. Were they mimicking the typewriter? Maybe they were following the example of Hollerith code which only had uppercase. It doesn't matter. It's not that they were unaware of the different uses or the existence of typographic quotes. Just as monospace fonts were not created because people didn't know about variable width fonts. It is what it is. And pretending other people are idiots is inappropriate. You can use accent grave as a left quote and apostrophe as a right quote if you want to, but if you insist that Python is living in the dark ages because it doesn't do things *your way* then you're just being rude. ... render
False. There is no standard going back centuries defining the widths of the different kinds of dashes. For that matter, there is no standard *today* for what letters and symbols look like. See Doug Hofstadter's great paper on this https://web.stanford.edu/group/SHR/4-2/text/hofstadter.html or the Unicode consortium list of emoji https://unicode.org/emoji/charts/full-emoji-list.html for great examples of the non-standard nature of typography. Heck almost all vendors put cheese on the *hamburger* emoji when obviously it only belongs on the cheeseburger emoji. And Google puts the cheese below the meat which is clearly wrong as the international standard for cheeseburgers puts the cheese on the top.
Just take some Python sources and count the amount of underscores and minus operators. This will give you an image of how important separators are compared to minus operator.
A non sequitur. Count the number of instances of the letter Z in English vs. the letter E which tells you that Z is unimportant. So let's get rid of it. Of course that may piss off the Polish people since it's the 9th most frequent letter (4.9%) in Polish. While this makes a great story -- see "Meihem In Ce Klasrum" http://www.tau.ac.il/~pauzner/funs/simpler.html -- it does not make a great reality. That said, no one has argued that a word separator in names is a bad idea and we have two choices: capitalizingEachWord and underscores_between_words. These work well enough that the idea of breaking every single Python program that uses subtraction just because one person believes we are being antediluvian -- without any evidence -- is just not going to happen. (Ooh. See what I did there. I typed two hyphen-minus characters to get an "em dash" and you probably didn't even notice that I was breaking centuries of tradition that the only proper way to write an em dash is with a single piece of metal type.) If you want to make serious contributions to Python or any other project you need to understand why this is a bad idea.
Yes, if someone insisted that emoticons were superior to underscores as separators and implied I was an idiot for not agreeing with that. --- Bruce

Mikhail V writes:
No, the idea is *not* bad, it's just not for Python. As has been true for every one of your ideas for language tweaks that I can recall. There are *millions* of Python programmers by now. There are more lines of Python being written and read in a day than you could write or read in your lifetime. It's just not practical to *change* the meaning of valid lexical constructs this way, and the rules you want could easily have edge cases that confuse a lot of people. We have a lot of experience with such edge cases, both in Python ("else" clauses on loops, and Python 3 itself, come to mind) and out. We don't like them, as a rule, and introduce them only when they allow a better expression of something that is quite awkward without them, and preferably only when they express new semantics (ie, something previously impossible). If it were just one idea, I'd say "suck it up, Mikhail, and get with the programming language". But your ideas are consistently superficially plausible, taking a few seconds of reflection to see that, yes, they could be done, but they are not going to be accepted in mainline Python. The problem with them is that you propose them for Python, not the specific ideas themselves. The solution is as proposed earlier: create your own language. It shouldn't be excessively hard to write a preprocessor for "mvlang" targeting Python. It has historical precedent: that's how Stroustrup originally implemented C++. It allows smooth interchange of programs with people who know Python, no matter how much you add or change. If, having elaborated all your ideas into this new language, you find yourself unwilling to write in Python, then it's time to publish your language, because other people may feel the same level of attraction to it. But ... it *will* be a different language, not Python.
Regards, Steve (not speaking for any other Steves, Stevens, or Stephens)
--
Associate Professor, Division of Policy and Planning Science
Faculty of Systems and Information, University of Tsukuba
Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
http://turnbull/sk.tsukuba.ac.jp/
Email: turnbull@sk.tsukuba.ac.jp
Tel: 029-853-5175

On Tue, Nov 21, 2017 at 2:51 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Not every, but many, yes. And there is a plethora of proposals less plausible and not for Python. Anyway, I'll stick to python-list for such topics. BTW, as per Serhiy Storchaka's note: my·variable myᝍvariable myㅡvariable myⵧvariable myㄧvariable myㆍvariable ^ Is this a good idea *for Python*? I mean this is not the Python that I knew. I don't know how it is possible. Looks like the result of some unlucky nuclear experiment. Maybe it will not cause any confusion, or less than a hyphen and a minus would.
Not much interested in *my own language*. Simple translator for hyphens and minuses I have already made, and I enjoy it. If the new language thing happened and gained popularity, it would be the worst scenario: competing syntaxes, CO2 emissions, community splitting, etc. I don't endorse such ideas. Mikhail

21.11.17 05:16, Mikhail V wrote:
Yes, it causes less confusion than changing the meaning of a minus. And yes, it can cause confusion if misused. As well as using the following variables: мyvariable mуvariable myvаriable myvarіable myvariаble myvariaЬle myvariab1e myvariablе But the name моязмінна doesn't cause any confusion if used in an appropriate context (for example in a lesson for young Ukrainian children). I believe the above dot- and hyphen-like characters don't cause confusion if used as letters in an appropriate language context.

Mikhail V writes:
Given that 5 of 6 show up with the glyph for U+FFFD REPLACEMENT CHARACTER in my client, I'd say not (but then, I can always fix my mail client so don't mind me ;-).
It depends on how familiar people and tools are with Unicode. For example, after almost clicking on something from "Apple.co.jp" where the "A" is from the Cyrillic block, my mail program now highlights confusables (there's a list at Unicode.org) and also places where languages I don't use are present in the message preview. It really helps in detecting spam! But most people neither have that knowledge, nor the source code to their mail clients and browsers, to help them with those distinctions. Personally, I think that Python probably should ban non-ASCII non-letter characters in identifiers and whitespace, and maybe add them later in response to requests from native speakers of the relevant languages. I don't know how easy that would be to do, though, since I think the rule is already that identifiers must be composed only of letters, numbers, and ASCII "_". Since Serhiy's examples are valid, we'd have to rule them out explicitly, rather than by reference to the Unicode database. Yuck.
Think of it as evolution in action. So, languages evolve whether you do it yourself or not. It's not a question of you endorsing the idea. Eventually somebody will write a better language than Python. Why not you? The problem, if it's similar to an existing one, is the work it creates for the author of the new language. The worst possible case is something like Python 3. IIUC, Guido's opinion now is that looking back, Python 3 was the right thing to do at the time but he's never gonna do that again, too much work on explaining "why Python 3". The question would be, is it right for *you* to do it for a language with your favorite features? I don't say you *should*, just that you *could*. Regards, Steve

2017-11-21 12:55 GMT+01:00 Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp>:
That would be quite a backward-incompatible change since such identifiers have been legal since Python 3.0.
See: https://www.python.org/dev/peps/pep-3131/ The identifier syntax is <XID_Start> <XID_Continue>*. ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore. ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.
Since Serhiy's examples are valid, we'd have to rule them out explicitly, rather than by reference to the Unicode database. Yuck.
If we take this thinking to its logical extreme we should ban ASCII 1 and l since they can be confused. Also 0 and O. Realistically, this is extremely unlikely to be an issue in practice. If you have people making such malignant code changes with checkin permission, you have bigger problems... Anyway, you can have your linter enforce ASCII or whatever character subset you deem safe. Stephan
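A linter rule of the kind mentioned above could look roughly like the sketch below. It is not taken from any existing linter; the function name is invented, and the check simply reports NAME tokens that contain non-ASCII characters (`str.isascii` needs Python 3.7 or later).

```python
import io
import tokenize


def non_ascii_names(source):
    """Return (line number, name) pairs for names that are not pure ASCII."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


# The second assignment uses CYRILLIC SMALL LETTER A (U+0430) in place of "a".
sample = "myvariable = 1\nmyv\u0430riable = 2\n"
for line, name in non_ascii_names(sample):
    print(f"line {line}: non-ASCII identifier {name!r}")
```

A project that wants a wider character set could replace the `isascii()` test with its own predicate.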

Serhiy Storchaka wrote:
Yes, it causes less confusion than changing the meaning of a minus.
If those chars are not used at all, then yes :) And I don't recall I was exactly proposing changing the meaning of minus.
A single word written in a local language should not. But it's a perfect way to make whole code look like a mess. I think it is a very interesting experience to use Cyrillic letters, since many are identical to Latin. So it would not be programming lessons in the first place, but rather constant changing of keyboard layout, and then trying to find unexplainable errors. Mikhail

Hi all, If anybody is still worried about this, here is a 29-line proof-of-concept code checker which warns if your source file contains identifiers which are different but look the same. https://gist.github.com/stephanh42/61eceadc2890cf1b53ada5e48ef98ad1 Stephan 2017-11-21 19:19 GMT+01:00 Mikhail V <mikhailwas@gmail.com>:
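To give an idea of what such a checker does, here is an independent sketch; it is not the code from the gist. It assumes a tiny hand-written table of Cyrillic letters that resemble Latin ones (a real tool would use the Unicode confusables data instead), maps every identifier to a "skeleton", and reports skeletons shared by more than one distinct identifier.

```python
import io
import tokenize
from collections import defaultdict

# Hypothetical, deliberately tiny lookalike table (Cyrillic -> Latin).
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
})


def lookalike_groups(source):
    """Group distinct identifiers that collapse to the same skeleton."""
    groups = defaultdict(set)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            groups[tok.string.translate(LOOKALIKES)].add(tok.string)
    return {skel: sorted(names) for skel, names in groups.items() if len(names) > 1}


sample = "myvariable = 1\nmyv\u0430riable = 2\nprint(myvariable)\n"
for skeleton, names in lookalike_groups(sample).items():
    print(f"{skeleton}: {len(names)} different spellings look alike")
```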

Mikhail V writes:
A single word written in a local language should not. But it's a perfect way to make whole code look like a mess.
Alex Martelli wrote a couple of interesting posts about his experiences with multilingual comments back in the discussion of PEP 263. One of them involved a team from Israel, I think, or maybe South Africa. If you Google site:mail.python.org for Alex and those countries, the thread would probably come up.

On Thu, Nov 23, 2017 at 02:24:16PM +0900, Stephen J. Turnbull wrote:
Either my google-fu is failing or your memory failed you. I've spent an hour and a half googling, and I'm getting nothing relevant. It doesn't help that Alex was a very prolific poster back in the day. Hell, I can't even find where PEP 263 was discussed, apart from a brief discussion here: https://mail.python.org/pipermail/python-dev/2002-July/026449.html which Alex didn't take part in. (And that was a year after the PEP was first created.) https://www.python.org/dev/peps/pep-0263/ -- Steve

On 2017-11-24 00:49, Steven D'Aprano wrote:
I think your search might have been hindered by your ability to spell. :-) Multibyte Character Surport for Python http://grokbase.com/t/python/python-list/0258fms6xa/multibyte-character-surp...

Just for the record, there is also another hyphen, called "soft hyphen", U+00AD. Main difference is that in some software it is an 'interpreted' symbol, and thus may simply disappear from the screen in such software, so it cannot be surely defined as a printable character. OTOH the benefit is that it is 100% present in any font, afaik. I have found a good technical summary about this character: http://jkorpela.fi/shy.html

On 21 November 2017 at 21:55, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
We're not going to start second-guessing the Unicode Consortium on this point - human languages are complicated, and we don't have any special insight on this point that they don't. https://www.python.org/dev/peps/pep-3131/#specification-of-language-changes delegated this aspect of the language to them by way of the XID_Start and the XID_Continue categories, and we're not going to change that. Any hybrid Python 2/3 application or library is necessarily restricted to ASCII-only identifiers, since that's all that Python 2 supports. We've also explicitly retained the ASCII-only restriction for PyPI distribution names (see https://www.python.org/dev/peps/pep-0508/#names), but that doesn't restrict the names used for import packages, only the names used to publish and install those components. If we ever decide to lift that restriction, it will likely be by way of https://en.wikipedia.org/wiki/Punycode, similar to the way internationalized domain names work, as well as the way multi-phase extension module initialization locates init functions for extension modules with non-ASCII names. Beyond that, I'll note that these questions were all raised in the original PEP: https://www.python.org/dev/peps/pep-3131/#open-issues The reference interpreter really isn't the place to experiment with answering them - rather, they're more a question for opt-in code analysis, since that makes it possible for folks to choose settings that are right *for them* (e.g. by defining a set of "permitted scripts" [1], specifying the Unicode characters that should be allowed in identifiers beyond the core set of "Latin" code points allowed by ASCII) Cheers, Nick. [1] https://en.wikipedia.org/wiki/Script_(Unicode) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
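An opt-in "permitted scripts" check of the kind described above might be sketched as follows. Note the assumptions: the standard library does not expose the Unicode Script property, so this stand-in uses the first word of each letter's character name (for example "LATIN" or "CYRILLIC"), and the policy set and function names are invented for illustration.

```python
import unicodedata

# Hypothetical project policy: only Latin letters in identifiers.
PERMITTED_SCRIPTS = {"LATIN"}


def scripts_in(identifier):
    """Approximate the set of scripts used by the letters of an identifier.

    Digits and underscores are ignored; the script is guessed from the
    first word of the Unicode character name, which is only a heuristic.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_permitted(identifier, permitted=PERMITTED_SCRIPTS):
    """True if every letter belongs to one of the permitted scripts."""
    return scripts_in(identifier) <= set(permitted)


print(is_permitted("my_variable"))        # True
print(is_permitted("myv\u0430riable"))    # False: mixes Latin and Cyrillic
```

A team writing in Ukrainian, say, could pass `{"LATIN", "CYRILLIC"}` as the permitted set instead.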

Nick Coghlan writes:
Agreed. Python, however, is NOT a (natural) human language, and the Unicode Consortium definition of conformance does NOT prohibit subsetting appropriate to the purpose. We DO know more than the Unicode Consortium about Python. For example, I suspect that your catholic appetite for XID in identifiers does not apply to syntactic keywords or names of builtins.
I agree about experimentation. I'm not in a hurry, since I've only seen IDEOGRAPHIC SPACE and full-width ASCII break Python programs once or twice in the ten years I've been teaching my students to use it. Steve

On 23 November 2017 at 16:34, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
We already have a stricter ASCII-only naming policy for the standard library: https://www.python.org/dev/peps/pep-3131/#policy-specification That's different from placing additional constraints on end-user code, though, as that's where the line between "programming language" and "natural language" gets blurry (variable, function, attribute, and method names are often nouns and verbs in the author's language, and this is also the case for data-derived APIs like pandas column names) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, Nov 23, 2017 at 4:46 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Well, then there is some bitter irony in this, so it allows pretty much everything, but does not allow me to beautify code with hyphens. I can fully understand the wish to use non-latin scripts in strings or comments. As for identifiers, IMO, apart from latin letters and underscore, the first unicode candidate I would add is U+2010. And probably the LAST one I would add. Mikhail
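For reference, the rule being discussed can be checked directly from Python. The snippet below is just an illustration using `unicodedata.category` and `str.isidentifier`; the helper name and the choice of sample characters are mine.

```python
import unicodedata

# Separator candidates mentioned in the thread.
candidates = {
    "underscore": "_",
    "hyphen-minus": "-",
    "hyphen": "\u2010",
    "minus sign": "\u2212",
    "middle dot": "\u00b7",
}


def describe(separator):
    """Return (general category, whether 'my<sep>variable' is an identifier)."""
    name = "my" + separator + "variable"
    return unicodedata.category(separator), name.isidentifier()


for label, sep in candidates.items():
    category, valid = describe(sep)
    print(f"U+{ord(sep):04X} {label:13} category={category} identifier={valid}")
```

The underscore (Pc) and the middle dot (special-cased via Other_ID_Continue) are accepted, while both hyphens (Pd) and the minus sign (Sm) are not.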

On Fri, Nov 24, 2017 at 1:10 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Fortunately for the world, you're not the one who decided which characters were permitted in Python identifiers. The ability to use non-English words for function/variable names is of huge value; the ability to use a hyphen is of some value, but not nearly as much. Can this thread move to python-list? Or, better, to python-rants-about-unicode-list, to which I don't subscribe? ChrisA

On Thu, Nov 23, 2017 at 02:29:43PM +0000, Carl Smith wrote:
Can't we just tell everyone to speak US English, and go back to ASCII? It would be a less painful migration.
I trust you're joking, but it makes me twitchy to see people saying that even in jest, because I've come across folks who *actually do* think that way. -- Steve

Mikhail V wrote:
In reality, hyphen and Minus sign are not even closely similar - Minus is ca. twice as wide,
If you are using a font that distinguishes them that clearly, and if the human reader is sufficiently typographically aware to notice the distinction. Both of those are big ifs. And what about handwritten code? I don't know anyone who handwrites hyphens and minuses with precision measured in points. -- Greg

I guess for reference:

exec('a\N{MIDDLE DOT} = 0')
exec('\N{BUHID LETTER RA} = 1')
exec('\N{HANGUL LETTER EU} = 2')
exec('\N{TIFINAGH LETTER YO} = 3')
exec('\N{BOPOMOFO LETTER I} = 4')
exec('\N{HANGUL LETTER ARAEA} = 5')

On Sun, Nov 19, 2017 at 1:38 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:

You think that's bad? https://github.com/reinderien/mimic/blob/master/README.md Abandon all hope ye who use Unicode. On 19 Nov 2017 12:06, "Antoine Pitrou" <solipsis@pitrou.net> wrote:

On 19/11/2017 11:20, Stephan Houben wrote:
You think that's bad? https://github.com/reinderien/mimic/blob/master/README.md
Abandon all hope ye who use Unicode.
On 19 Nov 2017 12:06, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
You can get exactly the same effect by typing command line examples into MS-Word or Outlook and letting people copy and paste to the command line, as both programs "helpfully" and erratically convert double quotes to open/close curly double quotes and minus into hyphen, etc. I have had a large number of cases where this has happened, and I even start many of my utilities that are deployed to others with a "fix parameter issues" function that undoes such substitutions in all of sys.argv, just to cut down on the number of "it doesn't work" complaints. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.
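A "fix parameter issues" function of the sort described above could be sketched like this. It is not the author's actual code: the substitution table is an illustrative guess at what word processors commonly produce, and the function name is made up.

```python
import sys

# Illustrative table of "smart" characters and their plain ASCII forms.
SMART_TO_PLAIN = str.maketrans({
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u2010": "-",   # HYPHEN
    "\u2013": "-",   # EN DASH
    "\u2014": "--",  # EM DASH, often auto-generated from two hyphens
    "\u2212": "-",   # MINUS SIGN
})


def fix_parameters(argv):
    """Return a copy of argv with word-processor substitutions undone."""
    return [arg.translate(SMART_TO_PLAIN) for arg in argv]


if __name__ == "__main__":
    # Clean the real command line before any argument parsing happens.
    sys.argv[1:] = fix_parameters(sys.argv[1:])
    print(fix_parameters(["\u2014verbose", "\u201chello\u201d"]))
```

Calling it before `argparse` or similar sees the arguments means pasted options such as an em-dash "verbose" flag are parsed as intended.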
participants (18)
- Antoine Pitrou
- Antoine Rozo
- Bruce Leban
- Carl Smith
- Chris Angelico
- Greg Ewing
- Guido van Rossum
- Mikhail V
- MRAB
- Nick Coghlan
- Nick Timkovich
- Nikolaus Rath
- Richard Damon
- Serhiy Storchaka
- Stephan Houben
- Stephen J. Turnbull
- Steve Barnes
- Steven D'Aprano