Allowing non-ASCII identifiers

I'd like to work on adding support for non-ASCII characters in identifiers, using the following principles:

1. At run-time, identifiers are represented as Unicode objects unless they are pure ASCII. IOW, they are converted from the source encoding to Unicode objects in the process of parsing.

2. As a consequence of 1), all places where identifiers appear need to support Unicode objects (e.g. __dict__, __getattr__, etc.).

3. Legal non-ASCII identifiers are what legal non-ASCII identifiers are in Java, except that Python may use a different version of the Unicode character database. Python would share the property that future versions allow more characters in identifiers than older versions.

If you are too lazy to look up the Java definition, here is a rough overview. An identifier is "JavaLetter JavaLetterOrDigit*". JavaLetter is a character of the classes Lu, Ll, Lt, Lm, or Lo, or a currency symbol (for Python: excluding $), or a connecting punctuation character (which is unfortunately underspecified; I will research the implementation). JavaLetterOrDigit is a JavaLetter, or a digit, a numeric letter, a combining mark, a non-spacing mark, or an ignorable control character.

Does this need a PEP?

Regards, Martin
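The rule Martin describes can be sketched with Python's own unicodedata module. This is a rough reading of his overview, not a tested port of the Java specification; the category sets below are my interpretation, and currency symbols are omitted for simplicity:

```python
import unicodedata

# Start characters: letters (Lu, Ll, Lt, Lm, Lo) plus connector
# punctuation (Pc, which includes the underscore). Currency symbols
# are left out here, per the "excluding $" caveat above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Pc"}
# Continue characters additionally allow digits, numeric letters,
# combining/non-spacing marks, and format ("ignorable") characters.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "Mn", "Mc", "Cf"}

def is_identifier(name):
    """Check a string against the sketched JavaLetter JavaLetterOrDigit* rule."""
    if not name:
        return False
    for i, ch in enumerate(name):
        cat = unicodedata.category(ch)
        allowed = START_CATEGORIES if i == 0 else CONTINUE_CATEGORIES
        if cat not in allowed:
            return False
    return True

print(is_identifier("élève"))   # True: é and è are category Ll
print(is_identifier("wa_1"))    # True: underscore is Pc, digit is Nd
print(is_identifier("2fast"))   # False: a digit cannot start an identifier
```

As Martin notes, any real definition would also be pinned to a specific version of the Unicode character database.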

Sure does. Since this could create a serious burden for code portability, I'd like to see a serious section on motivation and discussion on how to keep Unicode out of the standard library and out of most 3rd party distributions. Without that I'm strongly -1. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido van Rossum]
There are two matters here. One is technical portability; the other is more related to human issues.

The technical portability burden is fairly limited to the `coding:' magic, and not very different from it. As long as the coding is known to Python, the various modules in a single application could each use their own coding and, consequently, each their own lexical behaviour for identifiers. It should not be much of a problem in practice. For the standard library, say, it is a matter of developers agreeing to rule out `coding:'.

The human issues are another thing, and I would guess -- without being fully sure -- that this is where your reluctance lies. There is a fear that people would not only comment in German (I'm using German as an example, of course), or use German strings, but also choose identifiers based on German words instead of English words, making programs less easy to read by English speakers, or relieving a bit of the constant pressure all programmers on this planet feel to learn English, a pressure which is probably deemed useful towards this uniformity.

Almost all "foreigners" already know that if they aim at the international community while programming, English is likely the best way to communicate, and they consequently comment in English and choose their identifiers to be meaningful to English readers. But, just as not all software is meant to be free (whatever that word means), not all software is meant for international distribution. Some English readers might not really imagine it, but it is a constant misery having to mangle identifiers while documenting and thinking in languages other than English, merely because the Python notion of a letter is limited to the English subset. Granted, keywords and the standard library use English; this is Python, and this is not at stake here!
However, there is a good part of code in local (or in-house) programs which we think of as our own crafted code, and even the linguistic change is useful (to us) for distinguishing between what comes from the language and what comes from us. The idea of being able to craft and polish our code (comments, strings, identifiers) to make it as nice as it can get, while thinking in our native, natural language, is extremely appealing. -- François Pinard http://www.iro.umontreal.ca/~pinard

"François Pinard" <pinard@iro.umontreal.ca> wrote in message news:20040209162704.GB7467@titan.progiciels-bpi.ca...
There are two matters here. One is the technical portability, the other is more related to human issues.
I think there are two human issues: language and alphabet (character set) used to transcribe the language.
This, of course, can be and is being done now, as long as one drops euro accents (I know, eleve is probably ugly to you) or transliterates other alphabets into the English subset of Latin chars. For instance, if I remember correctly, part of the Python scripting for Blade of Darkness was done by Spanish-speaking programmers using Spanish identifiers and comments. There have also been snippets posted on c.l.p with German and other European language words. But I can potentially read, understand, and even edit such code. For another example, if I read

    for wa in konichi:
        ...
        konichi[wa] = <something>

I can recognize that the two occurrences each of 'konichi' and 'wa' are the same, and that konichi is dict-like, even if I do not know their English meanings (assuming they are not gibberish). But if they were in Japanese chars, for instance, matching names to names would be *much* harder.

Having at various times more or less learned, and partly forgotten, how to at least phonetically read words in Cyrillic, Greek, and Devanagari (Sanskrit) characters, I can appreciate that learning Latin chars must also be a chore to those who learned something else as children. But I also know that I do not necessarily wish to learn a dozen more sets, nor would I wish such on everyone else.

Having said this against the proposal, I suspect that a Unicode-identifier version of Python, official or not, is inevitable, especially if, as and when Python spreads beyond the internationalized elite of each country. If so, I would like to see it at least accompanied by transliteration programs (preferably two-way) using the most accepted transliteration scheme for each alphabet. Terry J. Reedy
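For Latin-script accents at least, the one-way half of the transliteration Terry asks for can be sketched with Unicode decomposition. This is only an illustrative approach, not one of the tools discussed in the thread, and it handles accented Latin letters only, not other alphabets:

```python
import unicodedata

def to_ascii(name):
    # Decompose accented characters (é -> e + combining acute accent),
    # then drop the combining marks, which fall outside ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("élève"))     # eleve
print(to_ascii("François"))  # Francois
```

A genuinely two-way scheme is harder, since distinct accented forms can collapse to the same ASCII string; that ambiguity is exactly why Terry asks for an accepted standard per alphabet.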

[Terry Reedy]
I think there are two human issues: language and alphabet (character set) used to transcribe the language.
[...] I can potentially read, understand, and even edit such code. For another example, if I read
Hello, Terry, and other Python developers. Some people "pronounce" in their head, and recognise the music :-). For them, phonetic rendering is a help. Others would not have much trouble recognising recurring images, even if they do not associate sounds with them. Japanese or Korean people used to write to me once in a while, and I quickly came to recognise their names written in their own script, without a real need to look at the accompanying phonetic transcription. You know, when I read English words, I do not pronounce them correctly in my head. English speakers do not see the difficulty a foreigner has getting the proper pronunciation (not even speaking of tonic accents, which I surely get all wrong most of the time!). English looks especially difficult to me, phonetics-wise. French is my own language, and I worked for a while on automatic phonetic transcriptions; although French looks more "regular" than English on that topic, I know from having done it that automatic transcription is a difficult exercise.
But I also know that I do not necessarily wish to learn a dozen more sets, nor would I wish such on everyone else.
Nobody is really asking you to do so. As long as my co-workers and I can read each other, we are happy. We do not need our code to be understandable or easy everywhere else; this is not a problem. And if Japanese Pythoneers use Kanji or Hiragana, and are happy with them, I would surely not come along and declare their pleasure worthless merely because I would need some adaptation to handle their code. The game here is to be comfortable in contexts where work is not meant to be widely disseminated.
Such tools would undoubtedly be fun to have, but I suspect they would not really serve a strong need. When I read an English Python program, maybe I would like a tool that transliterates that English program into some phonetic alphabet so I can figure out how it should sound -- but this is a difficult and unnecessary challenge, deep down.

I do not know Asian languages, yet people have explained to me that the correspondence between pictograms and phonetics is far from easy. The same pictogram can be used with different meanings and have different pronunciations. Even with the same meaning, the pronunciation can be very different, not only between regions, but also between countries (for example, Chinese and Japanese readers can figure out many of each other's pictograms, yet share absolutely no commonality in pronunciation). The other direction, from Hiragana to Kanji for Japanese, is also very difficult, because it is quite ambiguous as well.

All this to say that it would be best to let people be comfortable in their own language while using Python, without sketching difficult prerequisites over the project which do not buy us much in practice. In my previous work on internationalisation, one of the most recurrent problems I observed was people trying to solve others' problems, problems which they do not have themselves (they want to save the World![*]). A typical example would be that some people object to non-ASCII identifiers in Python, foreseeing the difficulties others might have finding proper editors for handling such sources. One can exhaust oneself in such discussions. The proper attitude in such matters, in my opinion, is to concentrate on the technical problems related to the Python implementation, without taking everything else on one's shoulders. If the job is done well enough, people will happily take care and adapt.

--------------------
[*] (For trekkies:) They all have a little Wesley Crusher in them!
:-) -- François Pinard http://www.iro.umontreal.ca/~pinard

On Wed, Jan 14, 2004, "Martin v. Löwis" wrote:
Is that an actual space between the characters?
Does this need a PEP?
Yes. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ A: No. Q: Is top-posting okay?

[Martin von Löwis]
I'd like to work on adding support for non-ASCII characters in identifiers[...]
Such support would surely be extremely welcome to me, and to most of my co-workers. There are likely many teams around this planet that would appreciate it as well. Tell me if you think I may help somehow, despite my modest means (I'm overloaded with duties already, but that is the story for most of us).
This is already the case, isn't it?
2. As a consequence of 1), all places where identifiers appear need to support Unicode objects (e.g. __dict__, __getattr__, etc.)
I do not know the internals well, yet I suspect one more thing to consider is whether Unicode strings that look like non-ASCII identifiers should be interned, the same as is currently done for ASCII ones.
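As a historical footnote on the interning question: in today's Python 3, sys.intern accepts any str, ASCII or not. A quick check of that later behavior (not of the 2004 interpreter under discussion):

```python
import sys

# Intern the same non-ASCII identifier text arriving from two
# different sources; intern returns the one canonical object.
a = sys.intern("élève")
b = sys.intern("".join(["é", "lève"]))  # built dynamically, equal but distinct

print(a == b)  # equal as strings
print(a is b)  # and the same object after interning
```

CPython itself interns names used as identifiers, which is what makes dict lookups on __dict__ fast; François's question is exactly whether non-ASCII names would get the same treatment.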
If you are too lazy to look up the Java definition, here is a rough overview: An identifier is "JavaLetter JavaLetterOrDigit*"
Then, maybe we should be a tad conservative whenever there is some doubt, rather than sticking too closely to Java. It is easier to open up a bit more later than to close what was opened. For example, all currency symbols might be verboten to start with. Or maybe not. Connecting punctuation characters might be limited to the underscore to start with, and maybe also added into JavaLetterOrDigit. A sure thing is that underscores should be allowed embedded within non-ASCII identifiers. Is the no-break space a "connecting punctuation"? :-)

Just for the amusement, I noticed that if file `francais.py' contains:

    # -*- coding: Latin-1 -*-
    élève = 3
    print élève

and file `francais' contains:

    import locale
    locale.setlocale(locale.LC_ALL, '')
    import francais

then the command `python francais', in my environment where `LANG' is set to `fr_CA.ISO-8859-1', does yield:

    3

So, the Python compiler is sensitive to the active locale. Someone pointed out, a good while ago, that Latin-1 characters were accepted interactively because `readline' was setting the locale, but it seems that setting the locale ourselves allows for batch import as well. This is kind of a happy bug! May we count on it being supported in the interim? :-) :-) -- François Pinard http://www.iro.umontreal.ca/~pinard
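For later readers of this archive: the behavior François stumbled onto by accident eventually became official. Python 3 (PEP 3131) accepts non-ASCII identifiers directly, with UTF-8 as the default source encoding, so his example needs no locale tricks at all:

```python
# Valid Python 3 source as-is; PEP 3131 allows non-ASCII identifiers
# and the default source encoding is UTF-8.
élève = 3
print(élève)  # 3
```

Python 3 bases its identifier rules on Unicode's XID_Start/XID_Continue properties, which is close in spirit to the Java-derived rule sketched at the start of this thread.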

François Pinard wrote:
Currently, all identifiers are byte strings, at run-time, representing ASCII characters. IOW, you currently won't observe Unicode strings as identifiers.
Indeed; I had not thought about this.
Yes, that's a bug. It will use byte strings as identifiers (without running your example, I'd expect that dir() shows they are UTF-8)
This is kind of an happy bug! May we count on it being supported in the interim? :-) :-)
I would think so: this bug has been present for quite some time, and nobody complained :-) Martin

[Martin von Löwis]
This is already the case, isn't it?
Oops, sorry. I misread your sentence as limiting itself to identifiers. I thought I had read that the effect of `coding:' was to convert the whole source to Unicode before the scanner pass. This is all from fuzzy memory.
Indeed; I had not thought about this.
This is only an optimisation consideration, which might be premature. On the other hand, speed considerations should not, in the long run, be allowed to weigh against someone who wants to write national identifiers.
Yes, that's a bug. It will use byte strings as identifiers (without running your example, I'd expect that dir() shows they are UTF-8)
They seem to be Latin-1. Consider that characters could not be classified correctly in a Latin-1 environment, if they were UTF-8.
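François's classification argument is easy to verify with the bytes themselves: Latin-1 encodes each accented letter as a single byte, while UTF-8 uses two, so a locale-based character classifier reading UTF-8 bytes would see pairs it cannot classify as letters. A quick check in modern Python 3 (where str.encode produces the raw bytes):

```python
s = "élève"
print(s.encode("latin-1"))  # 5 bytes: b'\xe9l\xe8ve'
print(s.encode("utf-8"))    # 7 bytes: b'\xc3\xa9l\xc3\xa8ve'
```

Since the accepted source bytes matched what the fr_CA.ISO-8859-1 locale classified as letters, the identifiers in the happy-bug experiment had to be Latin-1, as François says.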
This is kind of an happy bug! May we count on it being supported in the interim? :-) :-)
I would think so: this bug has been present for quite some time, and nobody complained :-)
Would Guido accept to pronounce on that? :-) As much as we yearn to write our things in better French, we would not allow ourselves to write code that will likely break later. We work in a production environment. A Python command-line option to set the locale from environment variables, before compilation of `__main__' starts, could be imagined, but that might be too much effort in the wrong direction. Best and simplest would be for the `coding:' pragma to really drive the character set used in a file. -- François Pinard http://www.iro.umontreal.ca/~pinard

[François Pinard]
This is already the case, isn't it?
Re-oops! Really, I must be tired, to write so ambiguously. I should have written something more like: Oops, sorry. I misread your sentence, and missed the fact that it was limiting itself to identifiers. I thought I once read that the effect of `coding:' ... [etc.]
This is kind of an happy bug! May we count on it being supported in the interim? :-) :-)
I would think so: this bug has been present for quite some time, and nobody complained :-)
Would Guido accept to pronounce on that? :-)
I'm still ambiguous above... Tss, tss! Would Guido pronounce on the fact that the bug will _not_ be corrected, at least not until Python supports non-ASCII identifiers in more complete and correct ways? -- François Pinard http://www.iro.umontreal.ca/~pinard

participants (5)
- "Martin v. Löwis"
- Aahz
- François Pinard
- Guido van Rossum
- Terry Reedy