
[Martin von Löwis]
I'd like to work on adding support for non-ASCII characters in identifiers[...]
Such support would surely be extremely welcome to me, and to most of my co-workers. There are likely many teams around this planet that would appreciate it as well. Tell me if you think I may help somehow, despite my modest means (I'm overloaded with duties already, but this is the story for most of us).
1. At run-time, identifiers are represented as Unicode objects unless they are pure ASCII. IOW, they are converted from the source encoding to Unicode objects in the process of parsing.
This is already the case, isn't it?
2. As a consequence of 1), all places where identifiers appear need to support Unicode objects (e.g. __dict__, __getattr__, etc).
I do not know the internals much, yet I suspect one more thing to consider is whether Unicode strings that look like non-ASCII identifiers should be interned or not, the same as is currently done for ASCII ones.
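To make point 2 and the interning question concrete, here is a minimal sketch. It assumes a Python 3 interpreter, where every `str` is a Unicode string, which is not the interpreter under discussion in this thread. The class name `Student` and the attribute name are purely illustrative.

```python
import sys


class Student:
    """Hypothetical class, used only to hold an attribute."""


s = Student()

# A non-ASCII name used as an attribute key: it is stored in the
# instance __dict__ and found again through getattr().
setattr(s, "élève", 3)
assert getattr(s, "élève") == 3
assert "élève" in vars(s)

# Interning: two equal strings built at run time are mapped onto one
# shared object, so later comparisons can be done by identity.
a = sys.intern("".join(["él", "ève"]))
b = sys.intern("".join(["élè", "ve"]))
assert a == b
assert a is b
```

The sketch only shows that dictionary lookup and interning need not care whether the name is ASCII; it says nothing about how the interpreter internals would have to change.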
3. Legal non-ASCII identifiers are what legal non-ASCII identifiers are in Java, except that Python may use a different version of the Unicode character database. Python would share the property that future versions allow more characters in identifiers than older versions.
If you are too lazy to look up the Java definition, here is a rough overview: An identifier is "JavaLetter JavaLetterOrDigit*".
JavaLetter is a character of the classes Lu, Ll, Lt, Lm, or Lo, or a currency symbol (for Python: excluding $), or a connecting punctuation character (which is unfortunately underspecified - will research the implementation).
JavaLetterOrDigit is a JavaLetter, or a digit, a numeric letter, a combining mark, a non-spacing mark, or an ignorable control character.
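The rough overview above can be sketched with the `unicodedata` module. This is only an approximation under stated assumptions, not the Java rule itself: the mapping of "currency symbol" to category Sc, "connecting punctuation" to Pc, and "ignorable control character" to Cf is my own reading, and the function names are hypothetical. It assumes Python 3.

```python
import unicodedata

# Categories named in the overview for the first character.
LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}
START_EXTRA = {"Sc", "Pc"}  # assumed: currency symbols, connecting punctuation
# Additional categories allowed after the first character.
# Nd: digit, Nl: numeric letter, Mc: combining mark, Mn: non-spacing mark,
# Cf: assumed stand-in for "ignorable control character".
PART_EXTRA = {"Nd", "Nl", "Mc", "Mn", "Cf"}


def is_java_letter(ch):
    """True if ch may start an identifier ('$' excluded for Python)."""
    if ch == "$":
        return False
    return unicodedata.category(ch) in LETTERS | START_EXTRA


def is_java_letter_or_digit(ch):
    """True if ch may appear after the first character."""
    return is_java_letter(ch) or unicodedata.category(ch) in PART_EXTRA


def is_identifier(name):
    """True if name matches JavaLetter JavaLetterOrDigit*."""
    return (
        bool(name)
        and is_java_letter(name[0])
        and all(is_java_letter_or_digit(ch) for ch in name[1:])
    )
```

Under these assumptions, `is_identifier("élève")` and `is_identifier("_x1")` are true, while `is_identifier("1x")` and `is_identifier("$x")` are false. The result depends on the version of the Unicode character database shipped with the interpreter, which is the property point 3 describes.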
Then, maybe we should be a tad conservative whenever there is some doubt, rather than sticking too closely to Java. It is easier to open a bit more later than to close what was opened. For example, all currency symbols might be verboten to start with. Or maybe not. Connecting punctuation characters might be limited to the underline to start with, and may also be added to JavaLetterOrDigit. A sure thing is that underlines should be allowed embedded within non-ASCII identifiers. Is the unbreakable space a "connecting punctuation"? :-)

Just for amusement, I noticed that if file `francais.py' contains:

----------------------------------------------------------------------->
# -*- coding: Latin-1 -*-
élève = 3
print élève
-----------------------------------------------------------------------<

and file `francais' contains:

----------------------------------------------------------------------->
import locale
locale.setlocale(locale.LC_ALL, '')
import francais
-----------------------------------------------------------------------<

then command `python francais', in my environment where `LANG' is set to `fr_CA.ISO-8859-1', does yield:

---------------------------------------------------------------------->
3
----------------------------------------------------------------------<

So, the Python compiler is sensitive to the active locale. Someone pointed out, a good while ago, that Latin-1 characters were accepted interactively because `readline' was setting the locale, but it seems that setting the locale ourselves allows for batch import as well. This is kind of a happy bug! May we count on it being supported in the interim? :-) :-)

-- 
François Pinard http://www.iro.umontreal.ca/~pinard