
[Martin von Löwis]
François Pinard wrote:
1. At run-time, identifiers are represented as Unicode objects unless they are pure ASCII. IOW, they are converted from the source encoding to Unicode objects in the process of parsing.
This is already the case, isn't it?
Currently, all identifiers are byte strings, at run-time, representing ASCII characters. IOW, you currently won't observe Unicode strings as identifiers.
Oops, sorry. I misread your sentence as limiting itself to identifiers. I thought having read that the effect of `coding:' was to convert the whole source to Unicode before the scanner pass. This is all from fuzzy memory.
I do not much know the internals, yet I suspect one more thing to consider is whether Unicode strings looking like non-ASCII identifiers should be interned or not, the same as currently done for ASCII.
Indeed; I had not thought about this.
This is only an optimisation consideration, which might be premature. On the other hand, speed considerations should not be serious enough to play against one willing to write national identifiers, in the long run.
# -*- coding: Latin-1 -*- élève = 3 print élève [...] So, the Python compiler is sensitive to the active locale.
Yes, that's a bug. It will use byte strings as identifiers (without running your example, I'd expect that dir() shows they are UTF-8)
They seem to be Latin-1. Consider that characters could not be classified correctly in a Latin-1 environment, if they were UTF-8.
This is kind of an happy bug! May we count on it being supported in the interim? :-) :-)
I would think so: this bug has been present for quite some time, and nobody complained :-)
Would Guido accept to pronounce on that? :-) As much as we starve to write our things in better French, we would not allow ourselves to write code that will likely break later. We work in a production environment. A Python command option to set the locale from environment variables, before compilation of `__main__' starts, but that might be too much effort in the wrong direction. Best and simplest would be that the `coding:' pragma really drives the character set used in a file. -- François Pinard http://www.iro.umontreal.ca/~pinard