[Python-3000] Unicode identifiers (Was: sets in P3K?)

Sat Apr 29 00:21:11 CEST 2006

On 4/28/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Guido van Rossum wrote:
> >> I was hoping to propose a PEP on non-ASCII identifiers some
> >> day; that would (of course) include a requirement that the
> >> standard library would always be restricted to ASCII-only
> >> identifiers as a style-guide.
> >
> > IMO communication about code becomes much more cumbersome if there are
> > non-ASCII letters in identifiers, and the rules about what's a letter,
> > what's a digit, and what separates two identifiers become murky.
>
> It depends on the language you use to communicate. In English,
> it is certainly cumbersome to talk about Chinese identifiers.
> OTOH, I believe it is also cumbersome to communicate about English
> identifiers in Chinese, because the speakers might
> not even know what the natural-language concept behind the
> identifiers is, and because they can't pronounce the identifier.

True; but (and I realize we're just pitting our beliefs against each
other here) I believe that Chinese computer users are more likely to
be able (and know how) to type characters in the Latin alphabet than
Western programmers are able to type Chinese. For example, I notice
that baidu.cn (a Chinese search engine) spells its own name (a big
brand in China) using the Latin alphabet. I expect that Chinese users
are used to typing "baidu.cn" in their browser's search bar, rather
than Chinese characters.

> As for lexical aspects: these are really straight-forward.
> In principle, it would be possible to allow any non-ASCII
> character as part of an identifier: all punctuation is ASCII,
> so anything non-ASCII can't possibly be punctuation for the
> language. However, that much freedom would be confusing;
> the Unicode consortium has established rules of what characters
> should be allowed in identifiers, and these rules intend to
> match the intuition of the users of these characters.

Right; for example the "nice" quotes, dashes and ellipsis that Word
likes to generate are not ASCII, but they feel like punctuation to me,
and it would be very confusing if they were allowed in identifiers.
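
To make that concrete, here is a minimal sketch, assuming a Python whose
strings are Unicode and using the standard unicodedata module. The name
word_style_chars is made up for illustration. It only shows that these
characters sit outside ASCII while Unicode itself still classifies them
as punctuation (general categories starting with "P").

```python
import unicodedata

# Hypothetical sample: characters that word processors commonly substitute.
word_style_chars = {
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}

for label, ch in word_style_chars.items():
    # None of these is ASCII, yet every general category starts with "P".
    print(label, hex(ord(ch)), unicodedata.category(ch))

# str.isidentifier() follows the Unicode identifier rules, so a name
# containing one of these characters is rejected.
dashed_name = "a\u2014b"
print(dashed_name.isidentifier())
```

So a rule of "anything non-ASCII is a letter" would accept these, while a
rule based on the Unicode categories would not.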

> The distinction of letters and digits is also straight-forward:
> a digit is ASCII [0-9]; it's a separate lexical class only
> because it plays a special role in (number) literals. More
> generally, there is the distinction of starter and non-starter
> characters.

But Unicode has many alternative sets of digits for which "isdigit" is true.
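
A small sketch of what I mean, assuming Unicode strings and the standard
unicodedata module. The names samples and ASCII_DIGITS are just
illustrative. Each sample is a decimal digit from a different script,
and only one of them falls in ASCII [0-9].

```python
import unicodedata

ASCII_DIGITS = "0123456789"

# One digit each from ASCII, Arabic-Indic, Devanagari and fullwidth forms.
samples = ["7", "\u0667", "\u0967", "\uff17"]

for ch in samples:
    # isdigit() is true for all four; the ASCII membership test is not.
    print(unicodedata.name(ch), ch.isdigit(), ch in ASCII_DIGITS)

ascii_only = [ch for ch in samples if ch in ASCII_DIGITS]
```

A lexer that defines "digit" as ASCII [0-9] and one that defines it via
"isdigit" would therefore disagree on three of these four characters.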

> An identifier ends when the first non-identifier character
> is encountered (although I don't think there are many places
> in Python where you can have two identifiers immediately following
> each other).

You can, as far as the lexer is concerned, because the lexer treats
keywords as "just" identifiers. Only the parser knows which ones are
really keywords.
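
This can be observed with the standard library's tokenize module. It is a
pure-Python tokenizer, so take it as an illustration of the idea rather
than a dump of what the C lexer does. The source string and variable
names are made up for the example.

```python
import io
import keyword
import tokenize

source = "x = 1 if y else 2\n"

# Collect every token the tokenizer labels as NAME.
names = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.NAME
]
print(names)

# Telling keywords apart takes a separate step, done here via the keyword module.
keywords_seen = [name for name in names if keyword.iskeyword(name)]
print(keywords_seen)
```

Here "if" and "y" come out as two NAME tokens in a row, separated only by
whitespace.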

--
--Guido van Rossum (home page: http://www.python.org/~guido/)