[Python-3000] Unicode identifiers (Was: sets in P3K?)

"Martin v. Löwis" martin at v.loewis.de
Sat Apr 29 05:07:29 CEST 2006


Guido van Rossum wrote:
>> The distinction of letters and digits is also straight-forward:
>> a digit is ASCII [0-9]; it's a separate lexical class only
>> because it plays a special role in (number) literals. More
>> generally, there is the distinction of starter and non-starter
>> characters.
> 
> But Unicode has many alternative sets of digits for which "isdigit" is true.

You mean, the Python isdigit() method? Sure, but the tokenizer uses
the C isdigit function, which gives true only for [0-9]. FWIW, POSIX
allows 6 alternative characters to be defined as hexdigits for
isxdigit, so the tokenizer shouldn't really use isxdigit for
hexadecimal literals.
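A small sketch (mine, not from the thread) of the distinction: Python's str.isdigit() accepts many Unicode digit sets, while a strict [0-9] test behaves like C isdigit() in the C locale. The helper name is mine for illustration.

```python
# Python's Unicode-aware isdigit() vs. a strict ASCII [0-9] test.
s_ascii = "123"
s_arabic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITs ONE, TWO, THREE

print(s_ascii.isdigit())   # True
print(s_arabic.isdigit())  # True -- Unicode digits count too

def is_ascii_digit(ch):
    """Strict [0-9] test, like C isdigit() in the C locale."""
    return "0" <= ch <= "9"

print(all(is_ascii_digit(c) for c in s_ascii))   # True
print(all(is_ascii_digit(c) for c in s_arabic))  # False
```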

So from the implementation point of view, nothing much would have
to change: the use of isalnum in the tokenizer is already wrong,
as it already allows putting non-ASCII characters into identifiers
if the locale classifies them as alphanumeric.

I can't see why the Unicode notion of digits should affect the
language specification in any way. The notion of digit is only
used to define what number literals are, and I don't propose
to change the lexical rules for number literals - I propose
to change the rules for identifiers.

> You can as far as the lexer is concerned because the lexer treats
> keywords as "just" identifiers. Only the parser knows which ones are
> really keywords.

Right. But if the identifier syntax was
[:identifier_start:][:identifier_cont:]*
then things would work out just fine: identifier_start intersected
with ASCII would be [A-Za-z_], and identifier_cont intersected
with ASCII would be [A-Za-z0-9_]; this would include all keywords.
You would still need punctuation between two subsequent
"identifiers", and that punctuation would have to be ASCII, as
non-ASCII characters would be restricted to comments, string
literals, and identifiers.
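The proposed rule can be sketched roughly like this (my illustration only: the exact character classes were not pinned down in this message; the Unicode categories below follow the usual identifier conventions and are an assumption, as are the function names):

```python
import re
import unicodedata

# ASCII intersections from the proposal:
#   identifier_start & ASCII == [A-Za-z_]
#   identifier_cont  & ASCII == [A-Za-z0-9_]
ASCII_START = re.compile(r"[A-Za-z_]")
ASCII_CONT = re.compile(r"[A-Za-z0-9_]")

# Assumed non-ASCII classes (letters for start; add marks,
# decimal digits, and connector punctuation for continuation).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONT_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def is_start(ch):
    if ord(ch) < 128:
        return bool(ASCII_START.match(ch))
    return unicodedata.category(ch) in START_CATEGORIES

def is_cont(ch):
    if ord(ch) < 128:
        return bool(ASCII_CONT.match(ch))
    return unicodedata.category(ch) in CONT_CATEGORIES

def is_identifier(s):
    """identifier ::= identifier_start identifier_cont*"""
    return bool(s) and is_start(s[0]) and all(is_cont(c) for c in s[1:])

print(is_identifier("Löwis"))  # True
print(is_identifier("while"))  # True -- keywords lex as identifiers
print(is_identifier("2fast"))  # False -- digits cannot start a name
```

Note how "while" passes: the lexer sees only an identifier, and the parser decides whether it is a keyword.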

Regards,
Martin


More information about the Python-3000 mailing list