[Python-3000] Unicode identifiers (Was: sets in P3K?)

"Martin v. Löwis" martin at v.loewis.de
Sun Apr 30 01:51:26 CEST 2006


Guido van Rossum wrote:
>> > But Unicode has many alternative sets of digits for which "isdigit"
>> > is true.
>>
>> You mean, the Python isdigit() method? Sure, but the tokenizer uses
>> the C isdigit function, which gives true only for [0-9].
> 
> Isn't that because it's only defined on 8-bit characters though?

No: the C standard requires that isdigit is true if and only if
the character is from [0-9]; it also requires that the digits
have consecutive ordinals in the "execution character set", and that
each of them is represented using a single char (rather than requiring
multiple bytes).

Currently, the tokenizer operates on UTF-8, which is multi-byte,
but still, isdigit works "correctly".
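
A rough illustration of why the byte-wise test stays correct on UTF-8
(a sketch in present-day Python, not the tokenizer's code; c_isdigit
is a made-up stand-in for the C function):

```python
# Hypothetical stand-in for C's isdigit(): true only for the bytes '0'..'9'.
def c_isdigit(byte):
    return 0x30 <= byte <= 0x39

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# Python's Unicode-aware method accepts it as a digit ...
python_says_digit = arabic_three.isdigit()

# ... but its UTF-8 encoding consists only of bytes >= 0x80, so a
# byte-wise [0-9] test can never fire on any part of it.
encoded = ("x" + arabic_three + "7").encode("utf-8")
ascii_digits_found = [chr(b) for b in encoded if c_isdigit(b)]
```

Only the ASCII "7" is reported: every byte of a multi-byte UTF-8
sequence has its high bit set, so it cannot collide with [0-9].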

> And if we're talking about Unicode, why shouldn't we use the Unicode
> isdigit()? After all you were talking about the Unicode consortium's
> rules for which characters can be part of identifiers.

The tokenizer doesn't use isdigit() to determine what an identifier is;
it uses isalnum(). The parser uses isdigit only to determine what a
number literal is - I don't propose to change that. The Unicode
consortium rules are listed here:

http://www.unicode.org/reports/tr31/

This recommendation mentions two classes ID_Start and ID_Continue:

ID_Start: Uppercase letters, lowercase letters, titlecase letters,
modifier letters, other letters, letter numbers, stability extensions

ID_Continue: All of the above, plus nonspacing marks, spacing combining
marks, decimal numbers, connector punctuations, stability extensions.
These are also known simply as Identifier Characters, since they are a
superset of ID_Start. The set of ID_Continue characters minus the
ID_Start characters is known as the ID_Only_Continue characters.

In the implementation, a compact table should be used to determine
whether a character is ID_Start or ID_Continue, instead of calling
some library function.
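
A minimal sketch of such a table, assuming it is generated from the
general categories listed above (the stability extensions are left
out, and all names and the 0x0800 limit are made up for illustration):

```python
import unicodedata

# General categories from the lists above; stability extensions omitted.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}

START = 1      # bit set if the character may start an identifier
CONTINUE = 2   # bit set if the character may continue an identifier

def build_table(limit):
    """Precompute one flag byte per code point below `limit`."""
    table = bytearray(limit)
    for cp in range(limit):
        category = unicodedata.category(chr(cp))
        if category in ID_START_CATEGORIES:
            table[cp] = START | CONTINUE
        elif category in ID_CONTINUE_EXTRA:
            table[cp] = CONTINUE
    return table

# Arbitrary small limit to keep the sketch cheap; a real table would
# cover all code points and be compressed.
TABLE = build_table(0x0800)

def is_id_start(ch):
    cp = ord(ch)
    return cp < len(TABLE) and bool(TABLE[cp] & START)

def is_id_continue(ch):
    cp = ord(ch)
    return cp < len(TABLE) and bool(TABLE[cp] & CONTINUE)
```

The lookup itself is then a single index and mask, with no library
call at tokenizing time. Code points beyond the limit are simply
rejected here, which a real implementation would not do.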

There are some problems with the UAX#31 definitions IIRC, although
I forgot the exact details (might be that the underscore is missing,
or that the dollar is allowed); the definitions should be adjusted
so that they match the current language for ASCII.
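
One way to see which ASCII characters would need adjusting is to
compare the category lists above against the current ASCII rules.
This sketch checks only those general categories, not the actual
UAX#31 data, and all names in it are made up:

```python
import string
import unicodedata

# Categories as listed above; stability extensions omitted.
START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}

# The current ASCII identifier rules of the language.
ascii_start = set(string.ascii_letters + "_")
ascii_continue = ascii_start | set(string.digits)

ascii_chars = [chr(cp) for cp in range(128)]
uax_start = {c for c in ascii_chars
             if unicodedata.category(c) in START_CATS}
uax_continue = {c for c in ascii_chars
                if unicodedata.category(c) in CONTINUE_CATS}

# Characters on which the two definitions disagree.
start_mismatch = sorted(ascii_start ^ uax_start)
continue_mismatch = sorted(ascii_continue ^ uax_continue)
```

With these categories alone, the underscore (category Pc) is the only
ASCII difference, and only as a start character; the dollar sign has
category Sc and is excluded either way.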

>> FWIW, POSIX
>> allows 6 alternative characters to be defined as hexdigits for
>> isxdigit, so the tokenizer shouldn't really use isxdigit for
>> hexadecimal literals.
> 
> I think if we're talking Unicode, POSIX is irrelevant though, right?

What I'm saying is that the tokenizer currently uses isxdigit;
it should stop doing so (whether or not Unicode identifiers become
part of the language).

As source code would (still) be parsed as UTF-8, isxdigit would
continue to "work", but definitely shouldn't be used anymore.
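
A locale-independent replacement is easy to write as an explicit set
test. This is only a sketch; the function names are hypothetical and
it is not the tokenizer's code:

```python
# Explicit set of hexadecimal digits; no locale is consulted.
HEX_DIGITS = frozenset("0123456789abcdefABCDEF")

def is_hex_digit(ch):
    return ch in HEX_DIGITS

def scan_hex_literal(text):
    """Return the hex literal at the start of `text`, or None."""
    if text[:2] not in ("0x", "0X"):
        return None
    end = 2
    while end < len(text) and is_hex_digit(text[end]):
        end += 1
    # Require at least one digit after the prefix.
    return text[:end] if end > 2 else None
```

For example, scan_hex_literal("0x1Fg") stops before the "g", and a
fullwidth "A" (U+FF21) is not accepted as a hex digit.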

> But we force the locale to be C, right? I've never heard of someone
> who managed to type non-ASCII letters into identifiers, and I'm sure
> it would've been reported as a bug.

Python 2.3.5 (#2, Mar  6 2006, 10:12:24)
[GCC 4.0.3 20060304 (prerelease) (Debian 4.0.2-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_ALL, "")
'de_DE@euro'
py> löwis=1
py> print löwis
1

We don't force the C locale - we just happen to start with it
initially. We shouldn't change it later, as that isn't thread-safe.
Nobody reported it, because people just don't try to do that,
except in interactive mode.

>> I can't see why the Unicode notion of digits should affect the
>> language specification in any way. The notion of digit is only
>> used to define what number literals are, and I don't propose
>> to change the lexical rules for number literals - I propose
>> to change the rules for identifiers.
> 
> Well identifiers can contain digits too.

Sure. But they don't "count" as digits then, lexically - they
are ID_Continue characters (which is a superset of digits).
So what we need is to extend the definition of ID_Continue,
not the definition of digits.
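
A toy tokenizer shows the separation (a sketch with the current
ASCII rules and a made-up token set, not Python's real grammar):

```python
import re

# Identifiers and number literals are separate lexical rules; a digit
# inside a name is matched by the NAME rule and never reaches NUMBER.
TOKEN = re.compile(r"""
    (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)   # start class, then continue class
  | (?P<NUMBER>[0-9]+)                 # number literal: ASCII digits only
  | (?P<OP>[=+\-*/()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError("unexpected character %r" % source[pos])
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

tokenize("x1 = 42") yields a NAME "x1", an OP "=" and a NUMBER "42".
Allowing more identifier characters means widening the two classes in
the NAME rule; the NUMBER rule stays as it is.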

> I do think that *eventually* we'll have to support this. But I don't
> think Python needs to lead the pack here; I don't think the tools are
> ready yet.

Python doesn't really lead here. The C family of languages (C, C++,
Java, C#) all have Unicode identifiers, so there is plenty of
experience. Primarily, the experience is that the feature isn't
used much, because of obstacles I think we can overcome (primarily,
that all these languages make the source encoding
implementation-defined; we don't, as we put the source encoding into
the source file).
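
To illustrate the last point: because the encoding is declared in the
file, a tool can decode the identifiers without guessing. This sketch
uses tokenize.detect_encoding from the standard library of later
Python versions; the sample source is made up:

```python
import io
import tokenize

# A source file that declares its own encoding on the first line and
# then uses a non-ASCII name, stored as UTF-8 bytes.
source = b"# -*- coding: utf-8 -*-\nl\xc3\xb6wis = 1\n"

# detect_encoding reads the declaration from the first lines.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# With the encoding known, the bytes decode unambiguously.
text = source.decode(encoding)
```

No compiler option or platform default is involved: the same bytes
decode the same way on every system.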

Regards,
Martin

