[Python-3000] PEP: Supporting Non-ASCII Identifiers

Eric V. Smith eric+python-dev at trueblade.com
Wed May 2 14:26:42 CEST 2007


Martin v. Löwis wrote:
...
> Specification of Language Changes
> =================================
> 
> The syntax of identifiers in Python will be based on the Unicode
> standard annex UAX-31 [1]_, with elaboration and changes as defined
> below.
> 
> Within the ASCII range (U+0001..U+007F), the valid characters for
> identifiers are the same as in Python 2.5. This specification only
> introduces additional characters from outside the ASCII range. For
> other characters, the classification uses the version of the Unicode
> Character Database as included in the unicodedata module.
> 
> The identifier syntax is <ID_Start> <ID_Continue>\*.
> 
> ID_Start is defined as all characters having one of the general
> categories uppercase letters (Lu), lowercase letters (Ll), titlecase
> letters (Lt), modifier letters (Lm), other letters (Lo), letter
> numbers (Nl), plus the underscore (XXX what are "stability extensions"
> listed in UAX 31).
> 
> ID_Continue is defined as all characters in ID_Start, plus nonspacing
> marks (Mn), spacing combining marks (Mc), decimal number (Nd), and
> connector punctuations (Pc).
> 
> All identifiers are converted into the normal form NFC while parsing;
> comparison of identifiers is based on NFC.

Martin:

I don't understand Unicode nearly well enough to really comment on this, 
but could you add a comment that the PEP3101 code might need to be 
adjusted to deal with Unicode identifiers?

I don't actually think your PEP would make any difference to how we're 
parsing, because we don't have an "is this a valid character for an 
identifier" function.  But I'd like to get a note somewhere in the PEP 
saying that all code that parses for identifiers might be impacted.  The 
PEP 3101 code is one place where we have such a parser.  We'd at least 
need to implement tests for Unicode identifiers.
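
As a rough illustration of what such a check could look like, here is a
minimal sketch of the ID_Start / ID_Continue rules quoted above, written
against the unicodedata module.  The helper names (is_id_start,
is_id_continue, is_identifier) are made up for this example and are not
taken from the PEP 3101 code, and the open XXX about stability extensions
is simply left out.

```python
import unicodedata

# General categories as listed in the quoted draft text.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """Hypothetical helper: may ch begin an identifier?"""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """Hypothetical helper: may ch appear after the first character?"""
    return ch == "_" or unicodedata.category(ch) in ID_CONTINUE_CATEGORIES


def is_identifier(name):
    """Hypothetical helper: check name against <ID_Start> <ID_Continue>*."""
    # The draft says identifiers are converted to NFC while parsing,
    # so normalize before classifying the characters.
    name = unicodedata.normalize("NFC", name)
    if not name:
        return False
    return is_id_start(name[0]) and all(is_id_continue(c) for c in name[1:])


# Candidate test cases for code that parses identifiers.
assert is_identifier("spam")
assert is_identifier("L\u00f6wis")       # non-ASCII letter (Ll)
assert is_identifier("e\u0301")          # composes to U+00E9 under NFC
assert not is_identifier("2spam")        # Nd is allowed only after the start
assert not is_identifier("\u0301x")      # a combining mark cannot start a name
```

Cases like these would be a starting point for tests of any parser that
has to recognize identifiers; they say nothing about how the real parser
is structured.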

Which reminds me that we need better tests for the existing PEP 3101 
code, especially for strings with surrogate pairs.  I'll look at beefing 
that up.
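
To make the surrogate-pair case concrete, here is a small sketch of the
kind of test data involved.  It assumes a Python 3 interpreter where a str
is indexed by code point (so the character below has length 1); on a build
that stores strings as UTF-16 code units the padding check would need
adjusting.  The helper names are hypothetical and exist only for this
example.

```python
def utf16_code_units(text):
    """Hypothetical helper: return the UTF-16 code units of text as ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate_pair(units):
    """Hypothetical helper: is units exactly one high + low surrogate?"""
    return (len(units) == 2
            and 0xD800 <= units[0] <= 0xDBFF
            and 0xDC00 <= units[1] <= 0xDFFF)


# U+10400 lies outside the Basic Multilingual Plane, so UTF-16 needs
# two code units (a surrogate pair) to represent it.
astral = "\U00010400"
units = utf16_code_units(astral)
assert is_surrogate_pair(units)

# Formatting should pass the character through untouched ...
assert "{0}!".format(astral) == astral + "!"
# ... and width/alignment should count it as one character
# (assumption: code-point-indexed strings, see above).
assert "{0:>3}".format(astral) == "  " + astral
```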

Thanks.

Eric.


