[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Tue May 22 22:29:02 CEST 2007


On 5/22/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> That's why Java and C++ use \u, so you would write L\u00F6wis
> as an identifier. ...
> I think you are really arguing for \u escapes in identifiers here.

Yes, that is effectively what I was suggesting.

> *This* is truly unambiguous. I claim that it is also useless.

It means users could see the usability benefits of PEP3131, but the
python internals could still work with ASCII only.

It simplifies checking for identifiers that *don't* stick to ASCII,
which reduces some of the concerns about confusable characters and
about which characters to allow.
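For concreteness, here's a sketch (mine, not anything from the PEP) of
how a \u-escaped identifier could be decoded for display while the
internals keep a pure-ASCII spelling; the "_uXXXX_" mangling format is
invented here purely for illustration:

```python
def decode_escapes(ident):
    # Turn the source spelling L\u00F6wis into the display form.
    return ident.encode('ascii').decode('unicode_escape')

def mangle_to_ascii(ident):
    # Internal, ASCII-only spelling: every non-ASCII character becomes
    # _uXXXX_, so the rest of the toolchain never sees raw Unicode.
    return ''.join(ch if ord(ch) < 128 else '_u%04X_' % ord(ch)
                   for ch in ident)
```

So decode_escapes('L\\u00F6wis') gives the name as the user sees it,
and mangle_to_ascii() of that gives 'L_u00F6_wis' for the internals.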

Short list of judgment calls that we need to resolve if we go with
non-ASCII identifiers, but can largely ignore if we just use escaping:

Based only on UAX 31:

    ID vs XID  (unicode changed their mind on recommendations)

    include stability extensions?  (*Python* didn't allow those
letters previously.)

    which ID_CONTINUE characters should be left out.  (We don't want
"-", and some of the punctuation and other marks may be closer to "-"
than to "_".  Or they might not be; I don't know how to judge that.)

    layout and control characters (At the top of section 2, TR31
recommends acting as though they weren't there ... but if we use a
normal (unicode) string, then they will still affect the hash.  Down
in 2.2, they say not to permit them, except sometimes...)

    Canonicalization

    Combining Marks should be accepted (only as continuation chars),
but not if they're enclosing marks, because ... well, I'm not sure,
but I'll have to trust them.

    Specific character Adjustments (sec 2.3) -- The example suggests
that we might have to tailor for our use of "_", though I didn't get
that from the table.  They do suggest tailoring out certain
Decomposition Types.

    Additional (non-letter?) characters which may occur in words (see
UAX29, but I don't claim to fully understand it)

    Undefined code points, particularly those which might be defined later?

    Should we exclude the letters that look like punctuation?  A
proposed update (http://www.unicode.org/reports/tr31/tr31-8.html)
mentions U+02B9 (modifier letter prime) only because the visually
equivalent U+0374 (Greek Numeral Sign) shouldn't be an identifier, but
does fold to it under (some?) canonicalization.  (They suggest
allowing both, instead of neither.)
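Some of these calls can at least be probed empirically.  Assuming an
implementation along the lines the PEP proposes -- XID_Start/XID_Continue
plus NFKC normalization, which is what str.isidentifier() checks in
Python 3 -- the combining-mark and canonicalization cases come out like
this:

```python
import unicodedata

# Combining marks (category Mn) are continuation-only:
assert ('a' + '\u0301').isidentifier()      # a + COMBINING ACUTE ACCENT
assert not ('\u0301' + 'a').isidentifier()  # can't *start* with one

# Enclosing marks (category Me) are excluded from XID_Continue entirely:
assert not ('a' + '\u20DD').isidentifier()  # COMBINING ENCLOSING CIRCLE

# Canonicalization: U+0374 GREEK NUMERAL SIGN folds to U+02B9 MODIFIER
# LETTER PRIME, so the two visually identical spellings collide.
assert unicodedata.normalize('NFKC', '\u0374') == '\u02B9'
```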

Then TR 39 http://www.unicode.org/reports/tr39/ recommends excluding
(most, but not all of)

    characters not in modern use;

    characters only used in specialized fields, such as liturgical
characters, mathematical letter-like symbols, and certain phonetic
alphabets;

    and ideographic characters that are not part of a set of core CJK
ideographs consisting of the CJK Unified Ideographs block plus IICore
(the set of characters defined by the IRG as the minimal set of
required ideographs for East Asian use).

They summarize this in
http://www.unicode.org/reports/tr39/data/xidmodifications.txt; I
wouldn't add the hyphen-minus back in, but I don't know whether
katakana middle dot should be allowed.
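Mechanically, applying such a restriction list is trivial; the sketch
below hard-codes just the two code points discussed above (a real check
would load the full xidmodifications.txt data):

```python
# Hand-listed stand-in for TR 39's restricted set -- illustration only.
RESTRICTED = {0x002D, 0x30FB}   # HYPHEN-MINUS, KATAKANA MIDDLE DOT

def tr39_ok(ident):
    # Reject any identifier containing a restricted code point.
    return all(ord(ch) not in RESTRICTED for ch in ident)
```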

Should mixed-script identifiers be allowed?  According to TR 36
(http://www.unicode.org/reports/tr36/) ASCII only is the safest, and
that is followed by limits on mixed-script identifiers.  Those limits
sound reasonable to me, but ... I'm not the one who would be mixing
them.

Note that even "highly restrictive" allows ASCII + Han + Hiragana +
Katakana, ASCII + Han + Bopomofo, and ASCII + Han + Hangul.  (I think
we wanted at least the ASCII numbers with anything.)
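The "highly restrictive" level can be sketched roughly as follows.
The script table here is a tiny hand-made stand-in for the real
Scripts.txt data, so treat it as illustration only:

```python
# Abbreviated script ranges; digits and "_" count as Common.
SCRIPTS = [
    (0x0030, 0x0039, 'Common'),     # ASCII digits
    (0x0041, 0x005A, 'Latin'),
    (0x0061, 0x007A, 'Latin'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x3040, 0x309F, 'Hiragana'),
    (0x30A0, 0x30FF, 'Katakana'),
    (0x3100, 0x312F, 'Bopomofo'),
    (0x4E00, 0x9FFF, 'Han'),
    (0xAC00, 0xD7A3, 'Hangul'),
]

def script_of(ch):
    if ch == '_':
        return 'Common'
    for lo, hi, name in SCRIPTS:
        if lo <= ord(ch) <= hi:
            return name
    return 'Unknown'

# The mixed-script combinations "highly restrictive" permits:
ALLOWED = [{'Latin', 'Han', 'Hiragana', 'Katakana'},
           {'Latin', 'Han', 'Bopomofo'},
           {'Latin', 'Han', 'Hangul'}]

def highly_restrictive_ok(ident):
    used = {script_of(ch) for ch in ident} - {'Common'}
    if len(used) <= 1:          # any single-script identifier is fine
        return True
    return any(used <= combo for combo in ALLOWED)
```

Under this sketch an all-Cyrillic name passes, Latin + Han + Katakana
passes, but Latin mixed with Cyrillic is rejected.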

-jJ
