[Python-3000] PEP 3131 - the details

Jim Jewett jimjjewett at gmail.com
Fri May 18 17:24:19 CEST 2007


On 5/17/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > is it generally agreed that the
> > Unicode character classes listed in the PEP are the ones we want to
> > include in identifiers? My preference is to be conservative in terms of
> > what's allowed.

> John Nagle suggested to consider UTR#39
> (http://unicode.org/reports/tr39/). I encourage anybody to help me
> understand what it says.

> The easiest part is 3.1: this seems to say we should restrict characters
> listed as "restrict" in [idmod]. My suggestion would be to warn about
> them. I'm not sure about the purpose of the additional characters:
> surely, they don't think we should support HYPHEN-MINUS in identifiers?

Rather, they mean that it is commonly used (in Lisp and DNS names, at
least), and that they deem it safe, provided you have applied their
exclusions, such as the dashes.  Python should still use a tailoring
and exclude it.

> 4. Confusable Detection: Without considering details, it seems you need
> two strings to decide whether they are confusable. So it's not clear
> to me how this could apply to banning certain identifiers.

In most cases, the strings are confusable because individual characters are.

TR 39 makes it sound more complicated than it needs to be, because they
want to permit all sorts of strangeness, so long as it is at least
unambiguous strangeness.
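
In code, the idea boils down to mapping each character to a visual
prototype and comparing the results.  A toy sketch (the CONFUSABLES
table here is a tiny hand-picked stand-in for the real confusables.txt
data, and the real algorithm also normalizes strings first):

    # Toy sketch of TR 39 "skeleton" matching.  CONFUSABLES is a tiny
    # hand-picked stand-in for the real confusables.txt data file.
    CONFUSABLES = {
        "1": "l",        # DIGIT ONE looks like LATIN SMALL LETTER L
        "\u0430": "a",   # CYRILLIC SMALL LETTER A looks like Latin a
        "\u0440": "p",   # CYRILLIC SMALL LETTER ER looks like Latin p
    }

    def skeleton(s):
        # Map every character to its visual prototype.
        return "".join(CONFUSABLES.get(ch, ch) for ch in s)

    def confusable(a, b):
        # Distinct strings whose skeletons collide are confusable.
        return a != b and skeleton(a) == skeleton(b)

    print(confusable("l1", "ll"))                    # True (single-script)
    print(confusable("paypal", "p\u0430yp\u0430l"))  # True (mixed-script)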

My take:

Single-script confusables are things like "1" vs "l", and it is
probably too late to fight them.

Whole-script confusables are cases where two scripts look alike; you
can get something looking like "scope" in either Latin or Cyrillic.
If we're going to allow non-Latin identifiers, then we'll probably
have to live with this.

Mixed-script confusables are spoofing that wouldn't work if you
insisted that any single identifier stick to a "single" script.
('pаypаl', with Cyrillic 'а's).
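
The spoof is easy to demonstrate with nothing but the stdlib; the two
spellings render identically but compare unequal:

    import unicodedata

    ascii_spelling = "paypal"
    spoofed = "p\u0430yp\u0430l"   # U+0430 is the Cyrillic 'а'

    print(ascii_spelling == spoofed)   # False
    for ch in sorted(set(spoofed) - set(ascii_spelling)):
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
    # U+0430 CYRILLIC SMALL LETTER A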

Their algorithm talks about entire strings because they want to allow
'toys-я-us'.
Technically, Latin doesn't have a character that looks like a
backwards-R, and Cyrillic doesn't have matches for *all of* "toys us".
Personally, I don't see a strong need to support toys_я_us just
because it would be possible.

On the other hand, I'm not sure how often users of non-Latin languages
will want to mix in Latin letters.  The tech report suggested that it
is fairly common to use all of (Hiragana | Katakana | Han | Latin) in
Japanese text, but I'm not sure whether it would be normal to mix them
within a single identifier.

> 5. Mixed Script Detection: That might apply, but I can't map the
> algorithm to terminology I'm familiar with. What is UScript.COMMON
> and UScript.INHERITED?

Those are characters used in many different languages.  From TR 24:

    http://www.unicode.org/reports/tr24/

    Inherited—for characters that may be used with multiple scripts,
    and inherit their script from the preceding characters.  Includes
    nonspacing marks, enclosing marks, and the zero width
    joiner/non-joiner characters.

    Common—for other characters that may be used with multiple scripts.


> I'm skeptical about mixed-script detection,
> because you surely want to allow ASCII digits (0..9) in Cyrillic

According to http://www.unicode.org/Public/UNIDATA/Scripts.txt, the 52
letters [A-Za-z] are Latin, but the rest of ASCII (including digits)
is Common, and should be allowed with any script.
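
For what it's worth, mixed-script detection then reduces to a small
check per identifier.  A rough sketch, assuming a local copy of that
Scripts.txt file (the parsing is simplified but follows the published
format):

    # Sketch of single-script enforcement per identifier.  Scripts.txt
    # lines look like "0041..005A ; Latin # ..." (single code points
    # omit the "..").
    def load_scripts(path="Scripts.txt"):
        table = {}
        for line in open(path, encoding="utf-8"):
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            rng, script = [part.strip() for part in line.split(";")]
            lo, _, hi = rng.partition("..")
            for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                table[cp] = script
        return table

    SCRIPTS = load_scripts()

    def scripts_used(identifier):
        # Common and Inherited go with any script, so ignore them.
        found = set(SCRIPTS.get(ord(ch), "Unknown") for ch in identifier)
        return found - set(["Common", "Inherited"])

    def is_single_script(identifier):
        return len(scripts_used(identifier)) <= 1

    print(is_single_script("count2"))            # True: digits are Common
    print(is_single_script("p\u0430yp\u0430l"))  # False: Latin + Cyrillic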

-jJ

