[Python-3000] PEP 3131 - the details

Thu May 17 17:22:19 CEST 2007

> While there has been a lot of discussion as to whether to accept PEP 
> 3131 as a whole, there has been little discussion as to the specific 
> details of the PEP. In particular, is it generally agreed that the 
> Unicode character classes listed in the PEP are the ones we want to 
> include in identifiers? My preference is to be conservative in terms of 
> what's allowed.

John Nagle suggested considering UTR#39
(http://unicode.org/reports/tr39/). I encourage anybody to help me
understand what it says.

The easiest part is 3.1: this seems to say we should restrict the
characters listed as "restrict" in [idmod]. My suggestion would be to
warn about them rather than reject them. I'm not sure what the
additional characters are for: surely they don't think we should
support HYPHEN-MINUS in identifiers?

4. Confusable Detection: Without going into the details, it seems you
need two strings to decide whether they are confusable with each other.
So it's not clear to me how this could be used to ban individual
identifiers.
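
To illustrate why the check is relational, here is a minimal sketch of a
skeleton-style comparison. It assumes a tiny hand-picked mapping
(SAMPLE_PROTOTYPES, a hypothetical name) standing in for the real UTR#39
confusables data, and the helper names are invented for this example. A
single identifier has no verdict on its own; only a pair, or a group
sharing a scope, can be flagged.

```python
import unicodedata

# ASSUMPTION: a hand-picked sample, not the UTR#39 confusables data file.
SAMPLE_PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(s):
    """Map each character to its sample prototype, normalizing around it."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(SAMPLE_PROTOTYPES.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)


def confusable(a, b):
    """True if two distinct strings collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


def confusable_groups(identifiers):
    """Group identifiers (e.g. all names in one scope) by skeleton."""
    by_skeleton = {}
    for name in identifiers:
        by_skeleton.setdefault(skeleton(name), set()).add(name)
    return [sorted(group) for group in by_skeleton.values() if len(group) > 1]
```

With this shape, a tool could only report that two names in the same
module collide; it could not reject a name in isolation.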

5. Mixed Script Detection: That might apply, but I can't map the
algorithm to terminology I'm familiar with. What are UScript.COMMON
and UScript.INHERITED? I'm skeptical about mixed-script detection,
because you surely want to allow ASCII digits (0..9) in Cyrillic
identifiers, and I'm not sure whether the detection would claim that
the digits are Latin (which they aren't - they are Arabic numerals).
So a precise algorithm in Python (using unicodedata) would be
helpful. Even then, I would still like it to produce a warning only;
users more concerned about phishing could turn the warning into
an error.
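
As a starting point for such an algorithm, here is a rough sketch, not
the UTR#39 procedure itself. The unicodedata module does not expose the
Unicode Script property, so this version guesses the script from the
first word of each character's name, which is a heuristic and will be
wrong for some scripts. COMMON and INHERITED are local stand-ins for the
two UScript values: non-letters (digits, underscore) are treated as
common, and combining marks as inherited. MixedScriptWarning and the
function names are hypothetical.

```python
import unicodedata
import warnings

COMMON = "Common"        # stand-in for UScript.COMMON
INHERITED = "Inherited"  # stand-in for UScript.INHERITED


class MixedScriptWarning(UserWarning):
    """Hypothetical category, so users can filter this warning separately."""


def approx_script(ch):
    """Guess a script from the character name (ASSUMPTION: heuristic)."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return INHERITED   # combining marks follow their base character
    if not category.startswith("L"):
        return COMMON      # digits, underscore, punctuation, ...
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else COMMON


def scripts_used(identifier):
    """Set of guessed scripts, ignoring common and inherited characters."""
    return {approx_script(ch) for ch in identifier} - {COMMON, INHERITED}


def is_mixed_script(identifier):
    return len(scripts_used(identifier)) > 1


def check_identifier(identifier):
    """Warn, rather than fail, when an identifier mixes scripts."""
    scripts = scripts_used(identifier)
    if len(scripts) > 1:
        warnings.warn(
            "identifier %r mixes scripts: %s"
            % (identifier, ", ".join(sorted(scripts))),
            MixedScriptWarning,
            stacklevel=2,
        )
    return scripts
```

Under these assumptions a Cyrillic identifier with a trailing ASCII
digit is not flagged, because the digit counts as common. Users who
prefer a hard failure could opt in with
warnings.simplefilter("error", MixedScriptWarning).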

Regards,
Martin