[Python-3000] Support for PEP 3131
jimjjewett at gmail.com
Thu May 24 20:14:41 CEST 2007
On 5/24/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim Jewett writes:
> > I would like an alert (and possibly an import exception) on any code
> > whose *executable portion* is not entirely in ASCII.
> Are you talking about language definition or implementation? I like
> the idea of such checks, as long as they are not mandatory in the
> language and can be turned off easily at run time in the default
> configuration. I'd also really like a generalization (described below).
Definition; I don't care whether it is a different argument to import
or a flag or an environment variable or a command-line option, or ...
I just want the decision to accept non-ASCII characters to be
explicit. Ideally, it would even be explicit per extra character
allowed, though there should obviously be shortcuts to accept entire
scripts.
> > > The only issues PEP 3131 should be concerned with *defining*
> > > are those that cause problems with canonicalization, and the range of
> > > characters and languages allowed in the standard library.
Sorry; I missed the "stdlib" part of that sentence when I first
replied. I agree except that the range of characters/languages
allowed by *python* is also an open issue.
> AFAIK *canonicalization* is also a solved issue (although exactly what
> "NFC" means might change with Unicode errata and of course with future
> addition of combining characters or precombined characters).
The Tech Reports seem to suggest NFKD -- and that makes a certain
amount of sense. Using compatibility characters reduces the problem
with equivalent characters that are distinct only for historical
reasons. Using decomposed characters simplifies processing.
On the other hand, NFC might often be faster in practice, since it
frequently requires no changes -- but if you skip the processing
needed to verify that, two equivalent identifiers can end up hashing
differently.
I'm willing to trust the judgment of those with more experience, but
the decision of which form to use should be explicit.
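As a concrete illustration of why the choice of form matters (Python
shown only for concreteness): two identifiers that render identically
can be distinct code-point sequences until one normalization form or
the other is applied.

```python
import unicodedata

# Two visually identical spellings: precomposed U+00E9 versus
# "e" followed by a combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed  # distinct code-point sequences

# Normalizing to either form makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The "K" (compatibility) forms additionally fold historical variants,
# e.g. the "fi" ligature U+FB01 becomes the two letters "fi".
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"
```

Without normalization, `composed` and `decomposed` would be two
different dictionary keys -- which is exactly the hashing problem
above.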
> The notion of "identifier constituent" is a bit thorny.
I think it is even thornier than you do, but I think we may agree on
an acceptable answer.
> Well, what I *really* want is a loadable table. My motivation is that
> I want organizations to be able to "enforce" a policy that is less
> restrictive than "ASCII-only" but more restrictive than "almost
> anything goes". My students don't need Sanskrit; Guido's tax
> accountant doesn't need kanji, and neither needs Arabic. I think that
> they should be able to get the same strict "alert or even import
> exception" (that you want on non-ASCII) for characters outside their
> larger, but still quite restricted, sets.
So how about
(1) By default, python allows only ASCII.
(2) Additional characters are permitted if they appear in a table
named on the command line.
These additional characters should be restricted to code points larger
than ASCII (so you can't easily turn "!" into an ID char), but beyond
that, anything goes. If you want to include punctuation or undefined
characters, so be it.
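A rough sketch of what such a check could look like, assuming the
table is just a set of permitted code points above U+007F (the
function and table names here are illustrative, not part of any real
tool or proposal text):

```python
import io
import tokenize

def check_identifiers(source, allowed_extra=frozenset()):
    """Return (identifier, offending_char) pairs for identifiers that
    use characters outside ASCII and outside the supplied table."""
    bad = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            for ch in tok.string:
                # ASCII always passes; anything above U+007F must be
                # explicitly listed in the allowed table.
                if ord(ch) > 0x7F and ch not in allowed_extra:
                    bad.append((tok.string, ch))
    return bad

# An organization-specific table might allow, say, Greek letters only:
greek = frozenset(chr(c) for c in range(0x0391, 0x03CA))
print(check_identifiers("\u03c0 = 3.14159\n", allowed_extra=greek))
print(check_identifiers("caf\u00e9 = 1\n", allowed_extra=greek))
```

The first call returns an empty list (pi is in the table); the second
flags the accented character, which could then be escalated to the
proposed alert or import exception.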
Presumably, code using Kanji would be fairly easy to run in a Kanji
environment, but code using punctuation or Linear B would ... need to
convince people that there was a valid reason for it.
Note that I think a single table argument is sufficient; I don't see
the point in saying that identifiers can include Japanese Accounting
Numbers, but can't start with them. (Unless someone is going to
suggest that they be parsed to their numeric value?)
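For what it's worth, the start/continue distinction being waved away
here is already visible in Unicode-style identifier rules, which
str.isidentifier() follows: some characters may continue an identifier
but not begin one.

```python
# Digits may continue an identifier but not start one.
assert "x1".isidentifier()
assert not "1x".isidentifier()

# Combining marks behave the same way: a combining acute accent
# (U+0301) is a valid continuation character but not a valid start.
assert ("a" + "\u0301").isidentifier()
assert not ("\u0301" + "a").isidentifier()
```

So a single flat table would implicitly collapse that distinction for
the extra characters it admits.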