[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Mon Jun 11 15:58:58 CEST 2007


On 6/10/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> Indeed, PEP 3131 gives a predictable identifier character set.
> >> Adding per-site options to change the set of allowable characters
> >> makes it less predictable.

> > Not in practice.
...

> > By allowing site modifications, the rule becomes:

> > It will use ASCII.

[and clipped "programs intended only for local use will use ASCII plus
letters that local users recognize."]

> Not universally - only on that site.

Yes, universally.  By allowing "any unicode character", you have no
reason to believe the next piece of code isn't doing something
strange, either by accident or by malice.

By allowing "ASCII + those listed in the site config", then the rule
will change from

    "It will use ASCII, always" (today)
to
    "It will use ASCII if it is intended for distribution."
plus
    "local programs can use ASCII + locally recognized letters"

That is slightly more complicated than ASCII-only, but only for those
who want to use the extended charsets -- and either rule is still
straightforward.

The rule proposed in PEP 3131 is

    "It will use something that is numerically a letter or number, to
someone somewhere."

Given the style guide of ASCII for internationally targeted open
source, that will degrade to

    "It should use ASCII".
    "But it might not, since there will be no feedback or apparent
downside to violating the style rule, even for distributed code."
    "In fact, it might even use something downright misleading, and
you won't have any warning, because we thought that maybe someone,
somewhere, might have wanted that character in a different context."

And no, I don't think I'm exaggerating with that last one; we aren't
proposing rules against mixed script identifiers (or even limiting
script switches to occur only at the _ character).  It will be
perfectly legitimate to apparently end a string with three consecutive
prime characters.  It will be bad style, but there will be nothing to
tip off the non-paranoid.
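
To make the prime example concrete, here is a small check.  It assumes
a Python 3 interpreter that implements the PEP 3131 identifier rules
and provides str.isidentifier().  U+02B9 MODIFIER LETTER PRIME is
picked here as one plausible "prime" character; the text above does
not name a specific code point.

```python
import unicodedata

# Assumption: U+02B9 stands in for the "prime" look-alike of a quote mark.
PRIME = "\u02b9"

def describe(ch):
    """Return (Unicode name, general category, usable as an identifier)."""
    return unicodedata.name(ch), unicodedata.category(ch), ch.isidentifier()

prime_info = describe(PRIME)   # category Lm, i.e. a "letter"
quote_info = describe("'")     # the real ASCII apostrophe

# Three primes in a row form a legal name, not a string delimiter.
triple_prime_is_name = (PRIME * 3).isidentifier()

print(prime_info)
print(quote_info)
print(triple_prime_is_name)
```

Because the prime is classified as a letter, the tokenizer has no
grounds to reject it, which is why nothing tips off the reader.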

In theory, we could solve this by limiting the non-ASCII characters,
but I don't think we can do that in practice.  The unicode consortium hasn't
even tried; even XID + security modifications + NFKC still includes
characters that are intended to look identical; all the security
modifications do is eliminate characters that do *not* have any
expected legitimate use.  (Example:  no living language uses them.)
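
A minimal sketch of why NFKC alone does not help, using only the
standard unicodedata module.  The Latin/Cyrillic "a" pair is just one
familiar example of look-alikes, chosen for illustration.

```python
import unicodedata

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

def nfkc(s):
    """Apply the normalization form that PEP 3131 uses for identifiers."""
    return unicodedata.normalize("NFKC", s)

# Both are legal identifiers on their own.
both_valid = latin_a.isidentifier() and cyrillic_a.isidentifier()

# NFKC leaves each one unchanged, so the two names stay distinct
# even though most fonts render them identically.
still_distinct = nfkc(latin_a) != nfkc(cyrillic_a)

# NFKC only folds compatibility forms, such as the "fi" ligature U+FB01.
ligature_folded = nfkc("\ufb01") == "fi"

print(both_valid, still_distinct, ligature_folded)
```

So two variables that look the same on screen can still be two
different names after normalization.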

I don't think we want to wade too deeply into the morass of
confusables detection; the unicode consortium itself says the problem
is neither solved nor stable.

It might be a good idea to restrict (within-a-single-ID) script
switches to only occur at the "_", but I'm not sure a 95% solution is
worth doing.
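
A rough sketch of what such a check could look like.  The function
names are hypothetical, and since the standard library does not expose
the Unicode Script property, the script is approximated here by the
first word of the character's Unicode name.  That shortcut is an
assumption that works for the common cases and nothing more.

```python
import unicodedata

def rough_script(ch):
    """Approximate the script by the first word of the Unicode name.

    Not the real Script property; "LATIN", "CYRILLIC", "GREEK" and
    "HANGUL" come out right, many other characters will not.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def single_script_segments(identifier):
    """True if each '_'-separated segment stays within one rough script."""
    for segment in identifier.split("_"):
        # Digits are treated as script-neutral.
        scripts = {rough_script(ch) for ch in segment if not ch.isdigit()}
        if len(scripts) > 1:
            return False
    return True

# Latin segment, then a Cyrillic segment: the switch happens at "_".
ok_at_underscore = single_script_segments(
    "max_\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u0435")

# "max" with a Cyrillic "a" in the middle: switch inside a segment.
mixed_inside = single_script_segments("m\u0430x")

print(ok_at_underscore, mixed_inside)
```

This also shows where the remaining 5% comes from: a segment written
entirely in look-alike letters from one script passes the check.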

By saying "Only characters you or your sysadmin expected", we at
least limit it to things the user will be expecting and can recognize.
 (Unless the sysadmin decides otherwise.)

> I don't know what rule is
> in force on my buddy's machine, so predicting it becomes harder.

But you know ASCII will work.

If he used the same local install (classroom peer, member of the same
user group, etc), then your local characters will probably work too.

If he is really your buddy, he probably trusts you enough to allow
your charset if you tell him about it.

> I just put wording in the PEP that makes it clear that, whatever
> the problem, a global flag is not an acceptable solution.

I agree that a single flag doesn't really solve the problem.  But a
global configuration does go a long way.

For me personally, I would be more willing to allow Latin-1 than
Hangul, because I can recognize the Latin-1 characters.  (I still
wouldn't allow them all by default; the difference between the various
lower-case i's is small enough -- to me -- that I want a warning when
one is used.)  Hangul is more acceptable than Cyrillic, because at
least it is obviously foreign; I won't mistake it for something.

Someone who uses Cyrillic on a daily basis might well have the
opposite preferences.  I support letting her use Cyrillic if she wants
to; I just don't want it to work on my machine without my knowing
about it.  But I would like to be able to accept é and ç (French
characters) without shutting off the warning for Cyrillic or Ogham.

Allowing ASCII plus "chars specified by the site or user through a
config file" meets that goal.
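
As a sketch only: the config format below (literal characters or
U+XXXX code points, with "#" comments) and the function names are
invented for illustration.  Neither the PEP nor this thread defines
them.

```python
import warnings

def parse_allowed(config_text):
    """Parse a hypothetical site config into a set of extra characters."""
    allowed = set()
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        for token in line.split():
            if token.upper().startswith("U+"):
                allowed.add(chr(int(token[2:], 16)))
            else:
                allowed.update(token)
    return allowed

def unexpected_chars(identifier, allowed):
    """Characters that are neither ASCII nor listed in the config."""
    return sorted({ch for ch in identifier
                   if ord(ch) > 127 and ch not in allowed})

def check_identifier(identifier, allowed):
    """Warn about unlisted characters; return True if there are none."""
    bad = unexpected_chars(identifier, allowed)
    if bad:
        listing = ", ".join("U+%04X" % ord(ch) for ch in bad)
        warnings.warn("identifier %r uses unlisted characters: %s"
                      % (identifier, listing))
    return not bad

# Hypothetical config for a site that accepts two French letters.
SITE_CONFIG = """
# letters accepted on this site
\u00e9 U+00E7
"""

allowed = parse_allowed(SITE_CONFIG)
french_ok = check_identifier("caf\u00e9", allowed)     # listed letter
cyrillic_ok = check_identifier("m\u0430x", allowed)    # warns
```

With this shape, accepting é and ç does not silence the warning for a
Cyrillic or Ogham character, since each one has to be listed
explicitly.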

-jJ

