[Python-3000] Support for PEP 3131

Stephen J. Turnbull stephen at xemacs.org
Sun May 27 16:03:59 CEST 2007

Jim Jewett writes:

 > > Cf characters?  Are we admitting "stupid bidi tricks", too?<wink>
 > If Tomer needs them.

But that's what I mean by respecting the work of the Unicode technical
committees.  They say he *doesn't* need them, no matter what he thinks.

They do make mistakes.  But they are far less likely to make mistakes
than a non-specialist native speaker.
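
For concreteness, here is a minimal sketch of how one might detect Cf (format control) characters, such as the bidi marks, in a candidate identifier. The helper name has_format_chars is hypothetical and not part of PEP 3131. It uses only the standard unicodedata module.

```python
import unicodedata

def has_format_chars(name):
    """Return True if any character in name has general category Cf."""
    return any(unicodedata.category(ch) == "Cf" for ch in name)

# U+200F RIGHT-TO-LEFT MARK and U+202E RIGHT-TO-LEFT OVERRIDE are both Cf.
suspicious = "a\u200fb"
plain = "abc"
```

A checker like this would flag `suspicious` and pass `plain`, which is the kind of filter a conservative default table implies.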

 > Seriously, I wouldn't put Cf characters in the default accepted
 > table.  (But remember that *I* would limit that default to ASCII.)

It's not the default that matters.  It's what actually gets used that
matters.  If we start by saying "you can't have these characters" and
the users thumb their noses at us, OK, we made a mistake and we fix
it to correspond to what the users have actually shown to be best
current practice (BCP).

If we start by saying "you can have any characters you want", I'm
pretty sure we're making a mistake, and if so, we can't fix it any
more than we can get rid of Reply-To munging.

 > Agreed; but in my opinion, the decision to allow those characters is
 > local; the decision to rescind them would therefore also be local.

It is not a local decision, not in PEP 3131.  PEP 3131 clearly intends
to conform to UAX #31.  (I think it still needs to *explicitly* state
that it's defining a profile of UAX #31, since there are restrictions
on ASCII identifier characters in Python that are not in the basic
definitions of UAX #31.)  Your proposal would return PEP 3131 to a
blank sheet of paper, and ensure non-conformance with an important
normative Annex of Unicode.

 > I had been thinking of the unicode version as a feature that didn't
 > change within a python release.  Perhaps that is negotiable?

I think it's a bad idea to allow it to change within a release.  All I
meant was that there could be a well-known mechanism for using
different tables, either at run-time or at compile-time, so that users
could change it if they want to.
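
A rough sketch of what such a mechanism might look like, assuming a hypothetical is_identifier function that takes the extra characters as a parameter.  The category sets approximate the ID_Start and ID_Continue definitions PEP 3131 borrows from UAX #31, but this sketch ignores NFKC normalization and the Other_ID_Start and Other_ID_Continue properties, so it is an illustration rather than a conformant implementation.

```python
import unicodedata

# General categories that may start or continue an identifier (approximation).
START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(name, extra_allowed=frozenset()):
    """Check name against the default table plus a user-supplied extension.

    extra_allowed is a hypothetical set of additional characters that a
    site has chosen to permit, at the cost of conformance.
    """
    if not name:
        return False
    first, rest = name[0], name[1:]
    if (first != "_" and first not in extra_allowed
            and unicodedata.category(first) not in START_CATEGORIES):
        return False
    return all(
        ch in extra_allowed or unicodedata.category(ch) in CONTINUE_CATEGORIES
        for ch in rest
    )
```

The default stays fixed for a given release, and only a user who explicitly passes a different table gets different behavior.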

People who need Lepcha and Cham and want to have a Python that uses
unapproved code points for them will have to use a Python which is not
conformant.  Let them, of course, but I don't see why the 6 billion
potential Python users who have never heard of Lepcha, Cham, or the
"IBM corporate extension character set for Japanese" should need to
forgo Unicode conformance as well.

 > > Maybe the way to handle this is to allow private-space characters in
 > > identifiers as an option.  That would be doable with your well-known
 > > file scheme.  But it's very dangerous across modules.
 > It turns out that page was out of date; Lepcha and Cham now have code
 > points which haven't been formally approved, but aren't likely to
 > change.  Officially, they're still undefined, but using private-space
 > probably isn't the right answer.  So either we allow these particular
 > "undefined" characters, or we (for now) disallow Lepcha and Cham.

The law of the excluded middle doesn't apply in that way.  It's
trivial to "cast" the unofficial code points into "private space" as a
block.  This technique was used in XEmacs/CHISE (née XEmacs/UTF-2000)
to grandfather the old MULE codes while they filled out the Unicode
space, and to map character sets that are not Unicode conformant into
Unicode space while preserving collating order and so on.

Granted, that's a research extension, not a production editor, but the
technique seems to work pretty well for the people who need such
things.  Any Python code that doesn't assume a numerical relationship
between the Lepcha block and any other block will work unchanged, and
implementing the changeover for old versions of Python that don't know
about Lepcha simply requires installing a Lepcha compatibility codec
to do the trivial mapping.  Is that cool or what?
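
As a sketch of that trivial mapping, assume the provisional block occupies U+1C00..U+1C4F (the range used here for Lepcha is an assumption for illustration) and that we cast it to the start of the BMP private use area at U+E000.  The names to_private and from_private are hypothetical.  A real compatibility codec would wrap the same tables in the codecs machinery.

```python
# Assumed range for the provisional block, and the chosen private-use base.
PROVISIONAL_START = 0x1C00
PROVISIONAL_END = 0x1C4F
PRIVATE_BASE = 0xE000

# Shift the whole block by a constant offset, so order within it is preserved.
_TO_PRIVATE = {
    cp: PRIVATE_BASE + (cp - PROVISIONAL_START)
    for cp in range(PROVISIONAL_START, PROVISIONAL_END + 1)
}
_FROM_PRIVATE = {private: cp for cp, private in _TO_PRIVATE.items()}

def to_private(text):
    """Cast provisional code points into private space; leave the rest alone."""
    return text.translate(_TO_PRIVATE)

def from_private(text):
    """Inverse mapping, for when the block is officially approved."""
    return text.translate(_FROM_PRIVATE)
```

Because the block moves as a unit, code that only compares characters within the block keeps working; code that assumes a numerical relationship to other blocks does not.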

The main problem with this technique is that on some platforms you
have to be careful about casting into the BMP, because vendors like
Microsoft and Apple have a penchant for using a lot of the BMP private
space for corporate logos and the like.  And I think Klingon is
standard on Linux (or has the Unicode consortium approved a Klingon
block since I last looked?)
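
One way to sidestep the vendor collisions is to cast into the supplementary private use planes instead of the BMP.  A small sketch, with a hypothetical helper private_use_area, that reports which private use range a character falls in:

```python
def private_use_area(ch):
    """Name the private use range containing ch, or None if it is not private."""
    cp = ord(ch)
    if 0xE000 <= cp <= 0xF8FF:
        return "BMP"        # the range vendors tend to fill with logos
    if 0xF0000 <= cp <= 0xFFFFD:
        return "plane 15"   # Supplementary Private Use Area-A
    if 0x100000 <= cp <= 0x10FFFD:
        return "plane 16"   # Supplementary Private Use Area-B
    return None
```

A mapping whose targets all report "plane 15" or "plane 16" stays out of the crowded BMP range, at the price of needing non-BMP support on the platform.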
