[Python-3000] Support for PEP 3131

Stephen J. Turnbull stephen at xemacs.org
Mon May 28 05:51:46 CEST 2007


Collin Winter writes:

 > Sincere question: if these characters aren't needed, why are they
 > provided? From what I can tell by googling, they're needed when, e.g.,
 > Arabic is embedded in an otherwise left-to-right script. Do I have
 > that right? That sounds pretty close to what you'd get when using
 > Arabic identifiers with the English keywords/stdlib.

The problem is visual presentation to humans.  It's very much like
unmarshalling little-endian integers from a byte stream.  The byte
stream by definition is big-endian, so when you simply memcpy into the
stream buffer, little-endian integers will come out in reverse byte
order.  Bidi works a little bit differently; in principle it works
both ways (if you start LTR then the RTL is in reverse order in the
stream, and vice versa) since both kinds of script are character
streams.  But in both cases, *inside* the computer, there is a natural
"big-endian" order and the computer does not get confused.  That is
one sense in which format characters are YAGNIs.
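
To make the endianness analogy concrete, here is a minimal sketch in Python using the standard struct module.  The value and the names "wire" and "le_memory" are only for illustration; the point is that the same integer has one stream order and a reversed in-memory order on a little-endian machine.

```python
import struct

value = 0x01020304

# Stream ("big-endian") order, as the byte stream defines it.
wire = struct.pack(">I", value)

# How a little-endian machine lays the same integer out in memory.
le_memory = struct.pack("<I", value)

print(wire.hex())       # 01020304
print(le_memory.hex())  # 04030201

# A raw copy of little-endian memory into the stream is byte-reversed.
print(le_memory == wire[::-1])  # True
```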

Now, identifiers are by definition character streams.  If an English
speaker would pronounce the spelling of an English word "A B C", and
an Arabic speaker an Arabic word as "1 2 3", then *as an identifier*
the combination English then Arabic is spelled "A B C _ 1 2 3".  And
that's all the Python compiler needs to know.  In fact, on the editor
display this would be presented "ABC_321".  In data entry, you'd see
something like this:

key     display
 A      A
 B      AB
 C      ABC
 _      ABC_
 1      ABC_1
 2      ABC_21
 3      ABC_321

This can be done algorithmically (this is the "Unicode Standard Annex
#9", aka "UAX #9", that you may have seen references to), to a very
high degree of approximation to what human typesetters do in bidi
cultures.
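
As a rough sketch of the data entry table above, the following Python toy reproduces the "display" column.  It is emphatically not UAX #9: it assumes a left-to-right line and simply reverses each maximal run of strongly right-to-left characters, ignoring embedding levels, digits, neutrals, mirroring, and format characters.  The name "toy_display" is made up for illustration, and three Arabic letters stand in for "1 2 3".

```python
import unicodedata

def toy_display(logical: str) -> str:
    """Toy visual order for an LTR line: reverse each run of strong RTL chars.

    NOT an implementation of UAX #9; it only handles the simple case
    discussed in the text.
    """
    out = []
    run = []  # pending run of right-to-left characters, in logical order
    for ch in logical:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            run.append(ch)
        else:
            out.extend(reversed(run))  # flush the RTL run in visual order
            run.clear()
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

# "1 2 3" stands for three Arabic letters typed in this order.
ONE, TWO, THREE = "\u0627", "\u0628", "\u062a"
logical = "ABC_" + ONE + TWO + THREE

# Display after each keystroke, as in the table above.
for n in range(1, len(logical) + 1):
    print(toy_display(logical[:n]))
```

The string in memory stays "A B C _ 1 2 3" throughout; only the presentation changes.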

Now suppose you want to see on screen the contents of memory cells as
characters.  Then you would put into memory something like "A B C _
LRO 1 2 3" where LRO is a control character that says "no matter what
directional property a character normally has, override it with
left-to-right until I say otherwise."  That logical sequence of
characters is indeed displayed "ABC_123".

But how about those as identifiers?  Note that in memory the sequence
of printing characters is "A B C _ 1 2 3" in each case.  So it makes
sense to think of that as the identifier, *ignoring* the presentation
control characters.
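
A minimal sketch of "ignoring the presentation control characters", assuming that dropping everything in Unicode general category Cf (format) is an acceptable stand-in for that rule.  The helper name "strip_format_chars" is hypothetical, and Arabic letters again stand in for "1 2 3".

```python
import unicodedata

def strip_format_chars(s: str) -> str:
    """Drop every character in general category Cf (format), e.g. LRO."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
ONE, TWO, THREE = "\u0627", "\u0628", "\u062a"

plain = "ABC_" + ONE + TWO + THREE
overridden = "ABC_" + LRO + ONE + TWO + THREE

# The raw strings differ by one code point ...
print(plain == overridden)                      # False
# ... but the sequence of printing characters is the same.
print(strip_format_chars(overridden) == plain)  # True
```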

Suppose we prohibit the directional control characters.  Then a
Unicode conforming editor will put the characters in logical order "A
B C _ 1 2 3" in the file, and display them naturally (to a speaker of
Arabic) as "ABC_321".  This is going to be by far the most common
case, and the user knows that it works this way.  I don't see a
problem here.  Do you?

OK, now let's consider the cases of breakage.  Consider a malicious
author who uses LRO as "A B C _ 1 2 LRO 3" which displays as "ABC_213"
(IIRC, I haven't actually tried to implement bidi in a very long
time).  Can you think of a genuine use for that?  I can't; I think
it's a bad idea to allow it.
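
If the directional controls are prohibited, the check itself is cheap.  The sketch below assumes the set of interest is LRM, RLM, LRE, RLE, PDF, LRO and RLO; both that choice and the name "has_bidi_control" are illustrative, not taken from any standard or from Python itself.

```python
# Assumed set of directional controls: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")

def has_bidi_control(identifier: str) -> bool:
    """Return True if the string contains any directional control character."""
    return any(ch in BIDI_CONTROLS for ch in identifier)

LRO = "\u202d"
ONE, TWO, THREE = "\u0627", "\u0628", "\u062a"

honest = "ABC_" + ONE + TWO + THREE
malicious = "ABC_" + ONE + TWO + LRO + THREE  # the "A B C _ 1 2 LRO 3" case

print(has_bidi_control(honest))     # False
print(has_bidi_control(malicious))  # True
```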

On the other hand, you could have a situation where the printed
documentation uses the UAX #9 bidi algorithm, and discusses the
meaning of the identifier "ABC_321", while the reviewing programmer is
using a broken editor which implements overrides but not the
algorithm, and sees "ABC_123".  So in the case where LRO is permitted,
the author can force the same visual order in both the printed
documents and the reviewer's editor display.  But since that order is
the unnatural (to an Arabic reader) "ABC_123", it will be confusing
and hard to read.  Is this a win?

As somebody (I think Jim J) pointed out, bidi is a world of pain
unless and until *all* editors and readers implement a common set of
display conventions.  Python can't do anything that will unambiguously
reduce that pain.  So IMHO it is best to conform to a standard that
can be unambiguously implemented, and is likely to be available to the
majority of programmers who need to work with bidi environments.  That
is UAX #31, which mandates ignoring these format characters (in the
default profile), and strongly recommends prohibiting them in all
profiles.

