[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Stephen J. Turnbull stephen at xemacs.org
Wed Jun 6 05:01:10 CEST 2007


Rauli Ruohonen writes:

 > On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > > I'd love to get rid of full-width ASCII and halfwidth kana (via
 > > compatibility decomposition).
 > 
 > If you do forbid compatibility characters in identifiers, then they
 > should be flagged as an error, not converted silently.

No.  The point is that people want to use their current tools; they
may not be able to easily specify normalization.  We should provide
tools to pick this lint from programs, but the normalization should be
done inside of Python, not by the user.
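A lint pass of that sort is easy to sketch with the standard library's unicodedata module; `compat_chars` below is a hypothetical helper, not an existing tool:

```python
import unicodedata

def compat_chars(identifier):
    """Return the characters of `identifier` that NFKC would rewrite,
    i.e. the compatibility characters a normalizing Python would fold."""
    return [ch for ch in identifier
            if unicodedata.normalize('NFKC', ch) != ch]

compat_chars('ｆｉｌｅ')   # full-width ASCII: every character is flagged
compat_chars('file')       # plain ASCII: nothing to report
```

A tool like this could warn about the lint while the compiler silently normalizes, giving users both convenience and a way to clean up their sources.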

Please look through the list (I've already done so; I'm speaking from
detailed examination of the data) and state what compatibility
characters you want to keep.

On reflection, I would make an exception for LATIN L WITH MIDDLE DOT
(both cases); just don't decompose it for the sake of Catalan.  (And
there possibly should be a warning for L followed by MIDDLE DOT.)  But
as a native English speaker and one who lectures and deals with the
bureaucracy in Japanese, I can tell you unequivocally I want the fi
and ffi ligatures and full-width ASCII compatibility decomposed, and
as a daily user of several Japanese input methods, I can tell you it
would be a massive pain in the ass if Python doesn't convert those,
and errors would be an on-the-minute-every-minute annoyance.
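Concretely, these are the NFKC foldings in question (including the Catalan L-with-middle-dot case mentioned above), as the standard unicodedata module reports them:

```python
import unicodedata

def nfkc(s):
    return unicodedata.normalize('NFKC', s)

assert nfkc('ﬁ') == 'fi'     # LATIN SMALL LIGATURE FI (U+FB01)
assert nfkc('ﬃ') == 'ffi'    # LATIN SMALL LIGATURE FFI (U+FB03)
assert nfkc('Ａ') == 'A'     # FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)
assert nfkc('ｶ') == 'カ'     # HALFWIDTH KATAKANA LETTER KA (U+FF76)
assert nfkc('ŀ') == 'l·'     # LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140)
```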

 > Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
 > "〓〓" == 
-------------- next part --------------
"カキ" or "A123" == "A123") is surprising.

How many Japanese documents do you deal with on a daily basis?  I live
with the half-width kana and full-width ASCII every day, and they are
simply an annoyance to me and to everybody I know.  They are treated
as font variants, not different characters, by *all* users.  Users are
quite happy to substitute ultra-wide ASCII fonts for JIS X 0208 ASCII,
or ultra-condensed fonts for JIS X 0201 kana.

Japanese don't expect equivalence, but that's because it's too much
effort for the programmers when nobody is asking for it; the users are
unsophisticated and don't demand it.  But where equivalence is
provided on web forms and the like, people are indeed surprised, they
are *impressed*.  "Wow!  Gaijin magic!  How'd he *do* that?!"  They
*hate* the fact that some forms want the postal code entered in JIS X
0208 full-width digits while others want ASCII (and I've even seen a
form that expected the address, including the yuubin mark, to be in
full-width JIS, but the postal code itself, embedded in the address,
had to be entered in ASCII or the form couldn't parse it).
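A form that accepted either style could simply normalize before parsing; the postal code below is an invented example, but the folding is exactly NFKC:

```python
import unicodedata

code = '１５４－００２３'   # full-width JIS X 0208 digits and hyphen
ascii_code = unicodedata.normalize('NFKC', code)
assert ascii_code == '154-0023'   # now trivially parseable
```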

 > In short, I would like this function to return 'OK' or be a
 > syntax error, but it should not fail or return something else:
 > 
 > def test():
 >     if 'A' == 'Ａ': return 'OK'
 >     A = 'O'
 >     Ａ = 'K' # as tested above, 'A' and 'Ａ' are not the same thing
 >     return locals()['A']+locals()['Ａ']

I would like this code to return "KK".  This might be an unpleasant
surprise, once, and there would need to be a warning on the box for
distribution in Japan (and other cultures with compatibility
decompositions).
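For reference, NFKC normalization of identifiers (but not of string literals) is what PEP 3131 proposes; under that scheme, and as Python 3 semantics can be checked with exec, the two spellings bind the same name:

```python
# Identifiers are NFKC-normalized at compile time; string literals are not.
ns = {}
exec("A = 'O'\nＡ = 'K'", ns)   # Ａ is FULLWIDTH LATIN CAPITAL LETTER A
assert ns['A'] == 'K'           # the second assignment rebound the same name
assert 'Ａ' not in ns           # no separate full-width key survives
```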

On the other hand, diffusion of non-ASCII identifiers at best will be
moderately paced; people will have to learn about usage and will have
time to get used to it.


