[Python-Dev] PEP 263 considered faulty (for some Japanese)

Stephen J. Turnbull stephen@xemacs.org
18 Mar 2002 20:48:49 +0900


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    Martin> That is simply not true. The encoding applies to the
    Martin> entire source code.

    Martin> It is only that it is processed just for Unicode literals,

Would you please unpack this?  As it stands it looks like an oxymoron.

    Martin> and this is a documented deviation of the language
    Martin> implementation from the language spec.

I don't see any need for a deviation of the implementation from the
spec.  Just slurp in the whole file in the specified encoding.  Then
cast the Unicode characters in ordinary literal strings down to
bytesize (my preference, probably with errors on Latin-1 <0.5 wink>) or
re-encode them (Guido's and your suggestion).  People who don't like
the results in their non-Unicode literal strings (probably few) should
use hex escapes.  Sure, you'll have to rewrite the parser in terms of
UTF-16.  But I thought that was where you were going anyway.
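
A minimal sketch of the two options above, written in present-day
Python 3 rather than anything from the 2002 implementation.  The
function names, the `limit` parameter, and the EUC-JP sample are all
hypothetical, chosen only for illustration; real literal handling
would live in the tokenizer, not in a `split` call.

```python
def decode_source(raw: bytes, declared_encoding: str) -> str:
    """Slurp the whole file: decode every byte in the declared encoding."""
    return raw.decode(declared_encoding)


def narrow_literal(body: str, limit: int = 0xFF) -> bytes:
    """Option 1: cast each character of an ordinary literal down to one byte.

    `limit` is an assumption: 0xFF accepts Latin-1, 0x7F rejects it too.
    """
    for ch in body:
        if ord(ch) > limit:
            raise ValueError(f"{ch!r} does not fit in an ordinary string literal")
    return bytes(ord(ch) for ch in body)


def reencode_literal(body: str, declared_encoding: str) -> bytes:
    """Option 2: re-encode the literal back into the declared encoding."""
    return body.encode(declared_encoding)


# Hypothetical one-line source file saved as EUC-JP.
raw = "s = '日本語'".encode("euc-jp")
text = decode_source(raw, "euc-jp")
body = text.split("'")[1]  # crude literal extraction, for the sketch only
```

With option 2 the bytes of the literal come out exactly as they were in
the file; with option 1 the Japanese literal is rejected and the author
would fall back on hex escapes.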

If not, it should be nearly trivial to rewrite the parser in terms of
UTF-8 (since it is a superset of ASCII and non-ASCII is currently only
allowed in comments or guarded by a (Unicode)? string literal AFAIK).
The main issue would be anything that involves counting characters
(not bytes!), I think.  Everything else is a no-op: in UTF-8, octets
with the high bit set occur only as parts of whole non-ASCII
characters, and those appear only in things that can be treated as
single tokens (string literals and comments).  So just keep glomming
them onto the current token until you hit one of the token-ending
characters of the current ASCII-based implementation.  No
need to change any syntax.  Transforming the UTF-8 to UTF-16 for
Unicode string literals is fast, easy to implement, and guaranteed
invertible (modulo the UTF-32 vs UCS-4 issue).
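
The byte-level argument can be sketched as follows, again in
present-day Python 3.  `scan_quoted` is a hypothetical stand-in for the
tokenizer's string-literal scanner, not code from any Python release;
it looks only at ASCII bytes (the quote and the backslash) and lets
everything else ride along on the current token.

```python
def scan_quoted(source: bytes, start: int) -> int:
    """Return the index just past the closing quote of the literal at `start`.

    Only ASCII bytes are inspected.  In UTF-8 every byte of a non-ASCII
    character has the high bit set, so it can never be mistaken for the
    quote or the backslash.
    """
    quote = source[start]
    i = start + 1
    while i < len(source):
        if source[i] == 0x5C:  # backslash: skip the escaped byte
            i += 2
            continue
        if source[i] == quote:
            return i + 1
        i += 1
    raise ValueError("unterminated string literal")


source = "x = '日本語' + y".encode("utf-8")
start = source.index(b"'")
end = scan_quoted(source, start)
literal = source[start:end]

# Counting is the one place where bytes and characters disagree.
byte_count = len(literal)
char_count = len(literal.decode("utf-8"))

# UTF-8 -> UTF-16 -> UTF-8 gives back the original bytes.
utf16 = literal.decode("utf-8").encode("utf-16-le")
round_tripped = utf16.decode("utf-16-le").encode("utf-8")
```

Here the scanner never needs to know where one character ends and the
next begins; only `byte_count` versus `char_count` (for column numbers,
say) requires decoding.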

The UTF-8 strategy probably gives you identifiers containing arbitrary
characters reliably (that is, as reliable as anything that admits more
than one encoding can be) and nearly for free, in the same way as you
get arbitrary string data and comments.  It's debatable whether that's
a good thing, of course.  (Except for the obfuscators, to whom "it's
all Greek to me" will be music to their ears.)
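
A sketch of why identifiers would come nearly for free under the UTF-8
strategy.  The helper names are hypothetical, and the rule "any
high-bit-set byte is an identifier byte" is an assumption made for
illustration; it accepts every non-ASCII character, punctuation
included, and does not check for a leading digit.

```python
def is_identifier_byte(byte: int) -> bool:
    """ASCII letters, digits, underscore, or any byte of a non-ASCII character."""
    return byte >= 0x80 or byte == 0x5F or chr(byte).isalnum()


def scan_identifier(source: bytes, start: int) -> int:
    """Return the index just past the identifier that begins at `start`."""
    i = start
    while i < len(source) and is_identifier_byte(source[i]):
        i += 1
    return i


source = "αβγ_1 = 42".encode("utf-8")
end = scan_identifier(source, 0)
name = source[:end].decode("utf-8")
```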


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.