Multibyte Character Support for Python

Stephen J. Turnbull stephen at xemacs.org
Thu May 9 01:46:06 EDT 2002


>>>>> "Martin" == Martin v Löwis <loewis at informatik.hu-berlin.de> writes:

    Martin> So in the end, the only acceptable strategy would be to
    Martin> allow identifiers that contain letters (or letterlike
    Martin> symbols) in arbitrary languages. For Python, that would
    Martin> mean that attributes must be Unicode objects, which could
    Martin> cause code breakage.

This would actually be rather simple if you just declare that Python
programs as submitted to the (internal) parser must be in UTF-8, and
ensure that PEP 263 codecs do this in a way transparent to the user.
These codecs would have to differ from ordinary codecs in one way:
they must know about ordinary Python strings (i.e., not Unicode), and
must _not_ translate those bytes to Unicode values, but rather pass
them to the parser with their integral values unchanged but
UTF-8-encoded.
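
To make the distinction concrete, here is a rough sketch of the two
treatments, written for a current Python 3 rather than the 2002
interpreter.  The function names are hypothetical, and using Latin-1 as
the "byte n becomes code point n" step is my assumption about one way to
keep the integral values unchanged; PEP 263 specifies nothing of the sort.

```python
def protect_raw_bytes(payload: bytes) -> bytes:
    """Contents of an ordinary string literal: keep each byte's integral
    value, but carry it to the parser as UTF-8."""
    # Latin-1 maps byte n to code point n for every n in 0..255, so no
    # byte is reinterpreted as a character of the source encoding.
    return payload.decode("latin-1").encode("utf-8")


def transcode_text(payload: bytes, source_encoding: str) -> bytes:
    """Everything else (identifiers, Unicode literals): a real transcode
    from the declared source encoding to UTF-8."""
    return payload.decode(source_encoding).encode("utf-8")


# The same four EUC-JP octets, treated both ways.
raw = "日本".encode("euc_jp")
as_ordinary_string = protect_raw_bytes(raw)      # octet values preserved
as_text = transcode_text(raw, "euc_jp")          # characters preserved
```

The first result is longer than the input because every octet >= 128
takes two bytes in UTF-8; the second is simply the UTF-8 spelling of the
same two characters.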

The parser would need to know what to do with ordinary strings (i.e.,
decode UTF-8 encoded octets back to "raw" form) and Unicode strings
(i.e., transform from UTF-8 to UTF-16).
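
On the parser side, the two cases might look roughly like the
following.  Again this is only a sketch in current Python 3 with
hypothetical function names; it assumes the codec carried each raw octet
as the code point of the same value, and it picks big-endian UTF-16
purely for illustration.

```python
def restore_raw_bytes(utf8_payload: bytes) -> bytes:
    """Ordinary string literal: undo the UTF-8 wrapping and recover the
    original octets.  Assumes every code point is <= 0xFF, as it would
    be if the codec mapped byte n to code point n."""
    return utf8_payload.decode("utf-8").encode("latin-1")


def to_utf16(utf8_payload: bytes) -> bytes:
    """Unicode string literal: transform UTF-8 to UTF-16 (big-endian
    here, without a byte-order mark)."""
    return utf8_payload.decode("utf-8").encode("utf-16-be")


# Two octets >= 128 that reached the parser wrapped in UTF-8.
wrapped = b"\xc3\x86\xc3\xbc"
original_octets = restore_raw_bytes(wrapped)

# One character of a Unicode literal, as UTF-8 and then as UTF-16.
utf16_units = to_utf16("日".encode("utf-8"))
```

If restore_raw_bytes ever meets a code point above 0xFF it raises
UnicodeEncodeError, which is arguably the right behaviour: it means the
codec did not treat that literal as an ordinary string.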

Then you just lift the restriction that identifier names must consist
of bytes < 128.  A second stage would be to forbid non-ASCII
punctuation, whitespace, and other special characters in identifier
names, but their presence doesn't bother Python, only human readers
(unless you want to extend Python syntax to include some non-ASCII
symbols as reserved words).
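
The two stages could be sketched as a pair of checks like these.  They
are illustrative only: the function names are hypothetical, and the set
of Unicode general categories accepted in the second stage (letters,
combining marks, decimal digits, letter numbers) is my own guess at
"letters or letterlike symbols", not something settled in this thread.

```python
import unicodedata


def is_identifier_stage1(name: str) -> bool:
    """Stage 1: the usual ASCII rules, with anything >= 128 let through."""
    if not name:
        return False
    first = name[0]
    if ord(first) < 128 and not (first.isalpha() or first == "_"):
        return False
    for ch in name:
        if ord(ch) < 128 and not (ch.isalnum() or ch == "_"):
            return False
    return True


# Assumed whitelist of general categories for non-ASCII characters.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}


def is_identifier_stage2(name: str) -> bool:
    """Stage 2: additionally reject non-ASCII punctuation, whitespace,
    and other special characters."""
    if not is_identifier_stage1(name):
        return False
    for ch in name:
        if ord(ch) >= 128 and unicodedata.category(ch) not in ALLOWED_CATEGORIES:
            return False
    return True


# U+3000 IDEOGRAPHIC SPACE is harmless to the stage 1 check but is
# exactly what a human reader would trip over.
confusing = "a\u3000b"
```

With these definitions "名前" passes both stages, while the name
containing the ideographic space passes only the first.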

I see no reason why this would cause code breakage, although I haven't
tried it yet.  It would break debugging for people who abuse ordinary
strings to contain externally-encoded text, as they would be unable to
view their print'ed strings (externally encoded) in the same console
as error messages referring to identifiers (UTF-8).  But I think
that's a GoodThang<0.1 wink>.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py


