Multibyte Character Support for Python
Stephen J. Turnbull
stephen at xemacs.org
Thu May 9 01:46:06 EDT 2002
>>>>> "Martin" == Martin v Löwis <loewis at informatik.hu-berlin.de> writes:
Martin> So in the end, the only acceptable strategy would be to
Martin> allow identifiers that contain letters (or letterlike
Martin> symbols) in arbitrary languages. For Python, that would
Martin> mean that attributes must be Unicode objects, which could
Martin> cause code breakage.
This would actually be rather simple if you just declare that Python
programs as submitted to the (internal) parser must be in UTF-8, and
ensure that PEP 263 codecs perform that conversion in a way
transparent to the user.
These codecs would have to differ from ordinary codecs in one way:
they must know about ordinary Python strings (i.e., not Unicode), and
must _not_ translate those bytes to Unicode values, but rather pass
them to the parser with their integral values unchanged but
UTF-8-encoded.
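To make the "integral values unchanged but UTF-8-encoded" step concrete, here is a rough sketch in present-day Python. It is not the PEP 263 machinery; the helper name octets_to_parser_utf8 is made up for illustration, and I am assuming the octets are treated as code points U+0000..U+00FF, which is what the latin-1 codec does.

```python
def octets_to_parser_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: keep each octet's integral value, but UTF-8-encode it."""
    # latin-1 maps byte value n to the code point with the same value n,
    # so no octet is reinterpreted as a character of the external encoding.
    return raw.decode("latin-1").encode("utf-8")


# Externally encoded text sitting in an ordinary (byte) string.
raw = "日本語".encode("euc_jp")
wire = octets_to_parser_utf8(raw)

# Every code point the parser would see equals one original octet.
assert [ord(c) for c in wire.decode("utf-8")] == list(raw)
```

Pure-ASCII strings come through byte-for-byte identical, so only strings that already hold non-ASCII octets change shape on the way to the parser.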
The parser would need to know what to do with ordinary strings (i.e.,
decode UTF-8 encoded octets back to "raw" form) and Unicode strings
(i.e., transform from UTF-8 to UTF-16).
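The two parser-side cases might look something like this. Again a sketch in present-day Python with invented function names, assuming the ordinary-string octets reached the parser as code points below 256 and that "UTF-16" means little-endian code units without a BOM.

```python
def ordinary_string_value(utf8_literal: bytes) -> bytes:
    """Hypothetical: recover the "raw" octets of an ordinary string literal."""
    # Each code point is below 256 by assumption, so latin-1 turns it
    # back into the single octet with the same integral value.
    return utf8_literal.decode("utf-8").encode("latin-1")


def unicode_string_value(utf8_literal: bytes) -> bytes:
    """Hypothetical: turn a Unicode string literal into UTF-16 code units."""
    return utf8_literal.decode("utf-8").encode("utf-16-le")


# Round trip for an ordinary string holding two non-ASCII octets.
raw = b"\xc6\xfc"
wire = raw.decode("latin-1").encode("utf-8")   # what the codec would hand over
assert ordinary_string_value(wire) == raw

# A Unicode literal containing U+65E5 becomes one little-endian code unit.
assert unicode_string_value("\u65e5".encode("utf-8")) == b"\xe5\x65"
```

The point of keeping the two paths separate is that the same UTF-8 input means different things depending on which kind of literal it came from.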
Then you just lift the restriction that identifier names must consist
of bytes < 128. A second stage would be to forbid non-ASCII
punctuation, whitespace, and other special characters in identifier
names, but their presence doesn't bother Python, only human readers
(unless you want to extend Python syntax to include some non-ASCII
symbols as reserved words).
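A second-stage check of that kind could be sketched with Unicode general categories. This is only one plausible rule, not anything Python specifies here: the function name is invented, and the choice to allow letters anywhere and numbers or combining marks after the first character is my assumption.

```python
import unicodedata


def is_plausible_identifier(name: str) -> bool:
    """Hypothetical check: letters and '_' anywhere; numbers and marks after the start."""
    if not name:
        return False
    for i, ch in enumerate(name):
        if ch == "_":
            continue
        # First letter of the category: L letter, N number, M mark,
        # P punctuation, Z separator (whitespace), S symbol, C other.
        major = unicodedata.category(ch)[0]
        if major == "L":
            continue
        if i > 0 and major in ("N", "M"):
            continue
        return False
    return True


assert is_plausible_identifier("変数")          # CJK letters are accepted
assert not is_plausible_identifier("a\u3000b")  # ideographic space is a separator
assert not is_plausible_identifier("x\u3002")   # ideographic full stop is punctuation
```

Anything in the punctuation, separator, or symbol categories is rejected, which covers the characters a human reader would most easily mistake for syntax.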
I see no reason why this would cause code breakage, although I haven't
tried it yet. It would break debugging for people who abuse ordinary
strings to contain externally-encoded text, as they would be unable to
view their print'ed strings (externally encoded) in the same console
as error messages referring to identifiers (UTF-8). But I think
that's a GoodThang<0.1 wink>.
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
My nostalgia for Icon makes me forget about any of the bad things. I don't
have much nostalgia for Perl, so its faults I remember. Scott Gilbert c.l.py