[Python-Dev] forwarded message from Stephen J. Turnbull
Stephen J. Turnbull
stephen@xemacs.org
04 Mar 2002 12:57:48 +0900
>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:
Martin> I'm not sure whether he still has his original position
Martin> "do not allow multiple source encodings to enter the
Martin> language", which, to me, translates into "source encodings
Martin> are always UTF-8".
Yes, it is. I feel that it is possible to support the users who want
to use national encodings AND define the language in terms of a single
coded character set, as long as that set is Unicode. The usual
considerations of file system safety and standard C library
compatibility dictate that the transformation format be UTF-8. (Below
I will just write "UTF-8" as is commonly done.)
My belief is that the proposal below has the same effect on most users
most of the time as PEP 263, while not committing Python to indefinite
support of a subsystem that will certainly be obsolete for new code in
5 years, and most likely within 2 (at least for people using open
source and major vendor tools, I don't know what legacy editors people
may be using on "big iron" and whatnot).
Martin> If that is the route to take, PEP 263 should be REJECTED,
Martin> in favour of only tightening the Python language
Martin> definition to only allow UTF-8 source code.
I think so.
Martin> For Python, it probably would have to go to the second
Martin> line, with the rationale given in the Emacs manual: the
Martin> first line is often used for #!.
Precisely.
I do not have time or the background to do a formal counter-PEP for
several weeks (likely late April), since I'd have to do a fair amount
of research into both Python internals and PEP procedure. I'd be
happy to consult if someone who does know those wants to take it and
run with it.
Here's the bones:
1. a. Python source code is encoded in UTF-8, covering the full Unicode
character set (everything representable in UTF-32). The parser proper
will reject as corrupted anything that doesn't have the characteristic
UTF-8 leading-byte/trailing-byte signature (a sketch of this check
follows this item).
b. Some provision must be made for systematic handling of private
characters. I.e., there should be a way to register for, and be
dynamically allocated, a block from private space. You also need to be able
to request a specific block and move blocks, because many vendors
(Apple and Microsoft spring immediately to mind) allow their apps
to use fixed blocks in private space for vendor character sets.
At this stage it suffices to simply advise that any fixed use of
the private space is likely to conflict with future standards for
sharing that space.
c. This proposal takes no stand on the use of non-ASCII in
keywords and identifiers.
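A minimal sketch of the rejection rule in 1a, assuming the parser simply
attempts a strict UTF-8 decode before doing anything else (the function
name is mine, not part of the proposal):

    def require_utf8(raw_source, filename="<source>"):
        """Refuse anything that is not well-formed UTF-8 (sketch only)."""
        try:
            return raw_source.decode("utf-8", errors="strict")
        except UnicodeDecodeError as err:
            raise SyntaxError("%s: not UTF-8 source (bad byte at offset %d)"
                              % (filename, err.start))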
Accommodation of existing usage:
2. Python is a scripting language already in widespread use with
ambitions of longevity; provision must be made for quick hacks and
legacy code. This will be done via a preprocessing hook and
(possibly) I/O hooks.
The preprocessing hook is a filter which is run to transform the
source code buffer on input. It is the first thing done. Python
(the language) will never put anything on that hook; any code that
requires a non-null hook to function is not "true" Python. Thus
there need be no specification for the hook[1]; anything the user
puts on the hook is part of their environment. The preprocessing
hook can be disabled via a command line switch and possibly an
environment variable (it might even make sense for the hook
function to be named in an environment variable, in which case a
null value would disable it).
The intended use is a codec to be run on the source buffer to
convert to UTF-8.
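As a minimal sketch of the mechanics only (the environment variable name
PYTHONSOURCEHOOK and the dotted "module.function" convention are my own
illustrative assumptions, not part of this proposal):

    import importlib
    import os

    def run_preprocessing_hook(raw_source):
        """Apply the user's source filter, if any, before parsing (sketch)."""
        hook_spec = os.environ.get("PYTHONSOURCEHOOK")  # hypothetical name
        if not hook_spec:               # unset or empty: hook is disabled
            return raw_source
        module_name, func_name = hook_spec.rsplit(".", 1)
        hook = getattr(importlib.import_module(module_name), func_name)
        return hook(raw_source)         # must hand back UTF-8 bytes

The point is only that the interpreter never needs to know what the hook
does; whatever sits on it is part of the user's environment.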
3. The I/O hooks would be analogous, although you run into the usual
problems that many I/O channels obey much less stringent
consistency conditions than files, and in general need not be
rewindable. A similar hook would presumably be desirable for
primitive functions that "eval" strings.
4. It probably won't be possible to simply plug in existing codecs
without specifying the hook too precisely. Therefore Python
should provide a library of codec wrappers for hanging on the
hook.
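One possible shape for those wrappers, sketched with Python's existing
codecs module (the factory name is mine):

    import codecs

    def recoding_hook(source_encoding):
        """Wrap an existing codec as a source-preprocessing hook (sketch)."""
        decode = codecs.getdecoder(source_encoding)
        def hook(raw_source):
            text, _consumed = decode(raw_source)
            return text.encode("utf-8")     # hand the parser UTF-8
        return hook

    # e.g. for legacy EUC-JP source, put recoding_hook("euc-jp") on the hook
    euc_jp_hook = recoding_hook("euc-jp")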
5. Users who wish to use non-UTF-8 encodings are strongly advised to
use the "coding-cookie-in-comment at top of file" convention. This
convention is already supported by at least GNU Emacs and XEmacs
(modulo XEmacs's "first line only bug") and should be easily
implemented in other editors, including IDLE. To encourage this,
the library mentioned in 4 should provide an "autorecognition"
codec with at least the features that (1) it recognizes and acts
on coding cookies, with good, verbose error messages if
"corruption" is detected; (2) it recognizes and acts on the UTF-8
BOM, with "good" error messages; and (3) otherwise it defaults to
UTF-8, with "good" error messages.
This would allow the "naked" interpreter to just give a terse
"that ain't UTF-8" message. The "naked" interpreter might want to
error on a coding cookie. I think a coding cookie of "utf-8"
should probably be considered an error, as it indicates that the
user doesn't know the language spec.<wink> It might be desirable
to extend feature (2) to other Unicode BOMs.
Experience with Emacs Mule suggests that "smart" autorecognition
(eg of ISO 2022 versions) is not something that Python should
support as a standard feature, although the preprocessor hook
would allow optional modules for this purpose to be added easily.
Another "smart" hook function might make assumptions based on
POSIX locale, etc.
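To tie item 5 together, here is a bare-bones sketch of such an
autorecognition codec; the cookie pattern follows the Emacs-style
convention, and treating an explicit "utf-8" cookie as an error is only
one reading of the suggestion above:

    import re

    COOKIE_RE = re.compile(br"coding[:=]\s*([-\w.]+)")

    def autorecognize(raw_source):
        """Decode a source buffer: BOM, then coding cookie, then UTF-8."""
        if raw_source.startswith(b"\xef\xbb\xbf"):       # UTF-8 BOM
            return raw_source[3:].decode("utf-8")
        # A cookie is honoured only on the first or second line (see above).
        for line in raw_source.splitlines()[:2]:
            match = COOKIE_RE.search(line)
            if match:
                encoding = match.group(1).decode("ascii")
                if encoding.lower().replace("_", "-") == "utf-8":
                    raise SyntaxError("redundant utf-8 coding cookie; "
                                      "UTF-8 is already the default")
                return raw_source.decode(encoding)
        return raw_source.decode("utf-8")   # otherwise default to UTF-8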
6. Some provision will probably need to be made for strings.
Ordinary strings might need to be converted to Unicode or not,
depending on how non-UTF-8 I/O channels are supported. So the
"codec wrappers" mentioned in 2, 3, 4, and 5 would probably need
to understand Python string syntax, and it might be useful to have
a "newt string" type. A "newt string" would _always_ be protected
from conversion to Unicode (and would have a minimal API to force
programmers to not use them 'til "it got bettah").
Unicode strings would be exactly that, and legacy strings would
have semantics depending on the stage of the transition to
Unicode-only, and possibly the user's environment.
Footnotes:
[1] Well, you could try to make that stick.
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.