[I18n-sig] Strawman Proposal: Encoding Declaration V2

M.-A. Lemburg mal@lemburg.com
Sat, 10 Feb 2001 23:26:10 +0100

Paul Prescod wrote:
> The encoding declaration controls the interpretation of non-ASCII bytes
> in the Python source file. The declaration manages the mapping of
> non-ASCII byte strings into Unicode characters.
>
> A source file with an encoding declaration must only use non-ASCII bytes
> in places that can legally support Unicode characters. In Python 2.x the
> only place is within a Unicode literal. This restriction may be lifted
> in future versions of Python.
>
> In Python 2.x, the initial parsing of a Python script is done in terms
> of the file's byte values. Therefore it is not legal to use any byte
> sequence that has a byte that would be interpreted as a special
> character (e.g. quote character or backslash) according to the ASCII
> character set. This restriction may be lifted in future versions of
> Python.
>
> The encoding declaration must be found before the first statement in the
> source file. The declaration is not a pragma. It does not show up in the
> parse tree and has no semantic meaning for the compiler itself. It is
> conceptually handled in a pre-compile "encoding sniffing" step. This
> step is also done using the ASCII encoding.
>
> The encoding declaration has the following basic syntax:
>
>     #?encoding="<some string>"
>
> <some string> is the encoding name and must be associated with a
> registered codec. The codec is used to interpret non-ASCII byte
> sequences.
>
> The encoding declaration should be present in all Python source files
> containing non-ASCII bytes. Some future version of Python may make this
> an absolute requirement.
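
For reference, the "encoding sniffing" step in the quoted proposal could
be sketched roughly as follows. This is only an illustration of the text
above, not an implementation anybody has agreed on: the function name
sniff_encoding and the regular expression are my own assumptions, blank
lines and comments are assumed to be allowed before the declaration, and
the sketch is written for Python 3, with a bytes object standing in for
the raw file contents.

```python
import codecs
import re

# Hypothetical pattern for the proposed syntax: #?encoding="<some string>"
_DECL = re.compile(rb'^#\?encoding="([^"]+)"\s*$')

def sniff_encoding(source: bytes):
    """Return the declared encoding name, or None if nothing is declared."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue              # assumption: blank lines may come first
        match = _DECL.match(stripped)
        if match:
            # The sniffing step itself works in ASCII.
            name = match.group(1).decode("ascii")
            # The name must belong to a registered codec;
            # codecs.lookup raises LookupError otherwise.
            codecs.lookup(name)
            return name
        if not stripped.startswith(b"#"):
            return None           # first statement reached, stop looking
    return None
```

A file that starts with a "#!" line followed by #?encoding="latin-1"
would yield "latin-1"; a declaration placed after the first statement is
ignored.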

The proposal sounds overly complicated to me, even though the resulting
semantics seem to be the same as those I summarized in my last mail
on the original "Encoding Declaration" thread:

1. programs which do not use the encoding declaration are free
   to use non-ASCII bytes in literals; Unicode literals must
   use Latin-1 (for historic reasons)

2. programs which do make use of the encoding declaration may
   only use non-ASCII bytes in Unicode literals; these are then
   interpreted using the given encoding information and decoded
   into Unicode during the compilation step
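
The two rules above can be written down as a small sketch. It is only
meant to make the semantics concrete and is not the compiler change
itself: the helper name decode_literal and its parameters are
hypothetical, the choice of SyntaxError for the rule 2 violation is my
assumption, and it is written for Python 3, with a bytes object standing
in for the raw bytes of one literal.

```python
def decode_literal(raw: bytes, is_unicode: bool, declared=None):
    """Apply rule 1 or rule 2 to the raw bytes of a single literal."""
    has_non_ascii = any(b > 127 for b in raw)
    if declared is None:
        # Rule 1: no declaration. Unicode literals are read as Latin-1,
        # plain string literals keep their bytes untouched.
        return raw.decode("latin-1") if is_unicode else raw
    if not is_unicode:
        # Rule 2: with a declaration, plain string literals must be ASCII.
        if has_non_ascii:
            raise SyntaxError("non-ASCII byte in a plain string literal")
        return raw
    # Rule 2: Unicode literals are decoded with the declared encoding.
    return raw.decode(declared)
```

Without a declaration the byte 0xE4 in a Unicode literal becomes U+00E4;
with a "utf-8" declaration the bytes 0xC3 0xA4 decode to the same
character, while a plain string literal containing them is rejected.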

Part 1 ensures backward compatibility. Part 2 ensures that programmers
start to think about where they have to use Unicode and which of
their program's literals may remain plain string literals. Part 1
is already implemented; part 2 is easy to do, since only the
compiler has to be changed (in two places).

If you want to keep your version, please add an explicit section
about point 1 to it; otherwise it will cause unnecessary confusion.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/