[I18n-sig] Strawman Proposal: Encoding Declaration V2

Paul Prescod paulp@ActiveState.com
Sat, 10 Feb 2001 07:58:22 -0800


The encoding declaration controls how non-ASCII bytes in a Python
source file are interpreted. It names the codec that maps non-ASCII
byte sequences to Unicode characters.
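
To illustrate why the mapping has to be declared, here is a minimal
sketch in present-day Python syntax (not the Python 2.x the proposal
targets). The helper name decode_with is made up for this example; the
two codec names are standard ones chosen only for illustration.

```python
def decode_with(raw_bytes, encoding_name):
    """Decode raw source bytes using the named codec."""
    return raw_bytes.decode(encoding_name)

# One and the same non-ASCII byte...
raw = b"\xe9"

# ...is a different Unicode character depending on the codec applied.
as_latin1 = decode_with(raw, "latin-1")   # U+00E9, e with acute accent
as_cp1251 = decode_with(raw, "cp1251")    # U+0439, Cyrillic short i
```

Without a declaration, a reader of the file has no way to tell which of
these interpretations the author intended.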

A source file with an encoding declaration may use non-ASCII bytes only
in places that can legally hold Unicode characters. In Python 2.x the
only such place is inside a Unicode literal. This restriction may be
lifted in future versions of Python.

In Python 2.x, the initial parsing of a Python script is done in terms
of the file's byte values. Therefore it is not legal to use any
multi-byte sequence containing a byte that ASCII would interpret as a
special character (e.g. a quote character or backslash). For example,
some Shift_JIS characters have 0x5C, the ASCII backslash, as their
second byte. This restriction may be lifted in future versions of
Python.
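
A rough way to check text for this problem is sketched below, again in
present-day Python syntax. The function name hazardous_chars and the
choice of which bytes count as "special" are assumptions made for the
example, not part of the proposal.

```python
# Bytes that a byte-level, ASCII-based parser treats specially inside a
# string literal (an illustrative set, not an exhaustive one).
SPECIAL_BYTES = {0x22: '"', 0x27: "'", 0x5C: "\\"}

def hazardous_chars(text, encoding_name):
    """Return (character, encoded bytes) pairs whose multi-byte encoding
    contains a byte that ASCII would read as a quote or backslash."""
    found = []
    for ch in text:
        encoded = ch.encode(encoding_name)
        if len(encoded) > 1 and any(b in SPECIAL_BYTES for b in encoded):
            found.append((ch, encoded))
    return found

# KATAKANA LETTER SO encodes in Shift_JIS as 0x83 0x5C, so its second
# byte looks like a backslash to a byte-level parser.
shift_jis_hits = hazardous_chars("\u30bd", "shift_jis")

# UTF-8 multi-byte sequences never contain bytes below 0x80.
utf8_hits = hazardous_chars("\u30bd", "utf-8")
```

Under these assumptions, shift_jis_hits holds one entry and utf8_hits is
empty, which is why the restriction bites for some encodings and not
others.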

The encoding declaration must appear before the first statement in the
source file. The declaration is not a pragma. It does not show up in the
parse tree and has no semantic meaning for the compiler itself. It is
conceptually handled in a pre-compile "encoding sniffing" step. Like
the initial parse, this step reads the file using the ASCII encoding.

The encoding declaration has the following basic syntax:

#?encoding="<some string>"

<some string> is the encoding name and must be associated with a
registered codec. The codec is used to interpret non-ASCII byte
sequences.
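
The "encoding sniffing" step could look roughly like the sketch below,
written in present-day Python syntax. It is only one possible reading
of the proposal: the function name sniff_encoding, the exact regular
expression, and the rule that any non-blank, non-comment line counts as
the first statement are all assumptions. Only codecs.lookup is a real
standard-library call; it raises LookupError when no codec is
registered under the given name.

```python
import codecs
import re

# Assumed exact form of the declaration: #?encoding="<some string>"
DECLARATION = re.compile(rb'^#\?encoding="([^"]*)"\s*$')

def sniff_encoding(source_bytes):
    """Return the declared encoding name, or None if no declaration is
    found before the first statement. Works on raw bytes only."""
    for line in source_bytes.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                 # blank lines do not end the search
        if not stripped.startswith(b"#"):
            return None              # reached the first statement
        match = DECLARATION.match(stripped)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)      # LookupError if not registered
            return name
    return None
```

For example, a file starting with a "#!" line followed by
#?encoding="latin-1" would yield "latin-1", while a declaration placed
after the first statement would be ignored.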

The encoding declaration should be present in all Python source files
containing non-ASCII bytes. Some future version of Python may make this
an absolute requirement.

 Paul Prescod