[Python-Dev] PEP 263 -- Python Source Code Encoding

M.-A. Lemburg mal@lemburg.com
Wed, 27 Feb 2002 11:07:15 +0100

Tim Peters wrote:
> [M.-A. Lemburg]
> > Jack had the same question. The simple answer is: we need this
> > in order to maintain backward compatibility when we move to
> > phase two of the implementation.
> >
> > Here's the longer one:
> >
> > ASCII is the standard encoding for Python keywords and identifiers.
> > There is no standard source code encoding for string literals.
> But there is:
>     Python uses the 7-bit ASCII character set for program text and
>     string literals.  8-bit characters may be used in string literals
>     and comments but their interpretation is platform dependent; the
>     proper way to insert 8-bit characters in string literals is by
>     using octal or hexadecimal escape sequences.
> The Ref Man has said "7-bit ASCII" for both "program text and string
> literals" for a long time.  The formal grammar in the Ref Man agrees with
> this (including the formal grammar for Unicode literals).  It's an
> historical accident that the tokenizer happened to use C isalpha() to
> "enforce" this for identifiers, and that C isalpha() happened to grow
> locale-dependence while Guido was too drunk with power to notice <wink>.

It's a fact of life that users don't read reference manuals,
but simply write programs and feel good if they happen to
work :-)

As a result, programs have used string literals in many different
encodings for a long time. Changing this situation will take
time. The proposal aims to clarify the situation and to
make the transition less painful.
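
To make the ambiguity concrete, here is a minimal sketch. It assumes
a current Python 3 interpreter rather than the code base discussed in
this thread, and the byte values and the three codecs are picked
purely for illustration. The same bytes in a string literal read
differently depending on which local encoding the author had in mind:

```python
# Illustration only: one byte string, three plausible readings.
raw = b"\xe4\xf6\xfc"  # bytes as they might sit in a source file on disk

as_latin1 = raw.decode("latin-1")  # Western European reading
as_koi8r = raw.decode("koi8_r")    # one Russian reading
as_cp1251 = raw.decode("cp1251")   # another Cyrillic reading

# Print the code points each reading produces for the same bytes.
for name, text in [("latin-1", as_latin1),
                   ("koi8-r", as_koi8r),
                   ("cp1251", as_cp1251)]:
    print(name, [hex(ord(ch)) for ch in text])
```

Nothing in the file itself says which reading is the intended one;
that is the gap the proposal is meant to close.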

> > Unicode literals are interpreted using 'unicode-escape' which
> > is an enhanced Latin-1 with escape semantics.
> I'm sure they *do* "act like" Latin-1 on your box, and that identifiers also
> act like Latin-1 was in effect on your box.  But the Ref Man explicitly says
> all that is platform dependent; there's no "backward compatibility" to
> preserve here beyond 7-bit ASCII unless you want to preserve that Python
> always rely on what C isalpha() says.

You tell that to the Russians, Japanese or the Europeans 
writing Python programs -- it just happens that comments and
literals are bound to end up using local encodings.

Anyway, with the PEP implemented we'll no longer have to
restrict ourselves to 7-bit US-ASCII, so all these problems
will go away.
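
As a rough sketch of what the declared encoding buys you (assuming
a current Python 3 interpreter, where compile() accepts bytes and
honours the coding comment; the helper name load_literal is made up
for this example), the same byte in a literal is decoded according to
what the file says about itself:

```python
def load_literal(encoding_name):
    """Compile a tiny module whose literal holds the single byte 0xE4.

    Hypothetical helper for this illustration only.
    """
    source = (b"# -*- coding: " + encoding_name.encode("ascii") + b" -*-\n"
              b"s = '\xe4'\n")
    namespace = {}
    # compile() reads the coding comment on line 1 and decodes accordingly.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]

latin = load_literal("latin-1")
koi = load_literal("koi8-r")
print(hex(ord(latin)), hex(ord(koi)))
```

The two calls return different characters for identical bytes, and
the only thing that changed is the declaration at the top.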

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/