[Python-Dev] PEP 263 -- Python Source Code Encoding

M.-A. Lemburg mal@lemburg.com
Wed, 27 Feb 2002 10:16:18 +0100

"Martin v. Loewis" wrote:
> Guido van Rossum <guido@python.org> writes:
> > > This makes Latin-1 the right choice:
> > >
> > > * Unicode literals already use it today
> >
> > But they shouldn't, IMO.
> I agree. I recommend to deprecate this feature, and raise a
> DeprecationWarning if a Unicode literal contains non-ASCII characters
> but no encoding has been declared.
> > Sorry, I don't understand what you're trying to say here.  Can you
> > explain this with an example?  Why can't we require any program
> > encoded in more than pure ASCII to have an encoding magic comment?  I
> > guess I don't understand what you mean by "raw binary".
> With the proposed implementation, the encoding declaration is only
> used for Unicode literals. In all other places where non-ASCII
> characters can occur (comments, string literals), those characters are
> treated as "bytes", i.e. it is not verified that these bytes are
> meaningful under the declared encoding.
> Marc's original proposal was to apply the declared encoding to the
> complete source code, but I objected claiming that it would make the
> tokenizer changes more complex, and the resulting tokenizer likely
> significantly slower (at least if you use the codecs API to perform the
> decoding).

I don't think that the codecs will significantly slow down
overall compilation -- the compiler is not fast to begin with.

However, changing the base type in the tokenizer and compiler
from char* to Py_UNICODE* will be a significant effort and
that's why I added two phases to the implementation.

The first phase will only touch Unicode literals as proposed by Martin.
> In phase 2, the encoding will apply to all strings. So it will not be
> possible to put arbitrary byte sequences in a string literal, at least
> if the encoding disallows certain byte sequences (like UTF-8, or
> ASCII). Since this is currently possible, we have a backwards
> compatibility problem.

Right, and I believe that a lot of people in European
countries write string literals with a Latin-1 encoding
in mind. We cannot simply break all that code.
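
To make the compatibility problem concrete, here is a minimal sketch of
why such literals would break. The byte string and the helper name
decodes_as are made up for illustration; the point is only that bytes
saved by a Latin-1 editor are usually not a valid sequence under a
stricter encoding such as UTF-8 or ASCII.

```python
# Bytes of the literal "Gruesse" with u-umlaut and sharp s, as a
# Latin-1 editor would save them (illustrative data, not from real code).
latin1_bytes = b"Gr\xfc\xdfe"


def decodes_as(data, encoding):
    """Return True if `data` is a valid byte sequence under `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for name in ("latin-1", "utf-8", "ascii"):
    print(name, decodes_as(latin1_bytes, name))
```

Latin-1 accepts every byte value, so it is the only one of the three
that cannot reject existing files.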

The other problem is with comments found in Python source
code. In phase 2 these will break as well.

So how about this:

In phase 1, the tokenizer checks the *complete file* for
non-ASCII characters and outputs a single warning
per file if it doesn't find a coding declaration at
the top. Unicode literals continue to use [raw-]unicode-escape
as codec.
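
The phase 1 check could look roughly like the sketch below. This is
written in Python for readability, not as the actual tokenizer change.
The function name check_source, the regular expression for the magic
comment, the "first two lines" rule and the warning category are all
assumptions made for the example.

```python
import re
import warnings

# Assumed shape of the magic comment, e.g. "# -*- coding: latin-1 -*-".
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def check_source(data, filename="<unknown>"):
    """Warn once if `data` (bytes) contains non-ASCII bytes but no
    coding declaration near the top. Return True if the file is fine."""
    head = data.splitlines()[:2]  # assumption: declaration in first two lines
    declared = any(
        line.lstrip().startswith(b"#") and CODING_RE.search(line)
        for line in head
    )
    has_non_ascii = any(byte > 127 for byte in data)
    if has_non_ascii and not declared:
        # One warning per file, regardless of how many bytes offend.
        warnings.warn(
            "%s: non-ASCII bytes found but no encoding declared" % filename,
            DeprecationWarning,
        )
        return False
    return True
```

In phase 2 the same check would raise an error instead of warning.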

In phase 2, we enforce ASCII as the default encoding, i.e.
the warning will turn into an error. The [raw-]unicode-escape
codec will be extended to also support converting Unicode
to Unicode, that is, only handle escape sequences in this case.

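A Unicode-to-Unicode escape pass might be sketched as follows. This is
a simplified illustration, not the codec itself: the helper name
expand_u_escapes is hypothetical, it only handles \uXXXX, and it skips
the real codec's rule about counting preceding backslashes.

```python
import re

# Matches a backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def expand_u_escapes(text):
    """Replace \\uXXXX sequences in an already-decoded string and
    leave every other character untouched."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The input already holds a decoded u-umlaut plus one escaped sharp s.
print(expand_u_escapes("Gr\u00fc\\u00dfe") == "Gr\u00fc\u00dfe")
```
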
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/