[Python-Dev] PEP 263 -- Python Source Code Encoding
M.-A. Lemburg
mal@lemburg.com
Wed, 27 Feb 2002 10:16:18 +0100
"Martin v. Loewis" wrote:
>
> Guido van Rossum <guido@python.org> writes:
>
> > > This makes Latin-1 the right choice:
> > >
> > > * Unicode literals already use it today
> >
> > But they shouldn't, IMO.
>
> I agree. I recommend to deprecate this feature, and raise a
> DeprecationWarning if a Unicode literal contains non-ASCII characters
> but no encoding has been declared.
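For illustration, the kind of coding declaration PEP 263 proposes can be detected with a small check (a sketch using a simplified version of the PEP's regular expression; the function name is mine):

```python
import re

# Simplified version of PEP 263's pattern for a coding declaration,
# which must appear in the first or second line of the file.
CODING_RE = re.compile(r"^[ \t]*#.*?coding[:=][ \t]*([-\w.]+)")

def declared_encoding(lines):
    """Return the declared source encoding, or None if there is none."""
    for line in lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

print(declared_encoding(["# -*- coding: latin-1 -*-", "s = u'caf\\xe9'"]))
# -> latin-1; a file returning None here would get the DeprecationWarning
```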
>
> > Sorry, I don't understand what you're trying to say here. Can you
> > explain this with an example? Why can't we require any program
> > encoded in more than pure ASCII to have an encoding magic comment? I
> guess I don't understand what you mean by "raw binary".
>
> With the proposed implementation, the encoding declaration is only
> used for Unicode literals. In all other places where non-ASCII
> characters can occur (comments, string literals), those characters are
> treated as "bytes", i.e. it is not verified that these bytes are
> meaningful under the declared encoding.
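To make the asymmetry concrete, here is a sketch (in terms of raw literal bodies rather than the real tokenizer) of what this rule means: the same byte is decoded for a Unicode literal but left untouched, and unverified, in a plain string literal.

```python
declared = "latin-1"  # the file's declared encoding

# Raw bytes of the two literal bodies as they appear in the source file:
unicode_literal_body = b"caf\xe9"  # from u"café"
plain_literal_body = b"caf\xe9"    # from "café"

# Only the Unicode literal is decoded with the declared encoding.
decoded = unicode_literal_body.decode(declared)

# The plain string literal keeps its bytes; nobody checks whether they
# are meaningful under the declared encoding.
raw = plain_literal_body

print(decoded)  # café
print(raw)      # b'caf\xe9'
```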
>
> Marc's original proposal was to apply the declared encoding to the
> complete source code, but I objected claiming that it would make the
> tokenizer changes more complex, and the resulting tokenizer likely
> significantly slower (at least if you use the codecs API to perform the
> decoding).
I don't think that the codecs will significantly slow down
overall compilation -- the compiler is not fast to begin
with.
However, changing the base type in the tokenizer and compiler
from char* to Py_UNICODE* will be a significant effort, and
that's why I added two phases to the implementation.
The first phase will only touch Unicode literals as proposed by Martin.
> In phase 2, the encoding will apply to all strings. So it will not be
> possible to put arbitrary byte sequences in a string literal, at least
> if the encoding disallows certain byte sequences (like UTF-8, or
> ASCII). Since this is currently possible, we have a backwards
> compatibility problem.
Right, and I believe that a lot of people in European
countries write string literals with a Latin-1 encoding
in mind. We cannot simply break all that code.
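The break is easy to demonstrate: a byte like 0xE9 is 'é' under Latin-1 but is invalid both as UTF-8 and as ASCII (a sketch):

```python
data = b"caf\xe9"  # 'café' encoded as Latin-1

print(data.decode("latin-1"))  # café

# Under phase 2 with a UTF-8 or ASCII (default) source encoding, the
# same bytes would be rejected:
for enc in ("utf-8", "ascii"):
    try:
        data.decode(enc)
    except UnicodeDecodeError as exc:
        print(f"{enc}: {exc.reason}")
```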
The other problem is with comments found in Python source
code. In phase 2 these will break as well.
So how about this:
In phase 1, the tokenizer checks the *complete file* for
non-ASCII characters and outputs a single warning
per file if it doesn't find a coding declaration at
the top. Unicode literals continue to use the
[raw-]unicode-escape codec.
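A minimal sketch of that phase-1 check (names and structure are mine, not the actual tokenizer change):

```python
def check_source(raw_bytes, coding_declared):
    """Return a warning message if the file contains non-ASCII bytes
    but no coding declaration; None otherwise (phase-1 behaviour)."""
    try:
        raw_bytes.decode("ascii")
        return None  # pure ASCII, nothing to warn about
    except UnicodeDecodeError:
        if not coding_declared:
            return "DeprecationWarning: non-ASCII bytes, no coding declaration"
        return None

print(check_source(b"x = 1\n", False))          # None
print(check_source(b"s = 'caf\xe9'\n", False))  # the warning
print(check_source(b"s = 'caf\xe9'\n", True))   # None
```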
In phase 2, we enforce ASCII as default encoding, i.e.
the warning will turn into an error. The [raw-]unicode-escape
codec will be extended to also support converting Unicode
to Unicode, that is, only handle escape sequences in this
case.
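The str-to-str mode described here can be approximated today (a sketch; using backslashreplace to keep non-Latin-1 characters round-trippable is my own device, not part of the proposal):

```python
def unicode_unicode_escape(text):
    """Decode escape sequences in a str while passing other characters
    through unchanged -- roughly the extended codec behaviour proposed."""
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")

print(unicode_unicode_escape("caf\\xe9"))  # café
print(unicode_unicode_escape("café"))      # café (unchanged)
```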
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/