[Python-Dev] PEP 263 - Defining Python Source Code Encodings

14 Jul 2002 21:31:27 +0200

"M.-A. Lemburg" <mal@lemburg.com> writes:

> Oh, I thought it would be natural from reading the complete
> text:

It still is not natural from reading the text you quote.

> 
> """
>      2. Change the tokenizer/compiler base string type from char* to
>         Py_UNICODE* and apply the encoding to the complete file.

As you say, this is more conveniently done with UTF-8 char*.

>         Source files which fail to decode cause an error to be raised
>         during compilation.

In the case of Unicode strings passed to compile(), this is
irrelevant; the string is already decoded.

>         The builtin compile() API will be enhanced to accept Unicode as
>         input. 8-bit string input is subject to the standard procedure
>         for encoding detection as described above.
> """

That only says that Unicode strings are processed; it still does not
say how string literals appearing in the source code are treated.

> Of course, we no longer need to convert the tokenizer to
> work on Py_UNICODE, so the updated text should mention
> that compile() encodes Unicode input to UTF-8 to then continue
> with the usual processing.

The PEP currently does not say that.

> > 2. convert to byte string using "utf-8" encoding,
[...]
> Option 2. 

I think this contradicts the current wording of the PEP. It says

"5. ... and creating string objects from the Unicode literal data by
first reencoding the UTF-8 data into 8-bit string data using the given
file encoding"

The phrasing "the given file encoding" is a bit lax, but given the
string

u"""
# -*- coding: iso-8859-1 -*-
s = 'some latin-1 text'
"""

I would expect that the encoding "given" is iso-8859-1, not utf-8.
Now, I interpret your message to mean that s will be encoded in
utf-8. Correct?

If so, I think Fredrik is right, and

  compile(unicode(script, extract_encoding(script)))

does indeed do something different from

  compile(script)

as the latter would give the string value assigned to s in its
original encoding, i.e. latin-1.
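
To make the difference concrete, here is a minimal sketch in Python 3
syntax, where bytes objects stand in for the 8-bit strings under
discussion. The literal text and the helper name literal_bytes are
hypothetical and chosen only for illustration; the sketch compares the
two byte values and does not call compile() at all.

```python
# Python 3 sketch; bytes objects stand in for the 8-bit strings of the thread.
LITERAL_TEXT = "caf\u00e9"  # hypothetical literal with one non-ASCII character


def literal_bytes(text, encoding):
    """Return the 8-bit value a literal would have under the given encoding."""
    return text.encode(encoding)


# compile(script): the literal keeps the bytes of the declared file encoding.
kept_in_file_encoding = literal_bytes(LITERAL_TEXT, "iso-8859-1")

# Option 2: the literal is converted to a byte string using UTF-8.
converted_to_utf8 = literal_bytes(LITERAL_TEXT, "utf-8")

print(kept_in_file_encoding)  # b'caf\xe9'
print(converted_to_utf8)      # b'caf\xc3\xa9'
```

The two values differ in both content and length, so code that inspects
s byte by byte would observe the change.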

> Ideal would be to have the tokenizer skip the encoding declaration
> detection and start directly with the UTF-8 string 

"skip the encoding declaration" can't really work; you have to parse
the source code line by line. You can tell the implementation to
ignore the encoding declaration, if desired.
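
For illustration, a rough sketch of what the extract_encoding helper
used above might look like, in Python 3 syntax. The helper is
hypothetical; the pattern is modelled on the declaration format the PEP
describes, and the "ascii" default is an assumption. It shows that
finding the declaration already means looking at the source line by
line.

```python
import re

# Pattern modelled on the declaration format described in PEP 263.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def extract_encoding(script, default="ascii"):
    """Return the encoding declared in the first two lines of a byte string.

    Hypothetical helper: falls back to `default` if no declaration is found.
    """
    for line in script.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


script = b"# -*- coding: iso-8859-1 -*-\ns = 'some latin-1 text'\n"
print(extract_encoding(script))  # iso-8859-1
```

An implementation that is told to ignore the declaration would still
read these lines; it would just not act on the match.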

> (this also solves the problems you'd run into in case the Unicode
> source code has a source code encoding comment).

Well, that is precisely the issue that I'm trying to address here. I
still believe that the resulting behaviour is not specified in the PEP
at the moment (which is no big deal, since the current implementation
does not touch compile() at all).
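
As a point of comparison only, the following sketch shows how a
present-day Python 3 interpreter treats the same kind of input. This is
not what the PEP text under discussion specifies. There, compile()
ignores an encoding declaration inside text (str) input but honours it
for byte input. The sample source and the helper name run are
hypothetical.

```python
# Source text that carries a declaration naming an encoding other than UTF-8.
source_text = "# -*- coding: iso-8859-1 -*-\ns = 'caf\u00e9'\n"


def run(source):
    """Compile and execute the source, returning the value bound to s."""
    namespace = {}
    exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace["s"]


# Text input: the declaration is ignored, the literal is taken as written.
from_text = run(source_text)

# The same text encoded to UTF-8 first: the declaration is now honoured,
# so the UTF-8 bytes are decoded as iso-8859-1.
from_utf8_bytes = run(source_text.encode("utf-8"))

print(from_text == from_utf8_bytes)  # False
```

Naively encoding Unicode input to UTF-8 and then continuing with the
usual processing therefore changes the result whenever the source
declares a different encoding.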

Regards,
Martin