[Python-Dev] PEP 263 - Defining Python Source Code Encodings

M.-A. Lemburg mal@lemburg.com
Sun, 14 Jul 2002 23:09:12 +0200


Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
>>Of course, we no longer need to convert the tokenizer to
>>work on Py_UNICODE, so the updated text should mention
>>that compile() encodes Unicode input to UTF-8 and then continues
>>with the usual processing.
> 
> 
> The PEP currently does not say that.

I know; it should be updated to describe the solution found by
Hisao.

>>>2. convert to byte string using "utf-8" encoding,
>>
> [...]
> 
>>Option 2. 
> 
> 
> I think this contradicts the current wording of the PEP. It says
> 
> "5. ... and creating string objects from the Unicode literal data by
> first reencoding the UTF-8 data into 8-bit string data using the given
> file encoding"
> 
> The phrasing "the given file encoding" is a bit lax, but given the
> string
> 
> u"""
> # -*- coding: iso-8859-1 -*-
> s = 'some latin-1 text'
> """
> 
> I would expect that the encoding "given" is iso-8859-1, not utf-8.
> Now, I interpret your message to mean that s will be encoded in
> utf-8. Correct?

Hmm, good point. 8-bit string literals will have to be reencoded
using the encoding stated in the coding comment, so skipping that
comment for a Unicode argument to compile() would break this.
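
The two outcomes Martin describes differ only in the final encoding
step. A minimal sketch of that difference, written for a current
Python 3 interpreter rather than the 2.x one under discussion;
reencode_literal is a hypothetical helper name, not something taken
from the PEP or its implementation:

```python
def reencode_literal(utf8_data, file_encoding):
    """Reencode UTF-8 literal data into the declared file encoding."""
    # Hypothetical helper modelling step 5 of the PEP as quoted above.
    return utf8_data.decode("utf-8").encode(file_encoding)


# Literal data as it would look after conversion to UTF-8.
utf8_literal = "caf\u00e9".encode("utf-8")

# Honouring the coding comment gives back the original Latin-1 bytes ...
as_declared = reencode_literal(utf8_literal, "iso-8859-1")

# ... whereas skipping the comment would leave UTF-8 bytes in s.
as_utf8 = utf8_literal
```

The two byte strings have different lengths, so code that indexes or
measures s would notice which of the two behaviours was chosen.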

> If so, I think Fredrik is right, and
> 
>   compile(unicode(script, extract_encoding(script)))
> 
> does indeed do something different than
> 
>   compile(script)
> 
> as the latter would give the string value assigned to s in its
> original encoding, i.e. latin-1.

Right. We don't want that.

compile(unicode(script, extract_encoding(script)))
should be the same as
compile(script)
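
As a rough illustration of that requirement, here is a sketch of the
equivalence check. It assumes a current Python 3 interpreter, where
script.decode(...) stands in for unicode(script, ...) and string
literals are text rather than 8-bit strings, so the comparison is on
the decoded value. extract_encoding is only named in this thread, not
defined; the version below is a guess at its intent, modelled on the
regular expression PEP 263 gives for the coding comment, and the
"ascii" default is likewise an assumption.

```python
import re

# Modelled on the PEP 263 pattern; only the first two lines are checked.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def extract_encoding(script, default="ascii"):
    """Return the encoding named in a coding comment, else the default."""
    for line in script.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


# Source bytes in Latin-1, carrying a coding comment.
script = b"# -*- coding: iso-8859-1 -*-\ns = 'caf\xe9'\n"
encoding = extract_encoding(script)

# Compile the raw bytes and the decoded text, then run both.
ns_bytes = {}
exec(compile(script, "<bytes>", "exec"), ns_bytes)
ns_text = {}
exec(compile(script.decode(encoding), "<text>", "exec"), ns_text)

# Under these assumptions both paths assign the same value to s.
same_result = ns_bytes["s"] == ns_text["s"]
```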

>>Ideal would be to have the tokenizer skip the encoding declaration
>>detection and start directly with the UTF-8 string 
> 
> 
> "skip the encoding declaration" can't really work; you have to parse
> the source code line by line. You can tell the implementation to
> ignore the encoding declaration, if desired.

No, this wouldn't be right. I withdraw that comment :-)

>>(this also solves the problems you'd run into in case the Unicode
>>source code has a source code encoding comment).
> 
> 
> Well, that is precisely the issue that I'm trying to address here. I
> still believe that the resulting behaviour is not specified in the PEP
> at the moment (which is no big deal, since the current implementation
> does not touch compile() at all).

I'll try to come up with a proper wording tomorrow.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH