[Python-Dev] PEP 263 - Defining Python Source Code Encodings
M.-A. Lemburg
mal@lemburg.com
Sun, 14 Jul 2002 23:09:12 +0200
Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
>>Of course, we no longer need to convert the tokenizer to
>>work on Py_UNICODE, so the updated text should mention
>>that compile() encodes Unicode input to UTF-8 and then continues
>>with the usual processing.
>
>
> The PEP currently does not say that.
I know; it should be updated to describe the solution found by
Hisao.
>>>2. convert to byte string using "utf-8" encoding,
>>
> [...]
>
>>Option 2.
>
>
> I think this contradicts the current wording of the PEP. It says
>
> "5. ... and creating string objects from the Unicode literal data by
> first reencoding the UTF-8 data into 8-bit string data using the given
> file encoding"
>
> The phrasing "the given file encoding" is a bit lax, but given the
> string
>
> u"""
> # -*- coding: iso-8859-1 -*-
> s = 'some latin-1 text'
> """
>
> I would expect that the encoding "given" is iso-8859-1, not utf-8.
> Now, I interpret your message to mean that s will be encoded in
> utf-8. Correct?
Hmm, good point. 8-bit string literals will have to be reencoded
using the encoding stated in the coding comment... skipping that
comment for a Unicode argument to compile() would break this.
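
To make the reencoding step concrete, here is a rough sketch of the
round trip for a single literal. It is written for a current Python 3
interpreter purely as an illustration and is not the tokenizer's actual
code; the sample source and all variable names are assumptions made up
for this example.

```python
# Illustration only: models the decode -> UTF-8 -> reencode round trip
# with plain bytes operations; it does not call the real tokenizer.

# Assumed sample source, stored on disk as Latin-1 with a coding comment.
declared = "iso-8859-1"
source = b"# -*- coding: iso-8859-1 -*-\ns = 'caf\xe9'\n"

# The source is decoded with the declared encoding and handed to the
# tokenizer as UTF-8.
utf8_source = source.decode(declared).encode("utf-8")

# Inside the tokenizer the literal body is therefore UTF-8 data.
literal_utf8 = "caf\u00e9".encode("utf-8")
assert literal_utf8 in utf8_source

# For an 8-bit string literal, the UTF-8 data is reencoded using the
# encoding stated in the coding comment.
literal_8bit = literal_utf8.decode("utf-8").encode(declared)
assert literal_8bit == b"caf\xe9"

# Ignoring the coding comment would leave the UTF-8 bytes in place,
# which is a different 8-bit string.
assert literal_utf8 != literal_8bit
```

The last assertion is the breakage in question: without the coding
comment, s would hold the two-byte UTF-8 sequence instead of the single
Latin-1 byte.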
> If so, I think Fredrik is right, and
>
> compile(unicode(script, extract_encoding(script)))
>
> does indeed do something different than
>
> compile(script)
>
> as the latter would give the string value assigned to s in its
> original encoding, i.e. latin-1.
Right. We don't want that.

    compile(unicode(script, extract_encoding(script)))

should be the same as

    compile(script)
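
As a sanity check of that requirement, the comparison can be sketched as
below. This assumes a current Python 3 interpreter, where the builtin
unicode() no longer exists and bytes.decode() stands in for it;
extract_encoding() is a hypothetical helper, built here on the standard
tokenize.detect_encoding(), and the sample script is made up for the
example.

```python
import io
import tokenize


def extract_encoding(script: bytes) -> str:
    """Hypothetical helper: return the encoding named in the coding comment."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(script).readline)
    return encoding


# Assumed sample script, stored as Latin-1 bytes with a coding comment.
script = b"# -*- coding: iso-8859-1 -*-\ns = 'caf\xe9'\n"

# Compile the raw bytes; the compiler reads the coding comment itself.
ns_bytes = {}
exec(compile(script, "<bytes>", "exec"), ns_bytes)

# Decode first, then compile the resulting text.
ns_text = {}
text = script.decode(extract_encoding(script))
exec(compile(text, "<text>", "exec"), ns_text)

# Both routes should bind s to the same value.
assert ns_bytes["s"] == ns_text["s"]
```

Note that in Python 3 the literal is a text string in both cases, so
this only checks that the two call forms agree; it does not reproduce
the 8-bit string semantics discussed above.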
>>Ideal would be to have the tokenizer skip the encoding declaration
>>detection and start directly with the UTF-8 string
>
>
> "skip the encoding declaration" can't really work; you have to parse
> the source code line by line. You can tell the implementation to
> ignore the encoding declaration, if desired.
No, this wouldn't be right. I withdraw that comment :-)
>>(this also solves the problems you'd run into in case the Unicode
>>source code has a source code encoding comment).
>
>
> Well, that is precisely the issue that I'm trying to address here. I
> still believe that the resulting behaviour is not specified in the PEP
> at the moment (which is no big deal, since the current implementation
> does not touch compile() at all).
I'll try to come up with a proper wording tomorrow.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH