[Python-Dev] PEP 263 - Defining Python Source Code Encodings

14 Jul 2002 10:02:15 +0200

"Fredrik Lundh" <fredrik@pythonware.com> writes:

> hmm.  I'm tempted to think that there's a major
> flaw in the PEP, caused by the fact that
> 
>     compile(unicode(script, extract_encoding(script)))
> 
> will, from what I can tell, not compile to the same
> thing as:
> 
>     compile(script)

Can you elaborate on what you think the difference is? I believe the PEP
is silent on this specific aspect, but I think what should happen is
(in the Unicode case):

- compile will convert the script to UTF-8, which is then tokenized.
- in the process of parsing, the encoding declaration, if any (presumably
  the one extract_encoding was looking at as well), is recognized.
- Unicode literals are left as-is; byte string literals are converted
  back to the original encoding.
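
To make the three steps above concrete, here is a rough sketch, not the
actual compiler code. The helper names (find_declared_encoding,
stored_byte_literal) are made up for illustration, and the regular
expression is only an approximation of the encoding comment the PEP
describes.

```python
import re

# Approximation of the PEP's encoding comment, e.g. "# -*- coding: latin-1 -*-".
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def find_declared_encoding(script_text):
    """Return the encoding named in a comment on line 1 or 2, or None."""
    for line in script_text.splitlines()[:2]:
        if line.lstrip().startswith("#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
    return None


def stored_byte_literal(literal_text, script_text):
    """Hypothetical model of what a byte string literal ends up holding."""
    # Step 1: the Unicode script is converted to UTF-8 for the tokenizer.
    utf8_form = literal_text.encode("utf-8")
    # Step 2: the encoding declaration, if any, is recognized.
    declared = find_declared_encoding(script_text)
    if declared is None:
        # No declaration: this sketch leaves the bytes as UTF-8.
        return utf8_form
    # Step 3: byte string literals go back to the original encoding.
    return utf8_form.decode("utf-8").encode(declared)
```

With a script that starts with "# -*- coding: latin-1 -*-", a literal
containing e-acute would come out as a single Latin-1 byte under this
model, which is the same result as compiling the original byte string.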

So if there is an encoding declaration in script, then I cannot see a
difference. If there is none, the PEP does not specify what should
happen. Leaving the byte strings as UTF-8 seems safest, since the only
way to get "correct" non-ASCII strings without the encoding comment is
to use the UTF-8 signature.
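
For reference, checking a byte-string source for the UTF-8 signature
could look roughly like this. The function name split_utf8_signature is
an assumption for this example; codecs.BOM_UTF8 is the standard
library's constant for the three signature bytes.

```python
import codecs


def split_utf8_signature(raw_source):
    """Return (has_signature, source_without_signature) for a byte string."""
    if raw_source.startswith(codecs.BOM_UTF8):
        # Drop the signature so the tokenizer only sees the script itself.
        return True, raw_source[len(codecs.BOM_UTF8):]
    return False, raw_source
```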

In any case, this can't cause backwards compatibility
problems. compile accepts Unicode strings today only if they can be
converted to a byte string. In the standard installation, this will
fail today if script contains non-ASCII characters. So allowing Unicode in
compile is a pure extension. If its precise meaning is underspecified,
it should be clarified before stage 2 is implemented.
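
The compatibility argument can be modelled with a small check. This is
only a sketch of the conversion step, assuming the default encoding of a
standard installation is ASCII; converts_to_byte_string is a
hypothetical helper, not something compile itself exposes.

```python
def converts_to_byte_string(script_text, default_encoding="ascii"):
    """Return True if the Unicode script survives the implicit conversion."""
    try:
        script_text.encode(default_encoding)
    except UnicodeEncodeError:
        # Non-ASCII content: this is the case that fails today.
        return False
    return True
```

Only scripts for which this returns True are accepted at present, so
giving a meaning to the remaining ones cannot change the behaviour of
code that already works.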

Regards,
Martin