[Python-Dev] PEP 263 - Defining Python Source Code Encodings
Martin v. Loewis
martin@v.loewis.de
14 Jul 2002 10:02:15 +0200
"Fredrik Lundh" <fredrik@pythonware.com> writes:
> hmm. I'm tempted to think that there's a major
> flaw in the PEP, caused by the fact that
>
> compile(unicode(script, extract_encoding(script)))
>
> will, from what I can tell, not compile to the same
> thing as:
>
> compile(script)
Can you elaborate on what you think the difference is? I believe the PEP
is silent on this specific aspect, but I think what should happen is
(in the Unicode case):
- compile will convert the script to UTF-8, which is then tokenized.
- in the process of parsing, the encoding declaration, if any, is
recognized (presumably the same one extract_encoding looked at).
- Unicode literals are left as-is; byte string literals are converted
back to the original encoding.
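
As a rough illustration of the three steps above, here is a minimal
sketch in present-day Python 3. It is not the PEP's implementation.
extract_encoding is the name used in the quoted message; its body here
is an assumption, with a regular expression modeled on the pattern
given in PEP 263. The sample script and the latin-1 declaration are
made up for the example.

```python
import re

# Modeled on the declaration pattern in PEP 263 (first or second line).
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def extract_encoding(script: bytes, default: str = "ascii") -> str:
    """Hypothetical helper: return the encoding declared in a source byte string."""
    for line in script.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default


# Assumed sample: a Latin-1 encoded script with an encoding declaration.
script = "# -*- coding: latin-1 -*-\ns = 'caf\xe9'\n".encode("latin-1")
encoding = extract_encoding(script)

# Step 1: the Unicode form of the script is converted to UTF-8 for tokenizing.
as_unicode = script.decode(encoding)
utf8_form = as_unicode.encode("utf-8")

# Step 2: the declaration is pure ASCII, so it is still found in the UTF-8 form.
declared = extract_encoding(utf8_form)

# Step 3: the contents of a byte string literal are converted back from
# UTF-8 to the declared encoding; Unicode literals would be left alone.
literal_utf8 = "caf\xe9".encode("utf-8")
literal_original = literal_utf8.decode("utf-8").encode(declared)
```

With a declaration present, the byte string literal ends up with the
same bytes it had in the original file, which is why no difference is
expected in that case.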
So if there is an encoding declaration in script, then I cannot see a
difference. If there is none, the PEP does not say what should
happen. Leaving the byte strings as UTF-8 seems safest, since the only
way to get "correct" non-ASCII strings without the encoding comment is
to use the UTF-8 signature.
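
For reference, the UTF-8 signature is the three-byte sequence
EF BB BF at the start of the file. A minimal sketch of detecting it in
present-day Python 3 follows; has_utf8_signature is a hypothetical
helper name, and the sample source is made up for the example.

```python
import codecs


def has_utf8_signature(script: bytes) -> bool:
    """Hypothetical helper: True if the source starts with the UTF-8 signature."""
    return script.startswith(codecs.BOM_UTF8)


# Assumed sample: UTF-8 source with a signature and no encoding comment.
source = codecs.BOM_UTF8 + "s = 'caf\xe9'\n".encode("utf-8")
signed = has_utf8_signature(source)

# The utf-8-sig codec strips the signature while decoding.
text = source.decode("utf-8-sig")
```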
In any case, this can't cause backwards compatibility problems.
compile accepts Unicode strings today only if they can be converted
to a byte string. In the standard installation, this fails today if
there are non-ASCII characters in script. So allowing Unicode in
compile is a pure extension. If its precise meaning is underspecified,
it should be clarified before stage 2 is implemented.
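
The failure mode described above can be imitated in present-day
Python 3 by doing the conversion explicitly. This is only a sketch:
to_default_bytes is a hypothetical stand-in for the implicit
Unicode-to-byte-string conversion, and "ascii" is assumed as the
default encoding of a standard installation.

```python
def to_default_bytes(script: str, default_encoding: str = "ascii") -> bytes:
    """Hypothetical stand-in for the implicit conversion done before compiling."""
    return script.encode(default_encoding)


# A pure-ASCII script converts without trouble.
ascii_ok = to_default_bytes("x = 1\n")

# A script containing non-ASCII characters cannot be converted.
try:
    to_default_bytes("s = 'caf\u00e9'\n")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Since scripts of the second kind are rejected today, giving them a
meaning cannot change the behavior of any code that currently works.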
Regards,
Martin