[Python-Dev] PEP 263 - default encoding

15 Mar 2002 21:42:43 +0100

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

>     mal> I have reworded the phase 1 implementation as follows:
> 
>     mal>     1. Implement the magic comment detection, but only apply
>     mal> the detected encoding to Unicode literals in the source file.
> 
> a. Does this really make sense for UTF-16?  It looks to me like a
> great way to induce bugs of the form "write a unicode literal
> containing 0x0A, then translate it to raw form by stripping the u
> prefix."

I'm not sure I understand the question. UTF-16 is not supported as a
source encoding, so no, it does not make sense for it to be applied to
Unicode literals.
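
As an illustration of why a byte-oriented lexer and UTF-16 do not mix
(a sketch in present-day Python, not part of the PEP; the sample
character is chosen arbitrarily), note that UTF-16 code units can
contain bytes that look like ASCII control characters such as 0x0A:

```python
# Sketch only: shows that UTF-16 bytes can collide with ASCII bytes
# that a byte-level lexer treats specially (newline, NUL).
sample = "\u010a"  # LATIN CAPITAL LETTER C WITH DOT ABOVE, picked arbitrarily
utf16_bytes = sample.encode("utf-16-le")  # code unit 0x010A, low byte first

# The low byte of the code unit is 0x0A, i.e. an ASCII line feed.
contains_linefeed = b"\n" in utf16_bytes

# Even plain ASCII text picks up NUL bytes in UTF-16.
ascii_as_utf16 = "a".encode("utf-16-le")
contains_nul = b"\x00" in ascii_as_utf16
```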

> b. No editor is likely to implement correct display to distinguish
> between u"" and just "".

The declared encoding applies to the entire file. In phase 1, Python
does not use that for anything but Unicode literals, though.

Even in phase 2, non-ASCII characters will be allowed only in comments and
string literals. Comments are ignored by the Python lexer (except for
encoding/tab declarations). For string literals, the meaning of the
literal does not change even if the encoding is considered: the string
literal continues to denote the same sequence of bytes.

The only differences in phase 2 will be these:

- if there is an encoding violation inside a comment or a string literal,
  Python will reject the source code (simply because decoding fails)

- if the declared encoding uses \ or " as the second byte of a multi-byte
  character, Python will correctly parse the string. In phase 1, it may
  fail to correctly determine the end of the string.
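
The second point can be sketched as follows. This is only an
illustration in present-day Python: Shift_JIS is my choice of example
encoding, and naive_string_end is a hypothetical stand-in for a
byte-level scanner, not the actual tokenizer code.

```python
# Sketch only: Shift_JIS is assumed here as an example of an encoding
# whose trail bytes overlap ASCII; naive_string_end is hypothetical.
ch = "\u30bd"                   # KATAKANA LETTER SO
raw = ch.encode("shift_jis")    # two bytes; the second one is 0x5C ("\")

def naive_string_end(source: bytes, start: int) -> int:
    """Scan bytes for the closing quote, treating 0x5C as an escape.

    Returns the index of the closing quote, or -1 if none is found.
    """
    i = start
    while i < len(source):
        if source[i:i + 1] == b"\\":
            i += 2              # skip the "escaped" byte
            continue
        if source[i:i + 1] == b'"':
            return i
        i += 1
    return -1

literal = b'"' + raw + b'"'     # the bytes of a one-character string literal

# Phase-1 style: scanning raw bytes mistakes the trail byte for a
# backslash, skips the real closing quote, and never finds the end.
naive_end = naive_string_end(literal, 1)

# Phase-2 style: decode first, then look for the quote among characters.
decoded = literal.decode("shift_jis")
decoded_end = decoded.index('"', 1)
```

Decoding before scanning is what lets phase 2 find the end of the
literal in such a case.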

> c. This definitely breaks Emacs coding cookie semantics.  Emacs
> applies the coding cookie to the whole buffer.  

So does Python. It just side-steps part of the code conversions in
phase 1. 

> d. You probably have to deprecate ISO 2022 7-bit coding systems, too,
> because people will try to get the representation of a string by
> inputting a raw string in coded form.  This might contain a quote
> character.

We don't deprecate them; we just don't support them in phase 1. Users
of these encodings are encouraged to contribute a phase 2
implementation.

> e. This causes problems for UTF-8 transition, since people will want
> to put arbitrary byte strings in a raw string.  

No, they won't. Also, if the declared encoding is UTF-8, it is
incorrect to put arbitrary byte strings into a string literal - but
the implementation does not detect this violation.
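
To make "incorrect" concrete: not every byte sequence is well-formed
UTF-8. A minimal check, written in present-day Python with a
hypothetical helper name, might look like this:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A properly encoded non-ASCII character is fine.
ok = is_valid_utf8("\u00e9".encode("utf-8"))   # bytes 0xC3 0xA9

# The Latin-1 encoding of the same character is not valid UTF-8,
# and neither is an arbitrary binary blob such as 0xFF 0xFE.
latin1_ok = is_valid_utf8(b"\xe9")
blob_ok = is_valid_utf8(b"\xff\xfe")
```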

Regards,
Martin