[Python-Dev] PEP 263 - default encoding

Martin v. Loewis martin@v.loewis.de
16 Mar 2002 10:10:56 +0100


Guido van Rossum <guido@python.org> writes:

> But the treatment of k under phase 2 will be, um, interesting, and I'm
> not sure what it should do!!!  Since in phase 2 the entire file will
> be decoded from KOI8-R to Unicode before it's parsed, maybe the best
> thing would be to encode 8-bit string literals back using KOI8-R (in
> general, the encoding given in the encoding cookie).

The meaning of the string literals will not change: they continue to
denote byte strings, and they continue to denote the same byte strings
that they denote today (by accident).

What will change is this:
- it will be official that you can put KOI8-R into a string literal,
  and that the interpreter will produce the byte string "as-is"
- it will be an error if the byte string is not valid in the declared
  encoding, e.g. if you declare UTF-8, but have some string literal
  that violates the UTF-8 structure
- Python will determine token boundaries only after decoding the input,
  so a byte value of 34 does not necessarily indicate the end of a
  string anymore (if the decoder consumes the byte as the second byte
  of some character)
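
The last two points can be illustrated with a small sketch. It is
written for present-day Python 3 (where bytes is the byte string type
and str is Unicode), and it uses the standard codecs as a stand-in for
the parser's decoding step; it is not the parser itself. The choice of
ISO-2022-JP and of the character U+3001 is an assumption made for the
example, picked because that character's second byte is 34.

```python
# Sketch only: the codecs stand in for the parser's decoding step.

# Invalid input: 0xC0 can never start a valid UTF-8 sequence, so a
# file declared as UTF-8 that contains it cannot be decoded.
bad_utf8 = b's = "\xc0\xaf"\n'
try:
    bad_utf8.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Token boundaries: U+3001 (ideographic comma) is encoded in
# ISO-2022-JP as the byte pair 0x21 0x22, and 0x22 is byte value 34.
source = 's = "\u3001"\n'
raw = source.encode("iso2022_jp")

# Counting quotes on the raw bytes finds one more than counting them
# on the decoded text, because one byte 34 belongs to the character.
byte_quotes = raw.count(b'"')
char_quotes = raw.decode("iso2022_jp").count('"')

print(rejected, byte_quotes, char_quotes)
```

A tokenizer that scanned the raw bytes would close the string one byte
too early here; scanning the decoded text sees only the two real
quotes.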

In general, the implementation strategy will indeed be that string
literals are encoded back into their original encoding. It is not
clear to me when this should happen, though; in particular, whether
the AST should have Py_UNICODE* everywhere.
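
As a rough sketch of that strategy, again in present-day Python 3 and
with the codecs standing in for the parser: decode the bytes with the
declared encoding, then encode the literal back with the same
encoding. The variable names and the sample text are hypothetical. For
a single-byte encoding such as KOI8-R the round trip is exact; the
sketch does not check whether that holds for every encoding.

```python
# Sketch only: names and sample text are made up for illustration.
declared = "koi8-r"  # the encoding named in the encoding cookie

# The bytes of a string literal as they sit in the source file.
original = "\u043f\u0440\u0438\u0432\u0435\u0442".encode(declared)

# Phase 2: the whole file is decoded to Unicode before parsing...
decoded = original.decode(declared)

# ...and the 8-bit string literal is encoded back afterwards.
restored = decoded.encode(declared)

print(restored == original)
```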

Regards,
Martin