[Python-Dev] PEP: Defining Python Source Code Encodings

M.-A. Lemburg mal@lemburg.com
Tue, 17 Jul 2001 12:08:58 +0200


After two rounds of comments on the "Unicode Literal Encoding"
pre-PEP, it has turned out that people actually prefer to go for
the full Monty, meaning that the PEP should handle the complete
Python source code encoding and not just the encoding of the
Unicode literals (which are currently the only parts of a Python
source code file for which Python assumes a fixed encoding).

Here's a summary of what I've learned from the comments:

1. The complete Python source file should use a single encoding.

2. Handling of escape sequences should continue to work as it does
   now, but with all possible source code encodings. That is,
   standard string literals (both 8-bit and Unicode) are subject to
   escape sequence expansion, while raw string literals only expand
   a very small subset of escape sequences.

3. Python's tokenizer/compiler combo will need to be updated to
   work as follows:

   1. read the file
   2. decode it into Unicode assuming a fixed per-file encoding
   3. tokenize the Unicode content
   4. compile it, creating Unicode objects from the given Unicode data
      and creating string objects from the Unicode literal data
      by first reencoding the Unicode data into 8-bit string data
      using the given file encoding

   To make this backwards compatible, the implementation would have to
   assume Latin-1 as the original file encoding if not given (otherwise,
   binary data currently stored in 8-bit strings wouldn't make the
   roundtrip).

4. The encoding used in a Python source file should be easily
   parseable for an editor; a magic comment at the top of the
   file seems to be what people want to see, so I'll drop the
   directive (PEP 244) requirement in the PEP.
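
To illustrate the decode/reencode roundtrip from item 3, here is a
minimal sketch in present-day Python. The function names are made up
for this example and are not part of any proposed implementation; the
only thing taken from the text above is the Latin-1 fallback when no
encoding is given.

```python
def decode_source(raw, encoding=None):
    """Steps 1-2: turn the raw file bytes into Unicode text."""
    # Assumption taken from item 3: fall back to Latin-1 if no
    # encoding was declared for the file.
    return raw.decode(encoding or "latin-1")


def reencode_literal(text, encoding=None):
    """Step 4: rebuild 8-bit string data from the decoded text."""
    return text.encode(encoding or "latin-1")


# Every possible byte value survives decode + reencode under Latin-1,
# so binary data stored in 8-bit strings makes the roundtrip.
all_bytes = bytes(range(256))
assert reencode_literal(decode_source(all_bytes)) == all_bytes

# A stricter default such as UTF-8 would reject arbitrary binary data.
try:
    decode_source(b"\xff\xfe", "utf-8")
    utf8_accepts_binary = True
except UnicodeDecodeError:
    utf8_accepts_binary = False
```

Latin-1 maps each of the 256 byte values to exactly one code point,
which is why it works as the compatible default here.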

Issues that still need to be resolved:

- how to enable embedding of differently encoded data in Python
  source code (e.g. UTF-8 encoded XML data in a Latin-1
  source file)

- what to do with non-literal data in the source file, e.g.
  variable names and comments:

  * reencode them just as would be done for literals
  * only allow ASCII for certain elements like variable names,
    etc.

- which format to use for the magic comment, e.g.

  * Emacs style:

      #!/usr/bin/python
      # -*- encoding = 'utf-8' -*-

  * Via meta-option to the interpreter:

      #!/usr/bin/python --encoding=utf-8

  * Using a special comment format:

      #!/usr/bin/python
      #!encoding = 'utf-8'
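
As a rough sketch of how an editor or the tokenizer might pick up any
of the three candidate formats above: the regular expression, the
function name, the two-line limit and the Latin-1 default are all
assumptions made for this example, not a settled syntax.

```python
import re

# Hypothetical pattern: accepts both "encoding = 'utf-8'" and
# "--encoding=utf-8"; the quotes around the name are optional.
ENCODING_RE = re.compile(r"encoding\s*=\s*['\"]?([-\w.]+)['\"]?")


def find_declared_encoding(source_lines, default="latin-1"):
    """Return the encoding named in the first two lines, else default."""
    # Assumption: "top of the file" means the first two lines.
    for line in source_lines[:2]:
        if not line.startswith("#"):
            continue  # only comment lines (including #! lines) count
        match = ENCODING_RE.search(line)
        if match:
            return match.group(1)
    return default


emacs_style = ["#!/usr/bin/python", "# -*- encoding = 'utf-8' -*-"]
meta_option = ["#!/usr/bin/python --encoding=utf-8"]
special_comment = ["#!/usr/bin/python", "#!encoding = 'utf-8'"]

for candidate in (emacs_style, meta_option, special_comment):
    assert find_declared_encoding(candidate) == "utf-8"
```

A file without any such comment falls back to the default, which
matches the backwards compatibility assumption in item 3.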

Comments are welcome!

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/