[Python-Dev] PEP: Defining Python Source Code Encodings
Tue, 17 Jul 2001 12:08:58 +0200
After having been through two rounds of comments with the "Unicode
Literal Encoding" pre-PEP, it has turned out that people actually
prefer to go for the full Monty, meaning that the PEP should handle
the complete Python source code encoding and not just the encoding
of the Unicode literals (which are currently the only parts in a
Python source code file for which Python assumes a fixed encoding).
Here's a summary of what I've learned from the comments:
1. The complete Python source file should use a single encoding.
2. Handling of escape sequences should continue to work as it does
now, but with all possible source code encodings; that is,
standard string literals (both 8-bit and Unicode) are subject to
escape sequence expansion while raw string literals only expand
a very small subset of escape sequences.
3. Python's tokenizer/compiler combo will need to be updated to
work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. tokenize the Unicode content
4. compile it, creating Unicode objects from the given Unicode data
and creating string objects from the Unicode literal data
by first reencoding the Unicode data into 8-bit string data
using the given file encoding
To make this backwards compatible, the implementation would have to
assume Latin-1 as the original file encoding if not given (otherwise,
binary data currently stored in 8-bit strings wouldn't make the
round trip).
4. The encoding used in a Python source file should be easily
parseable for an editor; a magic comment at the top of the
file seems to be what people want to see, so I'll drop the
directive (PEP 244) requirement in the PEP.
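To make the four processing steps and the Latin-1 default a bit more concrete, here is a rough sketch in present-day Python 3. It is only an illustration, not the proposed implementation: the helper names (decode_source, token_strings, reencode_literal) are made up for this example, and the standard tokenize module stands in for the interpreter's real tokenizer.

```python
# Sketch only: modern Python 3, hypothetical helper names.
import io
import tokenize

DEFAULT_ENCODING = "latin-1"  # the backwards-compatible default discussed above


def decode_source(raw, encoding=None):
    """Steps 1-2: turn the raw file bytes into Unicode text."""
    return raw.decode(encoding or DEFAULT_ENCODING)


def token_strings(text):
    """Step 3: tokenize the already-decoded Unicode text."""
    readline = io.StringIO(text).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)]


def reencode_literal(value, encoding=None):
    """Step 4: turn Unicode literal data back into 8-bit string data."""
    return value.encode(encoding or DEFAULT_ENCODING)


# Latin-1 maps every byte 0-255 to the code point with the same number,
# so arbitrary binary data survives decoding followed by re-encoding.
all_bytes = bytes(range(256))
round_tripped = reencode_literal(decode_source(all_bytes))

# A source line holding a non-ASCII 8-bit literal, stored as Latin-1.
raw_line = b"s = 'caf\xe9'\n"
tokens = token_strings(decode_source(raw_line))
```

With UTF-8 as the assumed default, the same all_bytes data would not even decode, which is why a byte-preserving default such as Latin-1 is needed for existing files.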
Issues that still need to be resolved:
- how to enable embedding of differently encoded data in Python
source code (e.g. UTF-8 encoded XML data in a Latin-1 source file)
- what to do with non-literal data in the source file, e.g.
variable names and comments:
* reencode them just as would be done for literals
* only allow ASCII for certain elements like variable names
- which format to use for the magic comment, e.g.
* Emacs style:
# -*- encoding = 'utf-8' -*-
* Via meta-option to the interpreter:
* Using a special comment format:
#!encoding = 'utf-8'
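For the two comment styles quoted above, detection by an editor or by the tokenizer could look roughly like the following sketch. Nothing here is decided: the regular expression, the find_encoding name, the choice to look only at the first two lines, and the Latin-1 fallback are all assumptions made for the sake of the example.

```python
import re

# Hypothetical pattern: accepts either comment style quoted above,
#   # -*- encoding = 'utf-8' -*-
#   #!encoding = 'utf-8'
MAGIC = re.compile(r"""^#\s*(?:-\*-|!)\s*encoding\s*=\s*['"]([-\w.]+)['"]""")


def find_encoding(lines, default="latin-1"):
    """Return the declared encoding, or the default if none is found.

    Assumption: only the first two lines are inspected, so that a
    regular #! interpreter line may come first.
    """
    for line in lines[:2]:
        match = MAGIC.match(line)
        if match:
            return match.group(1)
    return default


emacs_style = find_encoding(["# -*- encoding = 'utf-8' -*-", "import sys"])
special_style = find_encoding(["#!/usr/bin/python", "#!encoding = 'utf-8'"])
undeclared = find_encoding(["import sys"])
```

Both styles can be recognized from a single line without running the tokenizer, which is what makes them easy for an editor to pick up.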
Comments are welcome!
CEO eGenix.com Software GmbH
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/