PEP: Defining Python Source Code Encodings
rnd at onego.ru
Tue Jul 17 13:08:02 CEST 2001
On Tue, 17 Jul 2001, M.-A. Lemburg wrote:
> After having been through two rounds of comments with the "Unicode
> Literal Encoding" pre-PEP, it has turned out that people actually
> prefer to go for the full Monty meaning that the PEP should handle
> the complete Python source code encoding and not just the encoding
> of the Unicode literals (which are currently the only parts in a
> Python source code file for which Python assumes a fixed encoding).
> Here's a summary of what I've learned from the comments:
> 1. The complete Python source file should use a single encoding.
> 2. Handling of escape sequences should continue to work as it does
> now, but with all possible source code encodings, that is
> standard string literals (both 8-bit and Unicode) are subject to
> escape sequence expansion while raw string literals only expand
> a very small subset of escape sequences.
> 3. Python's tokenizer/compiler combo will need to be updated to
> work as follows:
> 1. read the file
> 2. decode it into Unicode assuming a fixed per-file encoding
> 3. tokenize the Unicode content
> 4. compile it, creating Unicode objects from the given Unicode data
> and creating string objects from the Unicode literal data
> by first reencoding the Unicode data into 8-bit string data
> using the given file encoding
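A minimal sketch of the four quoted steps, written in present-day Python rather than the 2001 interpreter. The helper name `string_literals_as_bytes` is hypothetical, and the standard `tokenize` module stands in for the real tokenizer/compiler; it only illustrates the decode, tokenize, re-encode order.

```python
import ast
import io
import tokenize


def string_literals_as_bytes(raw, encoding):
    """Hypothetical helper: return each string literal re-encoded to 8-bit data."""
    # Step 2: decode the whole file using the single per-file encoding.
    text = raw.decode(encoding)
    # Step 3: tokenize the Unicode content.
    tokens = tokenize.generate_tokens(io.StringIO(text).readline)
    literals = []
    for tok in tokens:
        if tok.type == tokenize.STRING:
            # Escape sequences are expanded on the Unicode form of the literal.
            value = ast.literal_eval(tok.string)
            if isinstance(value, str):
                # Step 4: re-encode the literal using the file encoding.
                literals.append(value.encode(encoding))
    return literals


# Step 1 (reading the file) is simulated with an in-memory byte string.
source = "s = 'caf\xe9'\n".encode("latin-1")
result = string_literals_as_bytes(source, "latin-1")
```

With a Latin-1 source, the literal comes back as the same bytes that were in the file.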
I think that if no encoding is given, the parser must silently assume an
"UNKNOWN" encoding and do nothing, that is, stay 8-bit clean (as it is now).
Otherwise, it will slow the parser down considerably.
I also think that if an encoding is chosen, there is no need to reencode it
back to literal strings: let them be in Unicode.
Or the encoding must _always_ be ASCII+something, as UTF-8 is, for example.
That would eliminate the need to bother with the tokenizer (because only docstrings,
comments and string-literals are entities which require encoding /
If I understood correctly, Python will soon switch to "unicode-only"
strings, as Java and Tcl did. (This is of course a disaster for some Python
usage areas such as fast text-processing, but...)
Or am I missing something?
> To make this backwards compatible, the implementation would have to
> assume Latin-1 as the original file encoding if not given (otherwise,
> binary data currently stored in 8-bit strings wouldn't make the
...as I said, there must be no assumed charset. Things must
be left as they are now when no explicit encoding is given.
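For context on why Latin-1 was proposed as the fallback: Latin-1 maps every byte value 0-255 to the code point with the same number, so decoding and re-encoding leaves arbitrary 8-bit data unchanged. A small sketch in present-day Python (variable names are illustrative only):

```python
# Every possible byte value, standing in for binary data kept in an 8-bit string.
data = bytes(range(256))

# Latin-1 decodes any byte sequence and encodes back to the identical bytes.
round_tripped = data.decode("latin-1").encode("latin-1")

# A multi-byte encoding such as UTF-8 rejects the same data.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The round trip is lossless, but it still costs a decode and an encode pass, which is the overhead the "leave things as they are" position avoids.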
> 4. The encoding used in a Python source file should be easily
> parseable for an editor; a magic comment at the top of the
> file seems to be what people want to see, so I'll drop the
> directive (PEP 244) requirement in the PEP.
> Issues that still need to be resolved:
> - how to enable embedding of differently encoded data in Python
> source code (e.g. UTF-8 encoded XML data in a Latin-1
> source file)
Probably by adding explicit conversions.
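One way such an explicit conversion could look, sketched in present-day Python. It assumes the literal has already been decoded from a Latin-1 source file; the variable names are made up for illustration.

```python
# The literal as it would appear after decoding a Latin-1 source file.
xml_text = "<name>Jos\xe9</name>"

# Explicit conversion to the encoding the embedded data is supposed to carry.
xml_utf8 = xml_text.encode("utf-8")

# The same text expressed in the source file's own encoding, for comparison.
xml_latin1 = xml_text.encode("latin-1")
```

The source file stays in one encoding, and the programmer states the target encoding at the point where the data is used.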
> - what to do with non-literal data in the source file, e.g.
> variable names and comments:
> * reencode them just as would be done for literals
> * only allow ASCII for certain elements like variable names
I think non-literal data must be in ASCII.
But it could be too cheesy to have variable names in national
> - which format to use for the magic comment, e.g.
> * Emacs style:
> # -*- encoding = 'utf-8' -*-
> * Via meta-option to the interpreter:
> #!/usr/bin/python --encoding=utf-8
> * Using a special comment format:
> #!encoding = 'utf-8'
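To show how an editor or the interpreter might read any of the three candidate formats, here is a rough sketch. The function name `find_declared_encoding`, the regular expression, and the choice to look only at the first two lines are assumptions made for this illustration, not part of the proposal.

```python
import re

# Matches "encoding = 'utf-8'", "encoding=utf-8" and similar spellings.
_ENCODING_RE = re.compile(r"""encoding\s*=\s*['"]?([-\w.]+)""")


def find_declared_encoding(source_head, default=None):
    """Hypothetical helper: return the encoding named in a leading comment."""
    for line in source_head.splitlines()[:2]:
        # Only comment lines (including "#!" lines) may carry the declaration.
        if line.startswith("#"):
            match = _ENCODING_RE.search(line)
            if match:
                return match.group(1)
    # No declaration found: the caller decides what that means.
    return default


emacs_style = find_declared_encoding("# -*- encoding = 'utf-8' -*-\n")
meta_option = find_declared_encoding("#!/usr/bin/python --encoding=utf-8\n")
special_comment = find_declared_encoding("#!encoding = 'utf-8'\n")
undeclared = find_declared_encoding("print('hello')\n")
```

Returning `None` when nothing is declared leaves room for either policy discussed in this thread: falling back to Latin-1, or assuming no charset at all.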
No variant is ideal. The 2nd is the worst or the best of all
(it depends on how you look at it!)
Python has no macro directives. In this situation
they could help greatly!
That "#!encoding" is a special case of a macro directive.
Maybe just put something like ''# <!DOCTYPE HTML PUBLIC''
at the beginning...
Or, an even greater idea occurred to me: allow some XML
with meta-information (not only encoding), somehow escaped.
I think GvR could come up with some advice here...
> Comments are welcome !
Sincerely yours, Roman A.Suzi
- Petrozavodsk - Karelia - Russia - mailto:rnd at onego.ru -