[Python-Dev] Re: PEP: Defining Python Source Code Encodings

M.-A. Lemburg mal@lemburg.com
Tue, 17 Jul 2001 14:11:21 +0200

Roman Suzi wrote:
> On Tue, 17 Jul 2001, M.-A. Lemburg wrote:
> > After having been through two rounds of comments with the "Unicode
> > Literal Encoding" pre-PEP, it has turned out that people actually
> > prefer to go for the full Monty, meaning that the PEP should handle
> > the complete Python source code encoding and not just the encoding
> > of the Unicode literals (which are currently the only parts in a
> > Python source code file for which Python assumes a fixed encoding).
> >
> > Here's a summary of what I've learned from the comments:
> >
> > 1. The complete Python source file should use a single encoding.
> Yes, certainly
> > 2. Handling of escape sequences should continue to work as it does
> >    now, but with all possible source code encodings, that is
> >    standard string literals (both 8-bit and Unicode) are subject to
> >    escape sequence expansion while raw string literals only expand
> >    a very small subset of escape sequences.
> >
> > 3. Python's tokenizer/compiler combo will need to be updated to
> >    work as follows:
> >
> >    1. read the file
> >    2. decode it into Unicode assuming a fixed per-file encoding
> >    3. tokenize the Unicode content
> >    4. compile it, creating Unicode objects from the given Unicode data
> >       and creating string objects from the Unicode literal data
> >       by first reencoding the Unicode data into 8-bit string data
> >       using the given file encoding
> I think that if no encoding is given, it must silently assume an "UNKNOWN"
> encoding and do nothing, that is, be 8-bit clean (as it is now).

To be 8-bit clean it will have to use Latin-1 as fallback encoding
since this encoding assures the roundtrip safety (decode to Unicode,
then reencode).
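
The roundtrip argument can be checked directly. The following is only a
minimal sketch, written for a present-day Python 3 interpreter rather than
the interpreter under discussion; the helper name roundtrips() is made up
for illustration. It pushes every possible byte value through a decode
followed by a reencode and compares the result with the input:

```python
# All 256 possible byte values, standing in for arbitrary binary data
# stored in an 8-bit string literal.
raw = bytes(range(256))


def roundtrips(data, encoding):
    """Return True if decoding and then reencoding yields the same bytes.

    Hypothetical helper, used only for this illustration.
    """
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        # The encoding cannot even decode some of the byte values.
        return False


latin1_ok = roundtrips(raw, "latin-1")  # every byte maps to one code point
utf8_ok = roundtrips(raw, "utf-8")      # stray high bytes are invalid UTF-8
ascii_ok = roundtrips(raw, "ascii")     # bytes >= 0x80 are not ASCII
```

Only the Latin-1 case survives for arbitrary 8-bit data, which is why it
is the candidate for the fallback encoding.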
> Otherwise, it will slow down parser considerably.

Yes, that could be an issue (I don't think it matters much though,
since parsing is usually only done during byte-code compilation and
the results are cached in .pyc files).
> I also think that if an encoding is chosen, there is no need to reencode it
> back to literal strings: let them be in Unicode.

That would be nice, but is not feasible at the moment (just try
to run Python with the -U option and see what happens...).
> Or the encoding must _always_ be ASCII+something, as utf-8 for example.
> That would eliminate the need to bother with the tokenizer (because only
> docstrings, comments and string literals are entities which require
> encoding / decoding).
> If I understood correctly, Python will soon switch to "unicode-only"
> strings, as Java and Tcl did. (This is of course a disaster for some Python
> usage areas such as fast text-processing, but...)
> Or am I missing something?

It won't switch any time soon... there's still too much work
ahead and I'm also pretty sure that the 8-bit string type won't
go away for backward compatibility reasons.
> >    To make this backwards compatible, the implementation would have to
> >    assume Latin-1 as the original file encoding if not given (otherwise,
> >    binary data currently stored in 8-bit strings wouldn't make the
> >    roundtrip).
> ...as I said, there must be no assumed charset. Things must
> be left as they are now when no explicit encoding is given.

This is what the Latin-1 encoding assures.
> > 4. The encoding used in a Python source file should be easily
> >    parseable for an editor; a magic comment at the top of the
> >    file seems to be what people want to see, so I'll drop the
> >    directive (PEP 244) requirement in the PEP.
> >
> > Issues that still need to be resolved:
> >
> > - how to enable embedding of differently encoded data in Python
> >   source code (e.g. UTF-8 encoded XML data in a Latin-1
> >   source file)
> Probably, adding explicit conversions.

Yes, but there are cases where the source file having the embedded
data will not decode into Unicode (I got the example backwards:
think of a UTF-8 encoded source file with a Latin-1 string literal).

Perhaps we should simply rule out this case and have the
programmer stick to the source file encoding + some escaping
or a run-time recoding of the literal data into the preferred
encoding.
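
To make the problem case and the two workarounds concrete, here is a
small sketch. It assumes a present-day Python 3 interpreter, and the
sample data (the word "café") is invented for the example:

```python
# A hypothetical source file that is supposed to be UTF-8, but whose
# string literal contains the raw Latin-1 byte 0xE9 (e-acute).
source = b"s = '" + b"caf\xe9" + b"'\n"

try:
    source.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence, but no continuation
    # byte follows, so the file as a whole cannot be decoded.
    decodes = False

# Workaround 1: keep the source pure ASCII and escape the byte.
escaped = b"caf\xe9"

# Workaround 2: write the text as a Unicode literal (spelled with a \u
# escape here so that this example itself stays ASCII) and recode it
# into the preferred encoding at run time.
recoded = "caf\u00e9".encode("latin-1")
```

Both workarounds end up with the same 8-bit data, without ever putting
bytes into the file that contradict its declared encoding.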
> > - what to do with non-literal data in the source file, e.g.
> >   variable names and comments:
> >
> >   * reencode them just as would be done for literals
> >   * only allow ASCII for certain elements like variable names
> >   etc.
> I think non-literal data must be in ASCII.
> But it could be too cheesy to have variable names in national
> alphabet ;-)

That's for Guido to decide...
> > - which format to use for the magic comment, e.g.
> >
> >   * Emacs style:
> >
> >       #!/usr/bin/python
> >       # -*- encoding = 'utf-8' -*-
> >
> >   * Via meta-option to the interpreter:
> >
> >       #!/usr/bin/python --encoding=utf-8
> >
> >   * Using a special comment format:
> >
> >       #!/usr/bin/python
> >       #!encoding = 'utf-8'
> No variant is ideal. The 2nd is the worst/best of all
> (it depends on how you look at it!)
> Python has no macro directives. In this situation
> they could help greatly!

We've been discussing these on python-dev, but Guido is not
too keen on having them.
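
Whichever of the three magic comment formats quoted above wins, finding
the encoding should stay cheap for editors and for the tokenizer. The
following is only a rough sketch for a present-day Python 3 interpreter:
the function name detect_encoding(), the regular expression and the
two-line limit are assumptions made for illustration, not a proposed
specification.

```python
import re

# Matches "encoding = 'utf-8'" as well as "--encoding=utf-8".
# The pattern is an assumption for this sketch only.
_ENCODING_RE = re.compile(rb"encoding\s*=\s*['\"]?([-\w.]+)")


def detect_encoding(source, default="latin-1"):
    """Return the encoding named in the leading comment lines.

    Only the first two lines are inspected, and only if they are
    comments; otherwise the Latin-1 fallback is returned.
    """
    for line in source.splitlines()[:2]:
        if not line.startswith(b"#"):
            break
        match = _ENCODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default


emacs_style = b"#!/usr/bin/python\n# -*- encoding = 'utf-8' -*-\n"
meta_option = b"#!/usr/bin/python --encoding=utf-8\n"
special_comment = b"#!/usr/bin/python\n#!encoding = 'utf-8'\n"
no_comment = b"import sys\n"
```

With this sketch, all three candidate formats yield "utf-8", and a file
without any magic comment falls back to Latin-1.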
> That "#!encoding" is a special case of a macro directive.
> Maybe just put something like "# <!DOCTYPE HTML PUBLIC"
> at the beginning...
> Or, an even greater idea occurred to me: allow some XML
> with meta-information (not only encoding) somehow escaped.
> I think GvR could come up with some advice here...
> > Comments are welcome !

Thanks for your comments,
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/