[Python-Dev] Re: PEP: Defining Python Source Code Encodings
M.-A. Lemburg
mal@lemburg.com
Tue, 17 Jul 2001 14:11:21 +0200
Roman Suzi wrote:
>
> On Tue, 17 Jul 2001, M.-A. Lemburg wrote:
>
> > After having been through two rounds of comments with the "Unicode
> > Literal Encoding" pre-PEP, it has turned out that people actually
> > prefer to go for the full Monty, meaning that the PEP should handle
> > the complete Python source code encoding and not just the encoding
> > of the Unicode literals (which are currently the only parts in a
> > Python source code file for which Python assumes a fixed encoding).
> >
> > Here's a summary of what I've learned from the comments:
> >
> > 1. The complete Python source file should use a single encoding.
>
> Yes, certainly
>
> > 2. Handling of escape sequences should continue to work as it does
> > now, but with all possible source code encodings, that is
> > standard string literals (both 8-bit and Unicode) are subject to
> > escape sequence expansion while raw string literals only expand
> > a very small subset of escape sequences.
> >
> > 3. Python's tokenizer/compiler combo will need to be updated to
> > work as follows:
> >
> > 1. read the file
> > 2. decode it into Unicode assuming a fixed per-file encoding
> > 3. tokenize the Unicode content
> > 4. compile it, creating Unicode objects from the given Unicode data
> > and creating string objects from the Unicode literal data
> > by first reencoding the Unicode data into 8-bit string data
> > using the given file encoding
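In code, steps 1 through 4 might look roughly like the sketch below
(illustrative Python only; compile_source() and its Latin-1 default are
assumptions on my part, not the actual tokenizer changes):

    # Illustrative sketch only: in reality this happens inside the
    # tokenizer/compiler, not in Python code.
    def compile_source(filename, encoding='latin-1'):
        raw = open(filename, 'rb').read()     # 1. read the file
        text = unicode(raw, encoding)         # 2. decode it into Unicode
        # 3./4. tokenize and compile the Unicode content; 8-bit string
        # literals would be created by reencoding their Unicode data
        # using the file encoding, e.g.
        #     literal_8bit = literal_unicode.encode(encoding)
        # The built-in compile() stands in for that machinery here:
        return compile(text.encode(encoding), filename, 'exec')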
>
> I think that if the encoding is not given, it must silently assume "UNKNOWN"
> encoding and do nothing, that is, be 8-bit clean (as it is now).
To be 8-bit clean it will have to use Latin-1 as the fallback encoding,
since that encoding assures roundtrip safety (decode to Unicode,
then reencode).
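A two-liner illustrates the roundtrip property (Python syntax, purely
for illustration):

    # Latin-1 maps each of the 256 byte values to the Unicode code point
    # with the same ordinal, so decode followed by reencode never loses data:
    all_bytes = ''.join(map(chr, range(256)))
    assert unicode(all_bytes, 'latin-1').encode('latin-1') == all_bytes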
> Otherwise, it will slow down the parser considerably.
Yes, that could be an issue (I don't think it matters much though,
since parsing is usually only done during byte-code compilation and
the results are cached in .pyc files).
> I also think that if an encoding is chosen, there is no need to reencode it
> back into literal strings: let them be in Unicode.
That would be nice, but is not feasible at the moment (just try
to run Python with the -U option and see what happens...).
> Or the encoding must _always_ be ASCII+something, such as utf-8 for example,
> eliminating the need to bother with the tokenizer (because only docstrings,
> comments and string literals are entities which require encoding /
> decoding).
>
> If I understood correctly, Python will soon switch to "unicode-only"
> strings, as Java and Tcl did. (This is of course a disaster for some Python
> usage areas such as fast text processing, but...)
>
> Or am I missing something?
It won't switch any time soon... there's still too much work
ahead and I'm also pretty sure that the 8-bit string type won't
go away for backward compatibility reasons.
> > To make this backwards compatible, the implementation would have to
> > assume Latin-1 as the original file encoding if not given (otherwise,
> > binary data currently stored in 8-bit strings wouldn't make the
> > roundtrip).
>
> ...as I said, there must be no assumed charset. Things must
> be left as they are now when no explicit encoding is given.
This is what the Latin-1 encoding assures.
> > 4. The encoding used in a Python source file should be easily
> >    parseable by an editor; a magic comment at the top of the
> > file seems to be what people want to see, so I'll drop the
> > directive (PEP 244) requirement in the PEP.
> >
> > Issues that still need to be resolved:
> >
> > - how to enable embedding of differently encoded data in Python
> > source code (e.g. UTF-8 encoded XML data in a Latin-1
> > source file)
>
> Probably, adding explicit conversions.
Yes, but there are cases where the source file containing the embedded
data will not decode into Unicode (I got the example backwards:
think of a UTF-8 encoded source file with a Latin-1 string literal).
Perhaps we should simply rule out this case and have the
programmer stick to the source file encoding + some escaping
or a run-time recoding of the literal data into the preferred
encoding.
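As an illustration of those two workarounds (a sketch only; assume a
UTF-8 encoded source file that needs a few Latin-1 bytes):

    # 1. Escaping: spell the bytes out, so the literal stays ASCII-clean
    #    and the file decodes fine under any ASCII superset such as UTF-8:
    latin1_data = '\xe9\xe8\xea'
    # 2. Run-time recoding: keep the data in a Unicode literal and recode
    #    it into the preferred encoding when needed:
    latin1_data = u'\xe9\xe8\xea'.encode('latin-1')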
> > - what to do with non-literal data in the source file, e.g.
> > variable names and comments:
> >
> > * reencode them just as would be done for literals
> > * only allow ASCII for certain elements like variable names
> > etc.
>
> I think non-literal data must be in ASCII.
> But it could be too cheesy to have variable names in a national
> alphabet ;-)
That's for Guido to decide...
> > - which format to use for the magic comment, e.g.
> >
> > * Emacs style:
> >
> > #!/usr/bin/python
> > # -*- encoding = 'utf-8' -*-
> >
> > * Via meta-option to the interpreter:
> >
> > #!/usr/bin/python --encoding=utf-8
> >
> > * Using a special comment format:
> >
> > #!/usr/bin/python
> > #!encoding = 'utf-8'
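Whichever format wins, an editor (or the tokenizer) would only have to
scan the first line or two of the file. A rough sketch which happens to
match all three candidates above (regex and helper name are illustrative
only, not part of the proposal):

    import re

    # Look for an encoding declaration in the first two lines of the file;
    # fall back to Latin-1 as discussed above.
    _magic_re = re.compile(r"""#.*?encoding\s*[:=]\s*['"]?([-\w.]+)""")

    def guess_source_encoding(filename, default='latin-1'):
        for line in open(filename).readlines()[:2]:
            m = _magic_re.search(line)
            if m:
                return m.group(1)
        return default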
>
> No variant is ideal. The 2nd is the worst/best of all
> (it depends on how you look at it!)
>
> Python has no macro directives. In this situation
> they could help greatly!
We've been discussing these on python-dev, but Guido is not
too keen on having them.
> That "#!encoding" is a special case of a macro directive.
>
> Maybe just put something like ''# <!DOCTYPE HTML PUBLIC''
> at the beginning...
>
> Or, an even better idea occurred to me: allow some XML
> with meta-information (not only the encoding), somehow escaped.
>
> I think, GvR could come with some advice here...
>
> > Comments are welcome !
Thanks for your comments,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/