[Python-Dev] RE: Defining Unicode Literal Encodings

M.-A. Lemburg mal@lemburg.com
Fri, 13 Jul 2001 23:56:40 +0200

Tim Peters wrote:
> [M.-A. Lemburg]
> > PEP: 0263 (?)
> > Title: Defining Unicode Literal Encodings
> > Version: $Revision: 1.0 $
> > Author: mal@lemburg.com (Marc-André Lemburg)
> > Status: Draft
> > Type: Standards Track
> > Python-Version: 2.3
> > Created: 06-Jun-2001
> > Post-History:
> Since this depends on PEP 244, it should also have a
>   Requires: 244
> header line.

Ok, I'll add that.
> > ...
> > ... can be set using the "directive" statement proposed in PEP 244.
> >
> >     The syntax for the directives is as follows:
> >
> >     'directive' WS+ 'unicodeencoding' WS* '=' WS* PYTHONSTRINGLITER=
> >     'directive' WS+ 'rawunicodeencoding' WS* '=' WS* PYTHONSTRINGLI=
> PEP 244 doesn't allow these spellings:  at most one atom is allowed after
> the directive name, and
>     = "whatever"
> isn't an atom.  Remove the '=' and PEP 244 is happy, though.  If you want to
> keep the "=", PEP 244 has to change.

True... would that pose a problem?
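
To make the two spellings under discussion concrete, here is a rough
sketch of a line matcher that accepts the directive with or without the
equals sign. It is written in present-day Python 3, which postdates this
thread. The function name and the regular expression are hypothetical,
and it assumes the right-hand side is a simple quoted string with no
escapes; neither PEP defines it this way.

```python
import re

# Hypothetical matcher for the proposed directive lines.
# Assumption: the value is a plain quoted string without escapes.
_DIRECTIVE = re.compile(
    r"^directive[ \t]+(unicodeencoding|rawunicodeencoding)"
    r"[ \t]*(=)?[ \t]*([\"'])(.*?)\3[ \t]*$"
)

def parse_directive(line):
    """Return (name, encoding, uses_equals) or None if the line does not match."""
    m = _DIRECTIVE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(4), m.group(2) is not None
```

The third element of the result only records which spelling was used, so
a caller could reject the form that PEP 244 does not permit.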
> I think that there should be a single directive for:
>  * unicode strings
>  * 8-bit strings
>  * comments
> If a user uses UTF-8 for 8-bit strings and Shift-JIS for Unicode, there
> is basically no text editor in the world that is going to do the right
> thing. And it isn't possible for a web server to properly associate an
> encoding. In general, it isn't a useful configuration.

Please don't mix 8-bit strings with Unicode literals: 8-bit
strings don't carry any encoding information, so there is nowhere
to store an encoding declared for them.

Comments, OTOH, are part of the program text, so they have to be ASCII
just like the Python source itself.

Note that it doesn't make sense to use an encoding that is not an
ASCII superset for the Unicode literals (as you and others have noted).
Since all builtin Python encodings are ASCII-supersets, this
shouldn't pose much of a problem, though ;-)
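
One way to check the ASCII-superset property for a given codec is
sketched below, in present-day Python 3 rather than the Python of this
thread. The helper name is made up for illustration, and the check only
looks at decoding: it asks whether the 128 ASCII bytes decode to the
same 128 code points.

```python
def is_ascii_superset(encoding):
    """Return True if bytes 0..127 decode to code points 0..127 (hypothetical helper)."""
    ascii_bytes = bytes(range(128))
    expected = "".join(chr(i) for i in range(128))
    try:
        return ascii_bytes.decode(encoding) == expected
    except UnicodeDecodeError:
        # The codec rejects some plain ASCII byte sequence outright.
        return False
```

A compiler could apply such a test to the declared encoding before
accepting it, for example `is_ascii_superset("latin-1")`.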
> Also, no matter what the directive says, I think that \uXXXX should
> continue to work. Just as in 8-bit strings, it should be possible to mix
> and match direct encoded input and backslash-escaped characters.
> Sometimes one is convenient (because of your keyboard setup) and
> sometimes the other is convenient. This proposal exists only to improve
> typing convenience so we should go all the way and allow both.

Hmm, good point, but hard to implement. We'd probably need a two
phase decoding for this to work:

1. decode the given Unicode literal encoding
2. decode any Unicode escapes in the Unicode string
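
The two phases above can be sketched as follows. This is only an
illustration in present-day Python 3, not how the compiler would do it:
the function name is hypothetical, only \uXXXX escapes are handled, and
a doubled backslash is left alone so that it does not start an escape.

```python
import re

# Matches either an escaped backslash (left untouched) or a \uXXXX escape.
_ESCAPE = re.compile(r"\\\\|\\u([0-9a-fA-F]{4})")

def decode_unicode_literal(raw, encoding):
    """Decode the body of a Unicode literal in two phases (hypothetical helper)."""
    # Phase 1: decode the raw source bytes using the declared encoding.
    text = raw.decode(encoding)

    # Phase 2: expand \uXXXX escapes in the decoded text.
    def expand(match):
        digits = match.group(1)
        if digits is None:
            return match.group(0)  # escaped backslash: keep as written
        return chr(int(digits, 16))

    return _ESCAPE.sub(expand, text)
```

With this ordering, directly encoded characters and backslash escapes
can be mixed in one literal: the Latin-1 byte 0xE9 and the escape
\u20ac both end up as single characters in the result.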
> I strongly think we should restrict the directive to one per file and in
> fact I would say it should be one of the first two lines. It should be
> immediately following the shebang line if there is one. This is to allow
> text editors to detect it as they detect XML encoding declarations.
> My opinions are influenced by the fact that I've helped implement
> Unicode support in a Python/XML editor. XML makes it easy to give the
> user a good experience. Python could too if we are careful.

I think that allowing one directive per file is the way to go,
but I'm not sure about the exact position. Basically, I think it
should go "near" the top, but not necessarily before any doc-string
in the file.
> [Guido]
> > Hm, then the directive would syntactically have to *precede* the
> > docstring.  That currently doesn't work -- the docstring may only be
> > preceded by blank lines and comments.  Lots of tools for processing
> > docstrings already have this built into them.  Is it worth breaking
> > them so that editors can remain stupid?
> No.


Note that the PEP doesn't require the directive to be placed before the
doc-string. That point is still open. Technically, the compiler
will only need to know about the encoding before the first
Unicode literal in the source file.

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/