[Python-Dev] PEP: Defining Unicode Literal Encodings (revision 1.1)

M.-A. Lemburg mal@lemburg.com
Sat, 14 Jul 2001 00:21:32 +0200

Here's an updated version which clarifies some issues...


PEP: 0263 (?)
Title: Defining Unicode Literal Encodings
Version: $Revision: 1.1 $
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Requires: 244


    This PEP proposes to use the PEP 244 statement "directive" to make
    the encoding used in Unicode string literals u"..." (and their raw
    counterparts ur"...") definable on a per source file basis.


    In Python 2.1, Unicode literals can only be written using the
    Latin-1 based encoding "unicode-escape". This makes the
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales such as many of the Asian
    countries. Programmers can write their 8-bit strings using their
    favourite encoding, but are bound to the "unicode-escape" encoding
    for Unicode literals.

Proposed Solution

    I propose to make the Unicode literal encodings (both standard and
    raw) a per-source file option which can be set using the
    "directive" statement proposed in PEP 244 in a slightly extended
    form (by adding the '=' between the directive name and its value).


    The syntax for the directives is as follows:

    'directive' WS+ 'unicodeencoding' WS* '=' WS* PYTHONSTRINGLITERAL
    'directive' WS+ 'rawunicodeencoding' WS* '=' WS* PYTHONSTRINGLITERAL

    with the PYTHONSTRINGLITERAL representing the encoding name to be
    used as standard Python 8-bit string literal and WS being the
    whitespace characters [ \t].
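
    As an illustration only, the grammar above can be approximated
    with a regular expression. This is a sketch, not the proposed
    compiler implementation: the names DIRECTIVE_RE and
    parse_encoding_directive are hypothetical, and the string literal
    is simplified to a plain quoted string without escapes or
    prefixes.

```python
import re

# WS as defined in the text: space or tab only.
WS = r"[ \t]"

# Simplifying assumption: the encoding name is a single- or double-quoted
# string without backslash escapes or embedded newlines.
LITERAL = r"""(?:'[^'\\\n]*'|"[^"\\\n]*")"""

# Hypothetical pattern mirroring the two directive forms.
DIRECTIVE_RE = re.compile(
    r"^directive" + WS + r"+(?P<name>unicodeencoding|rawunicodeencoding)"
    + WS + r"*=" + WS + r"*(?P<value>" + LITERAL + r")" + WS + r"*$"
)


def parse_encoding_directive(line):
    """Return (directive_name, encoding_name), or None if the line does not match."""
    match = DIRECTIVE_RE.match(line)
    if match is None:
        return None
    # Strip the surrounding quotes from the literal.
    return match.group("name"), match.group("value")[1:-1]
```

    For example, parse_encoding_directive("directive unicodeencoding =
    'utf-8'") yields the pair ("unicodeencoding", "utf-8"), while a
    line lacking whitespace after 'directive' is rejected.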


    Whenever the Python compiler sees such an encoding directive
    during the compiling process, it updates an internal flag which
    holds the encoding name used for the specific literal form. The
    encoding name flags are initialized to "unicode-escape" for u"..."
    literals and "raw-unicode-escape" for ur"..." respectively.

         Maybe we should restrict the directive usage to once per file
         and additionally to a placement before the first Unicode literal
         in the source file.

         (Comments suggest that this approach suits the goal best.)

    If the Python compiler has to convert a Unicode literal to a
    Unicode object, it will pass the 8-bit string data given by the
    literal to the Python codec registry and have it decode the data
    using the current setting of the encoding name flag for the
    requested type of Unicode literal. It then checks that the result
    of the decoding operation is a Unicode object and stores it
    in the byte code stream.
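
    The flag handling and decoding step described above might look
    roughly like the following. This is a minimal sketch under stated
    assumptions, not the compiler's actual code: DEFAULT_FLAGS,
    apply_directive and decode_literal are hypothetical names, and the
    sketch is written for a modern Python where str is the Unicode
    type and literal data arrives as bytes.

```python
import codecs

# Initial flag values as stated in the text.
DEFAULT_FLAGS = {
    "unicodeencoding": "unicode-escape",
    "rawunicodeencoding": "raw-unicode-escape",
}


def apply_directive(flags, name, encoding):
    """Return a copy of flags with one encoding directive applied."""
    if name not in flags:
        raise ValueError("unknown directive: %r" % (name,))
    codecs.lookup(encoding)  # raises LookupError if the codec is not registered
    updated = dict(flags)
    updated[name] = encoding
    return updated


def decode_literal(data, flags, raw=False):
    """Decode the 8-bit data of a u"..." (or ur"...") literal via the codec registry."""
    key = "rawunicodeencoding" if raw else "unicodeencoding"
    result = codecs.decode(data, flags[key])
    # The text requires the decoded result to be a Unicode object.
    if not isinstance(result, str):
        raise TypeError("codec %r did not return a Unicode object" % (flags[key],))
    return result
```

    With the default flags, decode_literal(b"caf\\xe9", DEFAULT_FLAGS)
    interprets the escape sequence; after applying a directive naming
    "utf-8", the same function decodes raw UTF-8 bytes instead.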

    Since Python source code is defined to be ASCII, the Unicode literal
    encodings (both standard and raw) should be supersets of ASCII and
    match the encoding used elsewhere in the program text, e.g. in
    comments and maybe even 8-bit strings (even though their encoding
    is only implicit and completely under the programmer's control).
    It is the responsibility of the programmer to choose reasonable


    This PEP only affects Python source code which makes use of the
    proposed directives. It does not affect the coercion handling of
    8-bit strings and Unicode in the given module.


    This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/