[Python-Dev] PEP: Defining Unicode Literal Encodings
M.-A. Lemburg
mal@lemburg.com
Fri, 13 Jul 2001 14:03:29 +0200
Please comment...
--
PEP: 0263 (?)
Title: Defining Unicode Literal Encodings
Version: $Revision: 1.0 $
Author: mal@lemburg.com (Marc-Andr=E9 Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:=20
Abstract
This PEP proposes to use the PEP 244 statement "directive" to make
the encoding used in Unicode string literals u"..." (and their raw
counterparts ur"...") definable on a per source file basis.
Problem
In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales such as many of the eastern
countries. Programmers can write their 8-bit strings using the
favourite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.
Proposed Solution
I propose to make the Unicode literal encodings (both standard and
raw) a per-source file option which can be set using the
"directive" statement proposed in PEP 244.
Syntax
The syntax for the directives is as follows:
'directive' WS+ 'unicodeencoding' WS* '=3D' WS* PYTHONSTRINGLITERAL
'directive' WS+ 'rawunicodeencoding' WS* '=3D' WS* PYTHONSTRINGLITERA=
L
with the PYTHONSTRINGLITERAL representing the encoding name to be
used as standard Python 8-bit string literal and WS being the
whitespace characters [ \t].
Semantics
Whenever the Python compiler sees such an encoding directive
during the compiling process, it updates an internal flag which
holds the encoding name used for the specific literal form. The
encoding name flags are initialized to "unicode-escape" for u"..."=20
literals and "raw-unicode-escape" for ur"..." respectively.
ISSUE:
Maybe we should restrict the directive usage to once per file
and additionally to a placement before the first Unicode
literal=20
in the source file.
If the Python compiler has to convert a Unicode literal to a
Unicode object, it will pass the 8-bit string data given by the
literal to the Python codec registry and have it decode the data
using the current setting of the encoding name flag for the
requested type of Unicode literal. It then checks the result of
the decoding operation for being an Unicode object and stores it
in the byte code stream.
Scope
This PEP only affects Python source code which makes use of the
proposed directives. It does not affect the coercion handling of
8-bit strings and Unicode in the given module.
Copyright
This document has been placed in the public domain.
=0C
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
--=20
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/