[Python-checkins] r88167 - peps/trunk/pep-0393.txt

Mon Jan 24 21:00:09 CET 2011

Author: martin.v.loewis
Date: Mon Jan 24 21:00:09 2011
New Revision: 88167

Log:
Add PEP on Unicode strings.


Added:
   peps/trunk/pep-0393.txt   (contents, props changed)

Added: peps/trunk/pep-0393.txt
==============================================================================

--- (empty file)
+++ peps/trunk/pep-0393.txt	Mon Jan 24 21:00:09 2011
@@ -0,0 +1,181 @@
+PEP: 393
+Title: Flexible String Representation
+Version: $Revision: 87809 $
+Last-Modified: $Date: 2011-01-06 20:33:28 +0100 (Do, 06. Jan 2011) $
+Author: Martin v. Löwis <martin at v.loewis.de>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 24-Jan-2010
+Python-Version: 3.3
+Post-History:
+
+Abstract
+========
+
+The Unicode string type is changed to support multiple internal
+representations, depending on the character with the largest Unicode
+ordinal (1, 2, or 4 bytes). This will allow a space-efficient
+representation in common cases, but give access to full UCS-4 on all
+systems. For compatibility with existing APIs, several representations
+may exist in parallel; over time, this compatibility should be phased
+out.
+
+Rationale
+=========
+
+There are two classes of complaints about the current implementation
+of the unicode type: on systems only supporting UTF-16, users complain
+that non-BMP characters are not properly supported. On systems using
+UCS-4 internally (and also sometimes on systems using UCS-2), there is
+a complaint that Unicode strings take up too much memory - especially
+compared to Python 2.x, where the same code would often use ASCII
+strings (i.e. ASCII-encoded byte strings). With the proposed approach,
+ASCII-only Unicode strings will again use only one byte per character,
+while still allowing efficient indexing of strings containing non-BMP
+characters (as strings containing them will use 4 bytes per
+character).
+
+One problem with the approach is support for existing applications
+(e.g. extension modules). For compatibility, redundant representations
+may be computed. Applications are encouraged to phase out reliance on
+a specific internal representation if possible. As interaction with
+other libraries will often require some sort of internal
+representation, the specification chooses UTF-8 as the recommended way
+of exposing strings to C code.
+
+For many strings (e.g. ASCII), multiple representations may actually
+share memory (e.g. the shortest form may be shared with the UTF-8 form
+if all characters are ASCII). With such sharing, the overhead of
+compatibility representations is reduced.
+
+Specification
+=============
+
+The Unicode object structure is changed to this definition::
+
+  typedef struct {
+    PyObject_HEAD
+    Py_ssize_t length;
+    void *str;
+    Py_hash_t hash;
+    int state;
+    Py_ssize_t utf8_length;
+    void *utf8;
+    Py_ssize_t wstr_length;
+    void *wstr;
+  } PyUnicodeObject;
+
+These fields have the following interpretations:
+
+- length: number of code points in the string (result of sq_length)
+- str: shortest-form representation of the unicode string; the lower
+  two bits of the pointer indicate the specific form:
+  01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
+  00 => null pointer
+
+  The string is null-terminated (in its respective representation).
+- hash, state: same as in Python 3.2
+- utf8_length, utf8: UTF-8 representation (null-terminated)
+- wstr_length, wstr: representation in platform's wchar_t
+  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
+  pairs (in which case wstr_length differs from length).
+
+All three representations are optional, although the str form is
+considered the canonical representation which can be absent only
+while the string is being created.
+
+The Py_UNICODE type is still supported but deprecated. It is always
+defined as a typedef for wchar_t, so the wstr representation can double
+as Py_UNICODE representation.
+
+The str and utf8 pointers point to the same memory if the string uses
+only ASCII characters (using only Latin-1 is not sufficient). The str
+and wstr pointers point to the same memory if the string happens to
+fit exactly to the wchar_t type of the platform (i.e. uses some
+BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
+non-BMP characters if sizeof(wchar_t) is 4).
+
+If the string is created directly with the canonical representation
+(see below), this representation doesn't take a separate memory block,
+but is allocated right after the PyUnicodeObject struct.
+
+String Creation
+---------------
+
+The recommended way to create a Unicode object is to use the function
+PyUnicode_New::
+
+   PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);
+
+Both parameters must denote the eventual size/range of the string.
+In particular, codecs using this API must compute both the number of
+characters and the maximum character in advance. A string is
+allocated according to the specified size and character range and is
+null-terminated; the actual characters in it may be uninitialized.
+
+PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
+for processing UTF-8 input; the input is decoded, and the UTF-8
+representation is not yet set for the string.
+
+PyUnicode_FromUnicode remains supported but is deprecated. If the
+Py_UNICODE pointer is non-null, the str representation is set. If the
+pointer is NULL, a properly-sized wstr representation is allocated,
+which can be modified until PyUnicode_Finalize() is called (explicitly
+or implicitly). Resizing a Unicode string remains possible until it
+is finalized.
+
+PyUnicode_Finalize() converts a string containing only a wstr
+representation into the canonical representation. Unless wstr and str
+can share the memory, the wstr representation is discarded after the
+conversion.
+
+String Access
+-------------
+
+The canonical representation can be accessed using two macros,
+PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
+values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
+(3). PyUnicode_Data gives the void pointer to the data, masking out
+the pointer kind. Both macros call PyUnicode_Finalize in case the
+canonical representation hasn't been computed yet.
+
+A new function PyUnicode_AsUTF8 is provided to access the UTF-8
+representation. It is thus identical to the existing
+_PyUnicode_AsString, which is removed. The function will compute the
+utf8 representation when first called. Since this representation will
+consume memory until the string object is released, applications
+should use the existing PyUnicode_AsUTF8String where possible
+(which generates a new string object every time). APIs that implicitly
+convert a string to a char* (such as the ParseTuple functions) will
+use this function to compute a conversion.
+
+PyUnicode_AsUnicode is deprecated; it computes the wstr representation
+on first use.
+
+String Operations
+-----------------
+
+Various convenience functions will be provided to deal with the
+canonical representation, in particular with respect to concatenation
+and slicing.
+
+Stable ABI
+----------
+
+None of the functions in this PEP become part of the stable ABI.
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End: