[Python-checkins] peps: Update to current object layout.

Sept. 25, 2011

http://hg.python.org/peps/rev/a97dfa0fa127
changeset:   3944:a97dfa0fa127
user:        Martin v. Löwis <martin@v.loewis.de>
date:        Sun Sep 25 22:58:13 2011 +0200
summary:
  Update to current object layout.

files:
  pep-0393.txt |  191 ++++++++++++++++++++++----------------
  1 files changed, 112 insertions(+), 79 deletions(-)

diff --git a/pep-0393.txt b/pep-0393.txt
--- a/pep-0393.txt
+++ b/pep-0393.txt
@@ -47,52 +47,88 @@
 For many strings (e.g. ASCII), multiple representations may actually
 share memory (e.g. the shortest form may be shared with the UTF-8 form
 if all characters are ASCII). With such sharing, the overhead of
-compatibility representations is reduced.
+compatibility representations is reduced. If representations do share
+data, it is also possible to omit structure fields, reducing the base
+size of string objects.
 
 Specification
 =============
 
-The Unicode object structure is changed to this definition::
+Unicode structures are now defined as a hierarchy of structures,
+namely::
 
   typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
+    Py_hash_t hash;
+    struct {
+        unsigned int interned:2;
+        unsigned int kind:2;
+        unsigned int compact:1;
+        unsigned int ascii:1;
+        unsigned int ready:1;
+    } state;
+    wchar_t *wstr;
+  } PyASCIIObject;
+
+  typedef struct {
+    PyASCIIObject _base;
+    Py_ssize_t utf8_length;
+    char *utf8;
+    Py_ssize_t wstr_length;
+  } PyCompactUnicodeObject;
+
+  typedef struct {
+    PyCompactUnicodeObject _base;
     union {
         void *any;
         Py_UCS1 *latin1;
         Py_UCS2 *ucs2;
         Py_UCS4 *ucs4;
     } data;
-    Py_hash_t hash;
-    int state;
-    Py_ssize_t utf8_length;
-    void *utf8;
-    Py_ssize_t wstr_length;
-    void *wstr;
   } PyUnicodeObject;
 
-These fields have the following interpretations:
+Objects for which both size and maximum character are known at
+creation time are called "compact" unicode objects; character data
+immediately follow the base structure. If the maximum character is
+less than 128, they use the PyASCIIObject structure, and the UTF-8
+data, the UTF-8 length and the wstr length are the same as the length
+and the ASCII data. For non-ASCII strings, the PyCompactUnicodeObject
+structure is used. Resizing compact objects is not supported.
+
+Objects for which the maximum character is not given at creation time
+are called "legacy" objects, created through
+PyUnicode_FromStringAndSize(NULL, length). They use the
+PyUnicodeObject structure. Initially, their data is only in the wstr
+pointer; when PyUnicode_READY is called, the data pointer (union) is
+allocated. Resizing is possible as long as PyUnicode_READY has not been
+called.
+
+The fields have the following interpretations:
 
 - length: number of code points in the string (result of sq_length)
-- data: shortest-form representation of the unicode string.
-  The string is null-terminated (in its respective representation).
-- hash: same as in Python 3.2
-- state:
-
-  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
-  * next 2 bits (mask 0x0C) - form of str:
-
+- interned: interned-state (SSTATE_*) as in 3.2
+- kind: form of string
     + 00 => str is not initialized (data are in wstr)
     + 01 => 1 byte (Latin-1)
     + 10 => 2 byte (UCS-2)
     + 11 => 4 byte (UCS-4);
-
-  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject  
-
-- utf8_length, utf8: UTF-8 representation (null-terminated)
+- compact: the object uses one of the compact representations
+  (implies ready)
+- ascii: the object uses the PyASCIIObject representation
+  (implies compact and ready)
+- ready: the canonical representation is ready to be accessed through
+  PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the
+  object is compact, or the data pointer and length have been
+  initialized.
 - wstr_length, wstr: representation in platform's wchar_t
   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
   pairs (in which case wstr_length differs from length).
+  wstr_length differs from length only if there are surrogate pairs
+  in the representation.
+- utf8_length, utf8: UTF-8 representation (null-terminated).
+- data: shortest-form representation of the unicode string.
+  The string is null-terminated (in its respective representation).
 
 All three representations are optional, although the data form is
 considered the canonical representation which can be absent only
@@ -111,10 +147,6 @@
 BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
 non-BMP characters if sizeof(wchar_t) is 4).
 
-If the string is created directly with the canonical representation
-(see below), this representation doesn't take a separate memory block,
-but is allocated right after the PyUnicodeObject struct.
-
 String Creation
 ---------------
 
@@ -140,12 +172,11 @@
 or implicitly). Resizing a Unicode string remains possible until it
 is finalized.
 
-PyUnicode_Ready() converts a string containing only a wstr
+PyUnicode_READY() converts a string containing only a wstr
 representation into the canonical representation. Unless wstr and data
 can share the memory, the wstr representation is discarded after the
-conversion. PyUnicode_FAST_READY() is a wrapper that avoids the 
-function call if the string is already ready. Both APIs return 0
-on success and -1 on failure.
+conversion. The macro returns 0 on success and -1 on failure, which
+happens in particular if the memory allocation fails.
 
 String Access
 -------------
@@ -175,9 +206,6 @@
 converts a string to a char* (such as the ParseTuple functions) will
 use PyUnicode_AsUTF8 to compute a conversion.
 
-PyUnicode_AsUnicode is deprecated; it computes the wstr representation
-on first use.
-
 Stable ABI
 ----------
 
@@ -189,27 +217,37 @@
 about the internals of CPython's data types, include PyUnicodeObject
 instances.  It will need to be slightly updated to track the change.
 
+Deprecations, Removals, and Incompatibilities
+---------------------------------------------
+
+While the Py_UNICODE representation and APIs are deprecated with this
+PEP, no removal of the respective APIs is scheduled. The APIs should
+remain available at least five years after the PEP is accepted; before
+they are removed, existing extension modules should be studied to find
+out whether a sufficient majority of the open-source code on PyPI has
+been ported to the new API. A reasonable motivation for using the
+deprecated API even in new code is for code that shall work both on
+Python 2 and Python 3.
+
+_PyUnicode_AsDefaultEncodedString is removed. It previously returned a
+borrowed reference to a UTF-8-encoded bytes object. Since the unicode
+object can no longer cache such a reference, implementing it without
+leaking memory is not possible. No deprecation phase is provided,
+since it was an API for internal use only.
+
+Extension modules using the legacy API may inadvertently call
+PyUnicode_READY, by calling some API that requires that the object is
+ready, and then continue accessing the (now invalid) Py_UNICODE
+pointer. Such code will break with this PEP. The code was already
+flawed in 3.2, as there was no explicit guarantee that the
+PyUnicode_AS_UNICODE result would stay valid after an API call (due to
+the possibility of string resizing). Modules that face this issue
+need to re-fetch the Py_UNICODE pointer after API calls; doing
+so will continue to work correctly in earlier Python versions.
+
 Open Issues
 ===========
 
-- When an application uses the legacy API, it may hold onto
-  the Py_UNICODE* representation, and yet start calling Unicode
-  APIs, which would call PyUnicode_Ready, invalidating the 
-  Py_UNICODE* representation; this would be an incompatible change.
-  The following solutions can be considered:
-
-  * accept it as an incompatible change. Applications using the
-    legacy API will have to fill out the Py_UNICODE buffer completely
-    before calling any API on the string under construction.
-  * require explicit PyUnicode_Ready calls in such applications;
-    fail with a fatal error if a non-ready string is ever read.
-    This would also be an incompatible change, but one that is
-    more easily detected during testing.
-  * as a compromise between these approaches, implicit PyUnicode_Ready
-    calls (i.e. those not deliberately following the construction of
-    a PyUnicode object) could produce a warning if they convert an
-    object.
-
 - Which of the APIs created during the development of the PEP should
   be public?
 
@@ -226,11 +264,6 @@
 applications that care about this problem can be rewritten to use the
 data representation.
 
-The question was raised whether the wchar_t representation is
-discouraged, or scheduled for removal. This is not the intent of this
-PEP; applications that use them will see a performance penalty,
-though. Future versions of Python may consider to remove them.
-
 Performance
 -----------
 
@@ -240,31 +273,31 @@
 a reduction in memory usage. For small strings, the effects depend on
 the pointer size of the system, and the size of the Py_UNICODE/wchar_t
 type. The following table demonstrates this for various small ASCII
-string sizes and platforms.
+and Latin-1 string sizes and platforms.
 
-+-------+---------------------------------+----------------+
-|string | Python 3.2                      | This PEP       |
-|size   +----------------+----------------+                |
-|       | 16-bit wchar_t | 32-bit wchar_t |                |
-|       +---------+------+--------+-------+--------+-------+
-|       | 32-bit  |64-bit| 32-bit |64-bit | 32-bit |64-bit |
-+-------+---------+------+--------+-------+--------+-------+
-|1      | 40      | 64   | 40     |  64   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|2      | 40      | 64   | 48     |  72   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|3      | 40      | 64   | 48     |  72   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|4      | 48      | 72   | 56     |  80   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|5      | 48      | 72   | 56     |  80   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|6      | 48      | 72   | 64     |  88   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|7      | 48      | 72   | 64     |  88   | 48     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
-|8      | 56      | 80   | 72     |  96   | 56     | 88    |
-+-------+---------+------+--------+-------+--------+-------+
++-------+---------------------------------+---------------------------------+
+|string | Python 3.2                      | This PEP                        |
+|size   +----------------+----------------+----------------+----------------+
+|       | 16-bit wchar_t | 32-bit wchar_t |   ASCII        |   Latin-1      |
+|       +---------+------+--------+-------+--------+-------+--------+-------+
+|       | 32-bit  |64-bit| 32-bit |64-bit | 32-bit |64-bit | 32-bit |64-bit |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|1      | 32      | 64   | 40     |  64   | 32     | 56    | 40     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|2      | 40      | 64   | 40     |  72   | 32     | 56    | 40     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|3      | 40      | 64   | 48     |  72   | 32     | 56    | 40     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|4      | 40      | 72   | 48     |  80   | 32     | 56    | 48     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|5      | 40      | 72   | 56     |  80   | 32     | 56    | 48     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|6      | 48      | 72   | 56     |  88   | 32     | 56    | 48     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|7      | 48      | 72   | 64     |  88   | 32     | 56    | 48     | 80    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
+|8      | 48      | 80   | 64     |  96   | 40     | 64    | 48     | 88    |
++-------+---------+------+--------+-------+--------+-------+--------+-------+
 
 The runtime effect is significantly affected by the API being
 used. After porting the relevant pieces of code to the new API,

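The "This PEP" columns of the size table in the diff can be reproduced with a little arithmetic. The sketch below is not taken from the changeset: the helper name compact_alloc_size is hypothetical, and the rounding to a multiple of 8 bytes is an assumption about the allocator. It also assumes that PyObject_HEAD occupies two pointer-sized words and that the state bitfield occupies 4 bytes, which gives header sizes of 24/48 bytes (PyASCIIObject, 32-bit/64-bit) and 36/72 bytes (PyCompactUnicodeObject).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PEP or of CPython.
 * Estimates the memory block size of a compact string whose character
 * data immediately follows the header structure.
 *
 * Assumption: the allocator rounds every request up to a multiple of 8. */
static size_t compact_alloc_size(size_t header_size, size_t length,
                                 size_t char_size)
{
    assert(char_size == 1 || char_size == 2 || char_size == 4);
    /* header, then the characters plus one terminating null character */
    size_t raw = header_size + (length + 1) * char_size;
    /* round up to the next multiple of 8 bytes */
    return (raw + 7) & ~(size_t)7;
}
```

Under these assumptions, a 1-character ASCII string on a 64-bit system needs 48 + 2 = 50 bytes, rounded up to 56, and an 8-character Latin-1 string needs 72 + 9 = 81 bytes, rounded up to 88, which agrees with the table.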
-- 
Repository URL: http://hg.python.org/peps
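
As an illustration of how the maximum character determines the representation described in the diff, here is a minimal sketch. The enum and function names are hypothetical and do not appear in the changeset; only the kind values (00, 01, 10, 11) and the "less than 128" rule for PyASCIIObject come from the PEP text. The thresholds 256 and 65536 are the usual Latin-1 and UCS-2 limits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the numeric values mirror the 2-bit "kind" field. */
enum sketch_kind {
    SKETCH_KIND_WSTR  = 0,  /* 00: not initialized, data are in wstr */
    SKETCH_KIND_1BYTE = 1,  /* 01: 1 byte per character (Latin-1)    */
    SKETCH_KIND_2BYTE = 2,  /* 10: 2 bytes per character (UCS-2)     */
    SKETCH_KIND_4BYTE = 3   /* 11: 4 bytes per character (UCS-4)     */
};

/* Pick the shortest form able to hold every character up to maxchar. */
static enum sketch_kind sketch_kind_for_maxchar(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);  /* highest Unicode code point */
    if (maxchar < 256)
        return SKETCH_KIND_1BYTE;
    if (maxchar < 65536)
        return SKETCH_KIND_2BYTE;
    return SKETCH_KIND_4BYTE;
}

/* A compact string uses PyASCIIObject only if maxchar is below 128;
 * other compact strings use PyCompactUnicodeObject. */
static int sketch_uses_ascii_struct(uint32_t maxchar)
{
    return maxchar < 128;
}
```

Note that an ASCII string and a Latin-1 string share the same 1-byte kind; they differ only in which structure precedes the character data.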

    
