[Python-Dev] PEP 393: Flexible String Representation

Sat Jan 29 01:54:08 CET 2011

Pardon me for this drive-by posting, but this thread smells a lot like this
old thread (don't be afraid to read it all, there are some good points in
there; not directed at you Martin, but at all readers/posters in this
thread)...

http://mail.python.org/pipermail/python-3000/2006-September/003795.html

I'm not averse to faster and/or more memory-efficient unicode
representations (I would be quite happy with them, actually). I do see the
usefulness of having non-utf-8 representations, and caching them is a good
idea, though I wonder whether that is "good for Python itself to cache" or
"good for the application to cache".

The evil side of me says that we should just provide an API available in
Python/C for "give me the representation of unicode string X using the
2byte/4byte code points", and have it just return the appropriate
array.array() value (useful for passing to other APIs, or for those who need
to do manual manipulation of code-points), or whatever structure is deemed
to be appropriate.

The less evil side of me says that going with what the PEP offers isn't a
bad idea, and might just be a good idea.

I'll defer my vote to Martin.

Regards,
 - Josiah

On Mon, Jan 24, 2011 at 12:17 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> I have been thinking about Unicode representation for some time now.
> This was triggered, on the one hand, by discussions with Glyph Lefkowitz
> (who complained that his server app consumes too much memory), and Carl
> Friedrich Bolz (who profiled Python applications to determine that
> Unicode strings are among the top consumers of memory in Python).
> On the other hand, this was triggered by the discussion on supporting
> surrogates in the library better.
>
> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).
>
> You'll find the PEP at
>
> http://www.python.org/dev/peps/pep-0393/
>
> For convenience, I include it below.
>
> Regards,
> Martin
>
> PEP: 393
> Title: Flexible String Representation
> Version: $Revision: 88168 $
> Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
> Author: Martin v. Löwis <martin at v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 24-Jan-2010
> Python-Version: 3.3
> Post-History:
>
> Abstract
> ========
>
> The Unicode string type is changed to support multiple internal
> representations, depending on the character with the largest Unicode
> ordinal (1, 2, or 4 bytes). This will allow a space-efficient
> representation in common cases, but give access to full UCS-4 on all
> systems. For compatibility with existing APIs, several representations
> may exist in parallel; over time, this compatibility should be phased
> out.
>
> Rationale
> =========
>
> There are two classes of complaints about the current implementation
> of the unicode type: on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character,
> while strings containing non-BMP characters can still be indexed
> efficiently (as such strings will use 4 bytes per character).
>
> One problem with the approach is support for existing applications
> (e.g. extension modules). For compatibility, redundant representations
> may be computed. Applications are encouraged to phase out reliance on
> a specific internal representation if possible. As interaction with
> other libraries will often require some sort of internal
> representation, the specification chooses UTF-8 as the recommended way
> of exposing strings to C code.
>
> For many strings (e.g. ASCII), multiple representations may actually
> share memory (e.g. the shortest form may be shared with the UTF-8 form
> if all characters are ASCII). With such sharing, the overhead of
> compatibility representations is reduced.
>
> Specification
> =============
>
> The Unicode object structure is changed to this definition::
>
>  typedef struct {
>    PyObject_HEAD
>    Py_ssize_t length;
>    void *str;
>    Py_hash_t hash;
>    int state;
>    Py_ssize_t utf8_length;
>    void *utf8;
>    Py_ssize_t wstr_length;
>    void *wstr;
>  } PyUnicodeObject;
>
> These fields have the following interpretations:
>
> - length: number of code points in the string (result of sq_length)
> - str: shortest-form representation of the unicode string; the lower
>  two bits of the pointer indicate the specific form:
>  01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
>  00 => null pointer
>
>  The string is null-terminated (in its respective representation).
> - hash, state: same as in Python 3.2
> - utf8_length, utf8: UTF-8 representation (null-terminated)
> - wstr_length, wstr: representation in platform's wchar_t
>  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
>  pairs (in which case wstr_length differs from length).
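
The tagged str pointer described in the list above can be sketched roughly as
follows. This is only an illustration, not the PEP's implementation: the
names tag_ptr, ptr_kind and ptr_data are hypothetical, the 2-byte tag is
assumed to be binary 10 (matching the value 2 given for PyUnicode_2BYTE under
String Access), and the data pointer is assumed to be at least 4-byte aligned
so that its low two bits are free.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags kept in the low two bits of the str pointer. */
enum { TAG_NULL = 0, TAG_1BYTE = 1, TAG_2BYTE = 2, TAG_4BYTE = 3 };

#define TAG_MASK ((uintptr_t)3)

/* Combine a data pointer (assumed 4-byte aligned) with a form tag. */
static void *tag_ptr(void *data, int tag)
{
    return (void *)((uintptr_t)data | (uintptr_t)tag);
}

/* Extract the form tag from a tagged pointer. */
static int ptr_kind(const void *tagged)
{
    return (int)((uintptr_t)tagged & TAG_MASK);
}

/* Recover the real data pointer by masking out the tag bits. */
static void *ptr_data(const void *tagged)
{
    return (void *)((uintptr_t)tagged & ~TAG_MASK);
}
```

The cost of this design is that every access has to mask the pointer first,
which is presumably why the PEP hides it behind accessor macros.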
>
> All three representations are optional, although the str form is
> considered the canonical representation which can be absent only
> while the string is being created.
>
> The Py_UNICODE type is still supported but deprecated. It is always
> defined as a typedef for wchar_t, so the wstr representation can double
> as Py_UNICODE representation.
>
> The str and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient). The str
> and wstr pointers point to the same memory if the string happens to
> fit exactly to the wchar_t type of the platform (i.e. uses some
> BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
> non-BMP characters if sizeof(wchar_t) is 4).
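
The sharing rules in the paragraph above reduce to two small predicates on the
largest code point in the string. The sketch below is one reading of that
paragraph; the function names are hypothetical and do not appear in the PEP.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Bytes per character of the shortest form that can hold maxchar. */
static size_t canonical_width(uint32_t maxchar)
{
    if (maxchar < 0x100)
        return 1;               /* Latin-1 */
    if (maxchar < 0x10000)
        return 2;               /* UCS-2 (BMP only) */
    return 4;                   /* UCS-4 */
}

/* str and utf8 can share memory only for pure ASCII: Latin-1 code
   points from 0x80 upward need two bytes in UTF-8. */
static int can_share_utf8(uint32_t maxchar)
{
    return maxchar < 0x80;
}

/* str and wstr can share memory when the canonical width is exactly
   the platform's wchar_t width. */
static int can_share_wstr(uint32_t maxchar)
{
    return canonical_width(maxchar) == sizeof(wchar_t);
}
```

For example, a string whose largest character is U+00E9 is stored in one byte
per character but cannot share its buffer with the UTF-8 form.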
>
> If the string is created directly with the canonical representation
> (see below), this representation doesn't take a separate memory block,
> but is allocated right after the PyUnicodeObject struct.
>
> String Creation
> ---------------
>
> The recommended way to create a Unicode object is to use the function
> PyUnicode_New::
>
>   PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);
>
> Both parameters must denote the eventual size/range of the strings.
> In particular, codecs using this API must compute both the number of
> characters and the maximum character in advance. A string is
> allocated according to the specified size and character range and is
> null-terminated; the actual characters in it may be uninitialized.
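
To make the "compute both in advance" requirement concrete, here is a sketch
of the kind of pre-scan a UTF-8 decoder might run before calling
PyUnicode_New. The helper name utf8_prescan is hypothetical, and the code
assumes well-formed input; a real codec would also have to reject invalid
sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan n bytes of well-formed UTF-8 and report the number of code
   points and the largest code point seen. No validation is done. */
static void utf8_prescan(const unsigned char *s, size_t n,
                         size_t *count, uint32_t *maxchar)
{
    size_t i = 0;

    *count = 0;
    *maxchar = 0;
    while (i < n) {
        unsigned char b = s[i++];
        uint32_t ch;
        size_t extra;           /* continuation bytes still expected */

        if (b < 0x80)      { ch = b;        extra = 0; }
        else if (b < 0xE0) { ch = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { ch = b & 0x0F; extra = 2; }
        else               { ch = b & 0x07; extra = 3; }

        /* Fold in six payload bits from each continuation byte. */
        for (; extra > 0 && i < n; extra--)
            ch = (ch << 6) | (uint32_t)(s[i++] & 0x3F);

        if (ch > *maxchar)
            *maxchar = ch;
        (*count)++;
    }
}
```

The two results would then be passed as the size and maxchar arguments, after
which the decoder makes a second pass to fill in the characters.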
>
> PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
> for processing UTF-8 input; the input is decoded, and the UTF-8
> representation is not yet set for the string.
>
> PyUnicode_FromUnicode remains supported but is deprecated. If the
> Py_UNICODE pointer is non-null, the str representation is set. If the
> pointer is NULL, a properly-sized wstr representation is allocated,
> which can be modified until PyUnicode_Finalize() is called (explicitly
> or implicitly). Resizing a Unicode string remains possible until it
> is finalized.
>
> PyUnicode_Finalize() converts a string containing only a wstr
> representation into the canonical representation. Unless wstr and str
> can share the memory, the wstr representation is discarded after the
> conversion.
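
When wchar_t is 16 bits wide, a conversion like this would presumably have to
scan the wstr buffer first, pairing up surrogates to find the code point count
and the largest code point. The sketch below shows only that scan; the name
utf16_prescan is hypothetical, uint16_t stands in for a 16-bit wchar_t, and
unpaired surrogates are simply counted as ordinary code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan n UTF-16 code units and report the number of code points
   (length) and the largest code point (maxchar). */
static void utf16_prescan(const uint16_t *w, size_t n,
                          size_t *length, uint32_t *maxchar)
{
    size_t i = 0;

    *length = 0;
    *maxchar = 0;
    while (i < n) {
        uint32_t ch = w[i++];

        /* A high surrogate followed by a low surrogate encodes one
           code point beyond the BMP. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i < n &&
            w[i] >= 0xDC00 && w[i] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (uint32_t)(w[i] - 0xDC00);
            i++;
        }
        if (ch > *maxchar)
            *maxchar = ch;
        (*length)++;
    }
}
```

This also shows why wstr_length and length can differ: a surrogate pair takes
two code units in wstr but counts as one code point in length.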
>
> String Access
> -------------
>
> The canonical representation can be accessed using two macros,
> PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
> values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
> (3). PyUnicode_Data gives the void pointer to the data, masking out
> the pointer kind. All these functions call PyUnicode_Finalize
> in case the canonical representation hasn't been computed yet.
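
As a sketch of how client code might use the kind and data values together,
the following reads the code point at a given index. The function read_char
and the KIND_* constants are hypothetical stand-ins; only the numeric values
1, 2 and 3 are taken from the paragraph above.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PyUnicode_1BYTE, PyUnicode_2BYTE and PyUnicode_4BYTE. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Read the code point at index from an untagged data pointer whose
   element width is selected by kind. No bounds checking is done. */
static uint32_t read_char(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE:
        return ((const uint8_t *)data)[index];
    case KIND_2BYTE:
        return ((const uint16_t *)data)[index];
    default:
        return ((const uint32_t *)data)[index];
    }
}
```

Because every form is fixed-width, indexing stays O(1) regardless of kind;
the price is a branch on the kind for each access.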
>
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible
> (which generates a new string object every time). APIs that implicitly
> convert a string to a char* (such as the ParseTuple functions) will
> use this function to compute the conversion.
>
> PyUnicode_AsUnicode is deprecated; it computes the wstr representation
> on first use.
>
> String Operations
> -----------------
>
> Various convenience functions will be provided to deal with the
> canonical representation, in particular with respect to concatenation
> and slicing.
>
> Stable ABI
> ----------
>
> None of the functions in this PEP become part of the stable ABI.
>
> Copyright
> =========
>
> This document has been placed in the public domain.