[Python-Dev] PEP 393 Summer of Code Project

Tue Aug 23 14:14:39 CEST 2011

Torsten Becker, 22.08.2011 20:58:
> I have implemented an initial version of PEP 393 -- "Flexible String
> Representation" as part of my Google Summer of Code project.  My patch
> is hosted as a repository on bitbucket [1] and I created a related
> issue on the bug tracker [2].  I posted documentation for the current
> state of the development in the wiki [3].

One thing that occurred to me regarding the object struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;       /* Number of code points in the string */
     void *str;               /* Canonical, smallest-form Unicode buffer */
     Py_hash_t hash;          /* Hash value; -1 if not set */
     int state;               /* != 0 if interned. In this case the two
                               * references from the dictionary to this
                               * object are *not* counted in ob_refcnt.
                               * See SSTATE_KIND_* for other bits */
     Py_ssize_t utf8_length;  /* Number of bytes in utf8, excluding the
                               * terminating \0. */
     char *utf8;              /* UTF-8 representation (null-terminated) */
     Py_ssize_t wstr_length;  /* Number of code points in wstr, possible
                               * surrogates count as two code points. */
     wchar_t *wstr;           /* wchar_t representation (null-terminated) */
} PyUnicodeObject;

Wouldn't the "normal" approach be to use a union for the str field? I.e.

     union {
        unsigned char* latin1;
        Py_UCS2* ucs2;
        Py_UCS4* ucs4;
     } str;

Since all three members are pointers, they have the same size, so the
union takes no more space than the void* does. I find it more readable
to write

     u.str.latin1

than

     ((const unsigned char*)u.str)

Plus, the three types would be given by the struct, rather than by a 
per-usage cast.
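
To make the comparison concrete, here is a minimal sketch of both layouts side by side. It is not taken from the patch: the stripped-down structs, the kind enum, the ucs2_t/ucs4_t typedefs (stand-ins for Py_UCS2/Py_UCS4) and the two accessor functions are all hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for Py_UCS2 / Py_UCS4. */
typedef uint16_t ucs2_t;
typedef uint32_t ucs4_t;

/* Hypothetical tag saying which width the buffer uses. */
enum kind { KIND_LATIN1 = 1, KIND_UCS2 = 2, KIND_UCS4 = 4 };

/* Variant A: untyped buffer, as in the struct quoted above. */
typedef struct {
    size_t length;
    enum kind kind;
    void *str;
} str_void;

/* Variant B: the same buffer, but typed through a union. */
typedef struct {
    size_t length;
    enum kind kind;
    union {
        unsigned char *latin1;
        ucs2_t *ucs2;
        ucs4_t *ucs4;
    } str;
} str_union;

/* Variant A needs a cast at every point of use. */
static ucs4_t read_void(const str_void *s, size_t i)
{
    assert(i < s->length);
    switch (s->kind) {
    case KIND_LATIN1: return ((const unsigned char *)s->str)[i];
    case KIND_UCS2:   return ((const ucs2_t *)s->str)[i];
    default:          return ((const ucs4_t *)s->str)[i];
    }
}

/* Variant B picks the member; the element type comes from the struct. */
static ucs4_t read_union(const str_union *s, size_t i)
{
    assert(i < s->length);
    switch (s->kind) {
    case KIND_LATIN1: return s->str.latin1[i];
    case KIND_UCS2:   return s->str.ucs2[i];
    default:          return s->str.ucs4[i];
    }
}
```

Both accessors still have to switch on the kind, so the union does not remove any branching. It only moves the type information from the call site into the declaration.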

Has this been considered before? Was there a reason to decide against it?

Stefan