[Python-Dev] PEP 393: Special-casing ASCII-only strings

Thu Sep 15 17:50:41 CEST 2011

In reviewing memory usage, I found potential for saving more memory for
ASCII-only strings. Both Victor and Guido commented that something like
this should be done; Antoine had asked whether there was anything that
could be done. Here is the idea:

In an ASCII-only string, the UTF-8 representation is shared with the
canonical one-byte representation. This would allow dropping the
UTF-8 pointer and the UTF-8 length field; instead, a flag in the state
would indicate that these fields are not there.

Likewise, the wchar_t/Py_UNICODE length can be shared (even though the
data cannot), since the ASCII-only string won't contain any surrogate
pairs.

To comply with the C aliasing rules, the structures would look like this:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     union {
         void *any;
         Py_UCS1 *latin1;
         Py_UCS2 *ucs2;
         Py_UCS4 *ucs4;
     } data;
     Py_hash_t hash;
     int state;     /* may include SSTATE_SHORT_ASCII flag */
     wchar_t *wstr;
} PyASCIIObject;

typedef struct {
     PyASCIIObject _base;
     Py_ssize_t utf8_length;
     char *utf8;
     Py_ssize_t wstr_length;
} PyUnicodeObject;

Code that directly accesses the structures would become more
complex; code that uses the accessor macros wouldn't notice.
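
To illustrate why macro users wouldn't notice, here is a minimal sketch
of how accessors could branch on the flag. It is not the actual
implementation: every Sketch*/sketch_* name, the flag value, and the
stand-in structs (used so that it compiles without Python.h) are
assumptions made for illustration only.

```c
#include <stddef.h>
#include <wchar.h>

/* Stand-ins so the sketch compiles without Python.h (assumptions). */
typedef ptrdiff_t ssize_sketch_t;       /* stands in for Py_ssize_t */
typedef unsigned char ucs1_sketch_t;    /* stands in for Py_UCS1 */

#define SKETCH_SSTATE_SHORT_ASCII 0x1   /* hypothetical flag value */

typedef struct {
    ssize_sketch_t refcnt;              /* stand-in for PyObject_HEAD */
    void *type;
    ssize_sketch_t length;
    union { void *any; ucs1_sketch_t *latin1; } data;
    ssize_sketch_t hash;
    int state;
    wchar_t *wstr;
} SketchASCIIObject;

typedef struct {
    SketchASCIIObject _base;            /* first member, so casts are valid */
    ssize_sketch_t utf8_length;
    char *utf8;
    ssize_sketch_t wstr_length;
} SketchUnicodeObject;

#define SKETCH_IS_SHORT_ASCII(op) \
    ((((SketchASCIIObject *)(op))->state & SKETCH_SSTATE_SHORT_ASCII) != 0)

/* UTF-8 data: shared with the one-byte data when the flag is set. */
static const char *sketch_utf8(void *op)
{
    if (SKETCH_IS_SHORT_ASCII(op))
        return (const char *)((SketchASCIIObject *)op)->data.latin1;
    return ((SketchUnicodeObject *)op)->utf8;
}

/* UTF-8 length: equal to the character count for ASCII-only strings. */
static ssize_sketch_t sketch_utf8_length(void *op)
{
    if (SKETCH_IS_SHORT_ASCII(op))
        return ((SketchASCIIObject *)op)->length;
    return ((SketchUnicodeObject *)op)->utf8_length;
}

/* wstr length: no surrogate pairs in ASCII, so it is the length too. */
static ssize_sketch_t sketch_wstr_length(void *op)
{
    if (SKETCH_IS_SHORT_ASCII(op))
        return ((SketchASCIIObject *)op)->length;
    return ((SketchUnicodeObject *)op)->wstr_length;
}
```

Callers only ever go through the accessors, so the extra fields are
touched only when the flag says they exist.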

As a result, ASCII-only strings would lose three pointer-sized fields
(utf8_length, utf8, and wstr_length) and shrink to their 3.2 structure
size. Since they also save space on the individual characters, strings
with more than 3 characters (16-bit Py_UNICODE) or more than one
character (32-bit Py_UNICODE) would see a total size reduction compared
to 3.2.
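
The per-character part of the saving can be sketched as plain
arithmetic. This assumes both layouts store the characters plus one
terminating null, and it ignores allocator rounding, so it does not
reproduce the exact thresholds quoted above. The function names are
made up for illustration.

```c
#include <stddef.h>

/* 3.2-style payload: (n + 1) code units of unit_size bytes each
   (unit_size is 2 or 4, the size of Py_UNICODE). */
static size_t sketch_payload_32(size_t n, size_t unit_size)
{
    return (n + 1) * unit_size;
}

/* ASCII-only payload: one byte per character plus the terminator. */
static size_t sketch_payload_ascii(size_t n)
{
    return n + 1;
}

/* Bytes saved on character data alone for an n-character string. */
static size_t sketch_payload_saved(size_t n, size_t unit_size)
{
    return sketch_payload_32(n, unit_size) - sketch_payload_ascii(n);
}
```

For a 10-character ASCII string and a 16-bit Py_UNICODE, this gives
22 bytes of character data in the 3.2 layout against 11 bytes in the
one-byte layout.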

Objects created through the legacy API (PyUnicode_FromUnicode)
that are only later found to be ASCII-only (in PyUnicode_Ready)
would still have the UTF-8 pointer shared with the data pointer,
but keep including separate fields for pointer & size.

What do you think?

Regards,
Martin

P.S. There are similar reductions that could be applied
to the wstr_length in general: on 32-bit wchar_t systems,
it could always be dropped; on a 16-bit wchar_t system,
it could be dropped for UCS-2 strings. However, I'm not
proposing these, as I think the increase in complexity
is not worth the savings.