[Python-Dev] PEP 393: Special-casing ASCII-only strings
"Martin v. Löwis"
martin at v.loewis.de
Thu Sep 15 17:50:41 CEST 2011
In reviewing memory usage, I found potential for saving more memory for
ASCII-only strings. Both Victor and Guido commented that something like
this be done; Antoine had asked whether there was anything that could
be done. Here is the idea:
In an ASCII-only string, the UTF-8 representation is shared with the
canonical one-byte representation. This would allow to drop the
UTF-8 pointer and the UTF-8 length field; instead, a flag in the state
would indicate that these fields are not there.
Likewise, the wchar_t/Py_UNICODE length can be shared (even though the
data cannot), since the ASCII-only string won't contain any surrogate
pairs.
To comply with the C aliasing rules, the structures would look like this:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
union {
void *any;
Py_UCS1 *latin1;
Py_UCS2 *ucs2;
Py_UCS4 *ucs4;
} data;
Py_hash_t hash;
int state; /* may include SSTATE_SHORT_ASCII flag */
wchar_t *wstr;
} PyASCIIObject;
typedef struct {
PyASCIIObject _base;
Py_ssize_t utf8_length;
char *utf8;
Py_ssize_t wstr_length;
} PyUnicodeObject;
Code that directly accesses the structures would become more
complex; code that use the accessor macros wouldn't notice.
As a result, ASCII-only strings would lose three pointers,
and shrink to their 3.2 structure size. Since they also save
in the individual characters, strings with more than
3 characters (16-bit Py_UNICODE) or more than one character
(32-bit Py_UNICODE) would see a total size reduction compared
to 3.2.
Objects created throught the legacy API (PyUnicode_FromUnicode)
that are only later found to be ASCII-only (in PyUnicode_Ready)
would still have the UTF-8 pointer shared with the data pointer,
but keep including separate fields for pointer & size.
What do you think?
Regards,
Martin
P.S. There are similar reductions that could be applied
to the wstr_length in general: on 32-bit wchar_t systems,
it could be always dropped, on a 16-bit wchar_t system,
it could be dropped for UCS-2 strings. However, I'm not
proposing these, as I think the increase in complexity
is not worth the savings.
More information about the Python-Dev
mailing list