PEP 393: Special-casing ASCII-only strings
In reviewing memory usage, I found potential for saving more memory for ASCII-only strings. Both Victor and Guido commented that something like this should be done; Antoine had asked whether there was anything that could be done. Here is the idea:

In an ASCII-only string, the UTF-8 representation is shared with the canonical one-byte representation. This would allow dropping the UTF-8 pointer and the UTF-8 length field; instead, a flag in the state would indicate that these fields are not there.

Likewise, the wchar_t/Py_UNICODE length can be shared (even though the data cannot), since an ASCII-only string won't contain any surrogate pairs.

To comply with the C aliasing rules, the structures would look like this:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;
        Py_hash_t hash;
        int state;     /* may include SSTATE_SHORT_ASCII flag */
        wchar_t *wstr;
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyUnicodeObject;

Code that directly accesses the structures would become more complex; code that uses the accessor macros wouldn't notice.

As a result, ASCII-only strings would lose three pointer-sized fields (utf8_length, utf8 and wstr_length), and shrink to their 3.2 structure size. Since they also save on the storage for the individual characters, strings with more than 3 characters (16-bit Py_UNICODE) or more than one character (32-bit Py_UNICODE) would see a total size reduction compared to 3.2.

Objects created through the legacy API (PyUnicode_FromUnicode) that are only later found to be ASCII-only (in PyUnicode_Ready) would still have the UTF-8 pointer shared with the data pointer, but would keep separate fields for the pointer and the size.

What do you think?

Regards,
Martin

P.S. There are similar reductions that could be applied to the wstr_length in general: on 32-bit wchar_t systems, it could always be dropped; on a 16-bit wchar_t system, it could be dropped for UCS-2 strings. However, I'm not proposing these, as I think the increase in complexity is not worth the savings.
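The header saving described above can be checked with a small stand-alone sketch. The two structs are copied from the message; everything else (the typedefs standing in for CPython's types, the simplified PyObject_HEAD, and the helper ascii_header_savings) is an assumption made so the fragment compiles without Python.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for CPython definitions (assumptions, not the real headers). */
typedef ptrdiff_t Py_ssize_t;
typedef Py_ssize_t Py_hash_t;
typedef uint8_t  Py_UCS1;
typedef uint16_t Py_UCS2;
typedef uint32_t Py_UCS4;
#define PyObject_HEAD Py_ssize_t ob_refcnt; void *ob_type;

/* Structs as proposed in the message. */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
    Py_hash_t hash;
    int state;     /* may include SSTATE_SHORT_ASCII flag */
    wchar_t *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;     /* first member, so a cast to PyASCIIObject* is valid */
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyUnicodeObject;

/* Hypothetical helper: header bytes saved by an ASCII-only string. */
static size_t ascii_header_savings(void)
{
    return sizeof(PyUnicodeObject) - sizeof(PyASCIIObject);
}
```

On a typical 64-bit build this difference would be 24 bytes (two Py_ssize_t fields and one pointer); the exact number depends on the platform's type sizes and padding.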
On 9/15/2011 11:50 AM, "Martin v. Löwis" wrote:
> To comply with the C aliasing rules, the structures would look like this:
>
>     typedef struct {
>         PyObject_HEAD
>         Py_ssize_t length;
>         union {
>             void *any;
>             Py_UCS1 *latin1;
>             Py_UCS2 *ucs2;
>             Py_UCS4 *ucs4;
>         } data;
>         Py_hash_t hash;
>         int state;     /* may include SSTATE_SHORT_ASCII flag */
>         wchar_t *wstr;
>     } PyASCIIObject;
>
>     typedef struct {
>         PyASCIIObject _base;
>         Py_ssize_t utf8_length;
>         char *utf8;
>         Py_ssize_t wstr_length;
>     } PyUnicodeObject;
>
> Code that directly accesses the structures would become more complex; code that uses the accessor macros wouldn't notice.
> ...
> What do you think?
I think that nearly all code outside CPython itself should treat these types, the unicode types especially, as opaque and access instances only through functions and macros -- the 'public' interfaces. We need to be free to fiddle with internal implementation details as experience suggests changes.
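As an illustration of why accessor macros insulate callers from the layout change, here is a minimal sketch. The flag name SSTATE_SHORT_ASCII comes from the proposal, but its value, the reduced structs (no PyObject_HEAD, hash or data union) and all DEMO_/demo_ names are hypothetical; they are not CPython's actual macros.

```c
#include <assert.h>
#include <stddef.h>

typedef ptrdiff_t Py_ssize_t;      /* stand-in for CPython's Py_ssize_t */

#define SSTATE_SHORT_ASCII 0x1     /* flag value is an assumption */

/* Reduced stand-ins: only the fields the accessors touch. */
typedef struct {
    Py_ssize_t length;
    int state;
    wchar_t *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyUnicodeObject;

/* True when the object is the short ASCII-only form. */
#define DEMO_IS_SHORT_ASCII(op) \
    ((((PyASCIIObject *)(op))->state & SSTATE_SHORT_ASCII) != 0)

/* For ASCII-only strings the UTF-8 length equals the character count,
   so the dedicated field is only read when it actually exists. */
#define DEMO_UTF8_LENGTH(op) \
    (DEMO_IS_SHORT_ASCII(op) \
        ? ((PyASCIIObject *)(op))->length \
        : ((PyUnicodeObject *)(op))->utf8_length)

/* Same idea for the wchar_t length (no surrogate pairs in ASCII). */
#define DEMO_WSTR_LENGTH(op) \
    (DEMO_IS_SHORT_ASCII(op) \
        ? ((PyASCIIObject *)(op))->length \
        : ((PyUnicodeObject *)(op))->wstr_length)

/* A 5-character ASCII-only string in the short form. */
static Py_ssize_t demo_utf8_length_of_ascii(void)
{
    PyASCIIObject s = { 5, SSTATE_SHORT_ASCII, NULL };
    return DEMO_UTF8_LENGTH(&s);
}

/* A 2-character string whose UTF-8 form takes 3 bytes. */
static Py_ssize_t demo_utf8_length_of_general(void)
{
    PyUnicodeObject u = { { 2, 0, NULL }, 3, NULL, 2 };
    return DEMO_UTF8_LENGTH(&u);
}
```

Callers written against the macros get the right answer for both layouts, which is the property that lets the internal structs change without breaking them.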
> P.S. There are similar reductions that could be applied to the wstr_length in general: on 32-bit wchar_t systems, it could be always dropped, on a 16-bit wchar_t system, it could be dropped for UCS-2 strings. However, I'm not proposing these, as I think the increase in complexity is not worth the savings.

I would certainly do just the one change now and see how it goes. I think you should be free to do more like the above if you change your mind with experience.

--
Terry Jan Reedy
On Thu, Sep 15, 2011 at 8:50 AM, "Martin v. Löwis" wrote:
This sounds like a good plan.

--
--Guido van Rossum (python.org/~guido)
On Thursday, September 15, 2011 at 17:50:41, Martin v. Löwis wrote:
> In reviewing memory usage, I found potential for saving more memory for ASCII-only strings. (...)
>
>     typedef struct {
>         PyObject_HEAD
>         Py_ssize_t length;
>         union {
>             void *any;
>             Py_UCS1 *latin1;
>             Py_UCS2 *ucs2;
>             Py_UCS4 *ucs4;
>         } data;
>         Py_hash_t hash;
>         int state;     /* may include SSTATE_SHORT_ASCII flag */
>         wchar_t *wstr;
>     } PyASCIIObject;
I like it. If we start with such optimizations, we can also remove data from strings allocated by the new API (it can be computed: object pointer + size of the structure). See my email for my proposed structures:

    Re: [Python-Dev] PEP 393 review
    Thu Aug 25 00:29:19 2011

You may reorganize fields to be able to cast PyUnicodeObject to PyASCIIObject.

Victor
> I like it. If we start with such optimizations, we can also remove data from strings allocated by the new API (it can be computed: object pointer + size of the structure). See my email for my proposed structures:
>
>     Re: [Python-Dev] PEP 393 review
>     Thu Aug 25 00:29:19 2011
I agree it is tempting to drop the data pointer. However, I'm not sure how many different structures we would end up with, and how the aliasing rules would defeat this (you cannot interpret a struct X* as a struct Y*, unless either X is the first field of Y or vice versa).

Thinking about this, the following may work:

- ASCIIObject: state, length, hash, wstr*, data follow
- SingleBlockUnicode: ASCIIObject, wstr_len, utf8*, utf8_len, data follow
- UnicodeObject: SingleBlockUnicode, data pointer, no data follow

This is essentially your proposal, except that the wstr_len is dropped for ASCII strings, and that it uses nested structs.

The single-block variants would always be "ready"; the full unicode object is ready only if the data pointer is set.

I'll try it out, unless somebody can punch a hole in this proposal :-)

Regards,
Martin
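The three-level layout in the list above might be sketched in C as follows. The struct names and field order follow the list; PyObject_HEAD is omitted for brevity, and the helper functions (ascii_data, single_block_data, ascii_new, unicode_is_ready) are hypothetical names introduced only to show how "data follow" turns into pointer arithmetic rather than a stored pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef ptrdiff_t Py_ssize_t;      /* stand-in for CPython's Py_ssize_t */

typedef struct {
    int state;
    Py_ssize_t length;
    Py_ssize_t hash;
    wchar_t *wstr;
    /* character data follows the struct in the same memory block */
} ASCIIObject;

typedef struct {
    ASCIIObject base;              /* first member: cast to ASCIIObject* is allowed */
    Py_ssize_t wstr_len;
    char *utf8;
    Py_ssize_t utf8_len;
    /* character data follows the struct in the same memory block */
} SingleBlockUnicode;

typedef struct {
    SingleBlockUnicode base;       /* first member: cast to the smaller types is allowed */
    void *data;                    /* separate block; NULL until the object is ready */
} UnicodeObject;

/* The data pointer is computed as object pointer + size of the structure. */
static void *ascii_data(ASCIIObject *op)
{
    return (char *)op + sizeof(ASCIIObject);
}

static void *single_block_data(SingleBlockUnicode *op)
{
    return (char *)op + sizeof(SingleBlockUnicode);
}

/* Allocate header and characters in one block (text is assumed to be ASCII). */
static ASCIIObject *ascii_new(const char *text)
{
    size_t n = strlen(text);
    ASCIIObject *op = malloc(sizeof(ASCIIObject) + n + 1);
    if (op == NULL)
        return NULL;
    op->state = 0;
    op->length = (Py_ssize_t)n;
    op->hash = -1;
    op->wstr = NULL;
    memcpy(ascii_data(op), text, n + 1);   /* copies the terminating NUL too */
    return op;
}

/* Only the full object can be "not ready": its data pointer may be unset. */
static int unicode_is_ready(const UnicodeObject *op)
{
    return op->data != NULL;
}
```

Because each larger struct embeds the smaller one as its first member, a pointer to any of the three can be read through the ASCIIObject view, which is the nesting the aliasing rules permit.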
On Fri, Sep 16, 2011 at 7:39 AM, "Martin v. Löwis" wrote:
> Thinking about this, the following may work:
>
> - ASCIIObject: state, length, hash, wstr*, data follow
> - SingleBlockUnicode: ASCIIObject, wstr_len, utf8*, utf8_len, data follow
> - UnicodeObject: SingleBlockUnicode, data pointer, no data follow
>
> This is essentially your proposal, except that the wstr_len is dropped for ASCII strings, and that it uses nested structs.
>
> The single-block variants would always be "ready", the full unicode object is ready only if the data pointer is set.
In your "UnicodeObject" here, is the 'data pointer' the any/latin1/ucs2/ucs4 union from the original structure definition?

Also, what are the constraints on the "SingleBlockUnicode"? Does it only hold strings that can be represented in latin1? Or can the size of the individual elements be more than 1 byte?

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 16.09.11 at 00:42, Nick Coghlan wrote:
> On Fri, Sep 16, 2011 at 7:39 AM, "Martin v. Löwis" wrote:
>> Thinking about this, the following may work:
>>
>> - ASCIIObject: state, length, hash, wstr*, data follow
>> - SingleBlockUnicode: ASCIIObject, wstr_len, utf8*, utf8_len, data follow
>> - UnicodeObject: SingleBlockUnicode, data pointer, no data follow
>>
>> This is essentially your proposal, except that the wstr_len is dropped for ASCII strings, and that it uses nested structs.
>>
>> The single-block variants would always be "ready", the full unicode object is ready only if the data pointer is set.
>
> In your "UnicodeObject" here, is the 'data pointer' the any/latin1/ucs2/ucs4 union from the original structure definition?
Yes, it is. I'm considering dropping the union again, since you'll have to cast the data pointer anyway in the compact cases.
> Also, what are the constraints on the "SingleBlockUnicode"? Does it only hold strings that can be represented in latin1? Or can the size of the individual elements be more than 1 byte?
Any size - what matters is whether the maximum character is known at creation time (i.e. whether you've used PyUnicode_New(size, maxchar) or PyUnicode_FromUnicode(NULL, size)). In the latter case, a Py_UNICODE block will be allocated in wstr, and the data pointer left NULL. Then, when PyUnicode_Ready is called, the maximum character is determined in the Py_UNICODE block, and a new data block is allocated - but that will have to be a second memory block (the Py_UNICODE block is then dropped in _Ready).

Regards,
Martin
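The "ready" step described above (scan the legacy buffer for the maximum character, then allocate a second block in the narrowest sufficient width) could look roughly like this. This is only a sketch under stated assumptions: uint32_t stands in for a 32-bit Py_UNICODE, surrogate pairs in a 16-bit wchar_t buffer are not handled, and the demo_ function names are hypothetical rather than CPython's implementation of PyUnicode_Ready.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes per character of the narrowest representation that holds maxchar. */
static int demo_kind_for_maxchar(uint32_t maxchar)
{
    if (maxchar < 0x100)
        return 1;                  /* one-byte (latin1) storage */
    if (maxchar < 0x10000)
        return 2;                  /* UCS-2 storage */
    return 4;                      /* UCS-4 storage */
}

/* Determine the maximum character in the legacy buffer. */
static uint32_t demo_find_maxchar(const uint32_t *wstr, size_t n)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        if (wstr[i] > maxchar)
            maxchar = wstr[i];
    }
    return maxchar;
}

/* Allocate a second block in the chosen width and copy the characters.
   Returns NULL on allocation failure; the caller owns the result. */
static void *demo_ready(const uint32_t *wstr, size_t n, int *kind_out)
{
    int kind = demo_kind_for_maxchar(demo_find_maxchar(wstr, n));
    /* calloc zero-fills, so the extra slot acts as the terminator. */
    void *data = calloc(n + 1, (size_t)kind);
    if (data == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        switch (kind) {
        case 1:
            ((uint8_t *)data)[i] = (uint8_t)wstr[i];
            break;
        case 2:
            ((uint16_t *)data)[i] = (uint16_t)wstr[i];
            break;
        default:
            ((uint32_t *)data)[i] = wstr[i];
            break;
        }
    }
    *kind_out = kind;
    return data;
}
```

In the scheme Martin describes, the original Py_UNICODE block would be released once the new block is filled; the sketch leaves that to the caller.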
participants (5)
- "Martin v. Löwis"
- Guido van Rossum
- Nick Coghlan
- Terry Reedy
- Victor Stinner