[Python-Dev] PEP 393: Special-casing ASCII-only strings

Thu Sep 15 20:46:11 CEST 2011

On 9/15/2011 11:50 AM, "Martin v. Löwis" wrote:

> To comply with the C aliasing rules, the structures would look like this:
>
> typedef struct {
>     PyObject_HEAD
>     Py_ssize_t length;
>     union {
>         void *any;
>         Py_UCS1 *latin1;
>         Py_UCS2 *ucs2;
>         Py_UCS4 *ucs4;
>     } data;
>     Py_hash_t hash;
>     int state;     /* may include SSTATE_SHORT_ASCII flag */
>     wchar_t *wstr;
> } PyASCIIObject;
>
>
> typedef struct {
>     PyASCIIObject _base;
>     Py_ssize_t utf8_length;
>     char *utf8;
>     Py_ssize_t wstr_length;
> } PyUnicodeObject;
>
> Code that directly accesses the structures would become more
> complex; code that uses the accessor macros wouldn't notice.
...
> What do you think?

I think that nearly all code outside CPython itself should treat these
types, the unicode types especially, as opaque and access instances only
through functions and macros, the 'public' interfaces. We need to be free
to change internal implementation details as experience suggests.
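
To illustrate why accessor-only code would not notice the split, here is a
rough sketch. It is not the actual CPython code: the Demo* types, the
DEMO_* macros, the flag value, and demo_utf8_length() are all made-up
stand-ins so that it compiles without Python.h. It also assumes what I take
to be the point of the special case, namely that an ASCII-only string has a
UTF-8 length equal to its character length.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the real types (assumptions, not the CPython definitions). */
typedef unsigned char Py_UCS1;
typedef ptrdiff_t demo_ssize_t;

#define DEMO_STATE_SHORT_ASCII 1   /* hypothetical flag value */

/* Short form: enough for ASCII-only strings. */
typedef struct {
    demo_ssize_t length;
    union {
        void *any;
        Py_UCS1 *latin1;
    } data;
    int state;
} DemoASCIIObject;

/* Full form: embeds the short form as its first member. */
typedef struct {
    DemoASCIIObject _base;
    demo_ssize_t utf8_length;
    char *utf8;
} DemoUnicodeObject;

/* Because _base is the first member, a pointer to either struct can be
   read through DemoASCIIObject for the shared fields. */
#define DEMO_GET_LENGTH(op) (((DemoASCIIObject *)(op))->length)
#define DEMO_IS_SHORT_ASCII(op) \
    ((((DemoASCIIObject *)(op))->state & DEMO_STATE_SHORT_ASCII) != 0)

/* Callers never touch utf8_length directly, so they do not care which
   of the two layouts they were handed. */
static demo_ssize_t demo_utf8_length(void *op)
{
    assert(op != NULL);
    if (DEMO_IS_SHORT_ASCII(op)) {
        /* ASCII only: one byte per character, no extra field needed. */
        return DEMO_GET_LENGTH(op);
    }
    return ((DemoUnicodeObject *)op)->utf8_length;
}
```

Code that calls demo_utf8_length() or DEMO_GET_LENGTH() keeps working if the
fields move again later; code that pokes at utf8_length directly is the code
that breaks.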

> P.S. There are similar reductions that could be applied
> to the wstr_length in general: on 32-bit wchar_t systems,
> it could be always dropped, on a 16-bit wchar_t system,
> it could be dropped for UCS-2 strings. However, I'm not
> proposing these, as I think the increase in complexity
> is not worth the savings.

I would certainly do just the one change now and see how it goes. I 
think you should be free to do more like the above if you change your 
mind with experience.
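
For what it is worth, the trade-off can be put in rough numbers. The sketch
below is only an estimate under assumptions: DemoBase and DemoFull are
simplified stand-ins (no PyObject_HEAD, a plain pointer instead of the
union, long in place of Py_hash_t), and demo_wstr_length_needed() is my
reading of the P.S., with the "kind" argument being a hypothetical
bytes-per-character code (1, 2 or 4).

```c
#include <stddef.h>
#include <wchar.h>

/* Simplified stand-in for the short (ASCII-only) layout. */
typedef struct {
    ptrdiff_t length;
    void *data;
    long hash;
    int state;
    wchar_t *wstr;
} DemoBase;

/* Simplified stand-in for the full layout. */
typedef struct {
    DemoBase _base;
    ptrdiff_t utf8_length;
    char *utf8;
    ptrdiff_t wstr_length;
} DemoFull;

/* Bytes saved per ASCII-only string by allocating only the short form. */
static size_t demo_bytes_saved_per_ascii_string(void)
{
    return sizeof(DemoFull) - sizeof(DemoBase);
}

/* Reading of the P.S.: a separate wstr_length is only needed when the
   wchar_t form can differ in length from the string itself, i.e. a
   2-byte wchar_t holding a 4-byte-per-character (UCS-4) string, where
   surrogate pairs may appear. */
static int demo_wstr_length_needed(size_t wchar_size, int kind)
{
    return wchar_size == 2 && kind == 4;
}
```

The first function shows what the one change buys on a given platform; the
second shows the extra case analysis the P.S. variants would add on top.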

-- 
Terry Jan Reedy