[Python-Dev] PEP 393 review

Stefan Behnel stefan_ml at behnel.de
Thu Aug 25 06:46:50 CEST 2011


Victor Stinner, 25.08.2011 00:29:
>> With this PEP, the unicode object overhead grows to 10 pointer-sized
>> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
>> Does it have any adverse effects?
>
> For pure ASCII, it might be possible to use a shorter struct:
>
> typedef struct {
>      PyObject_HEAD
>      Py_ssize_t length;
>      Py_hash_t hash;
>      int state;
>      Py_ssize_t wstr_length;
>      wchar_t *wstr;
>      /* no more utf8_length, utf8, str */
>      /* followed by ascii data */
> } _PyASCIIObject;
> (-2 pointer -1 ssize_t: 56 bytes)
>
> =>  "a" is 58 bytes (with utf8 for free, without wchar_t)
>
> For object allocated with the new API, we can use a shorter struct:
>
> typedef struct {
>      PyObject_HEAD
>      Py_ssize_t length;
>      Py_hash_t hash;
>      int state;
>      Py_ssize_t wstr_length;
>      wchar_t *wstr;
>      Py_ssize_t utf8_length;
>      char *utf8;
>      /* no more str pointer */
>      /* followed by latin1/ucs2/ucs4 data */
> } _PyNewUnicodeObject;
> (-1 pointer: 72 bytes)
>
> =>  "é" is 74 bytes (without utf8 / wchar_t)
>
> For the legacy API:
>
> typedef struct {
>      PyObject_HEAD
>      Py_ssize_t length;
>      Py_hash_t hash;
>      int state;
>      Py_ssize_t wstr_length;
>      wchar_t *wstr;
>      Py_ssize_t utf8_length;
>      char *utf8;
>      void *str;
> } _PyLegacyUnicodeObject;
> (same size: 80 bytes)
>
> =>  "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
>
> The current struct:
>
> typedef struct {
>      PyObject_HEAD
>      Py_ssize_t length;
>      Py_UNICODE *str;
>      Py_hash_t hash;
>      int state;
>      PyObject *defenc;
> } PyUnicodeObject;
>
> =>  "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
> wchar_t)
>
> ... but the code (maybe only the macros?) and debugging will be more complex.

That's an interesting idea. However, it's not required to do this as part 
of the PEP 393 implementation. It can be added later if the need becomes 
evident in general practice.

Also, there is always the possibility of simply interning very short strings 
in order to avoid duplicating them in memory. Long strings don't suffer from 
this as the data size quickly dominates the per-object overhead. User code 
that works with a lot of short strings would likely intern them in the same way.
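Just as a rough sketch of what I mean from a C extension (PyUnicode_InternInPlace() 
is the existing call; the helper name and its use are made up for illustration):

  #include <Python.h>

  static PyObject *
  make_key(const char *text)
  {
      PyObject *key = PyUnicode_FromString(text);
      if (key == NULL)
          return NULL;
      /* Swap in the canonical interned object, so that repeated
         short keys share a single unicode object in memory. */
      PyUnicode_InternInPlace(&key);
      return key;
  }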

BTW, I would expect that many short strings either go away as quickly as 
they appeared (e.g. in a parser) or were brought in as literals and are 
therefore interned anyway. That's just one reason why I suggest waiting for 
proof of inefficiency in the real world (and, obviously, testing your own 
code with this as quickly as possible).


>> Will the format codes returning a Py_UNICODE pointer with
>> PyArg_ParseTuple be deprecated?
>
> Because Python 2.x is still dominant and it's already hard enough to port C
> modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

Well, it will be quite inefficient in future CPython versions, so I think 
that if it's not officially deprecated at some point, it will deprecate itself 
for efficiency reasons. Better to make it clear now that porting away from 
Py_UNICODE is worth the investment for performance alone.
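To be concrete, here is a rough sketch of what such a port could look like, 
using the accessor macros the PEP proposes (PyUnicode_READY, PyUnicode_KIND, 
PyUnicode_DATA, PyUnicode_READ); the function itself is made up for illustration:

  #include <Python.h>

  /* Legacy code would use PyArg_ParseTuple(args, "u", ...) and get a
     Py_UNICODE*, forcing the wstr representation.  "U" hands over the
     unicode object itself, with no conversion at all. */
  static PyObject *
  count_spaces(PyObject *self, PyObject *args)
  {
      PyObject *str;
      int kind;
      void *data;
      Py_ssize_t i, n, count = 0;

      if (!PyArg_ParseTuple(args, "U", &str))
          return NULL;
      if (PyUnicode_READY(str) < 0)      /* ensure the canonical form */
          return NULL;

      kind = PyUnicode_KIND(str);
      data = PyUnicode_DATA(str);
      n = PyUnicode_GET_LENGTH(str);
      for (i = 0; i < n; i++) {
          if (PyUnicode_READ(kind, data, i) == ' ')
              count++;
      }
      return PyLong_FromSsize_t(count);
  }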


>> Do you think the wstr representation could be removed in some future
>> version of Python?
>
> Conversion to wchar_t* is common, especially on Windows.

That's an issue. However, I cannot say how common this really is in 
practice. It surely depends on the specific code, right? How common is it in 
core CPython?


> But I don't know if
> we *have to* cache the result. Is it cached by the way? Or is wstr only used
> when a string is created from Py_UNICODE?

If it's so common on Windows, maybe it should only be cached there?
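If not, callers can always convert on demand rather than relying on a cached 
wstr slot. A rough sketch (PyUnicode_AsWideCharString() exists since 3.2; the 
surrounding function is made up):

  #include <Python.h>

  static int
  pass_to_win32(PyObject *path)
  {
      Py_ssize_t size;
      wchar_t *wpath = PyUnicode_AsWideCharString(path, &size);
      if (wpath == NULL)
          return -1;
      /* ... hand wpath to the Win32 API here ... */
      PyMem_Free(wpath);   /* fresh buffer, owned by the caller */
      return 0;
  }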

Stefan


