[Python-Dev] The future of the wchar_t cache

Sat Oct 20 07:06:49 EDT 2018

Currently the PyUnicode object contains two caches: one for the UTF-8 
representation and one for the wchar_t representation. They exist not as 
an optimization but to support C API functions that return borrowed 
references to these representations.

The UTF-8 cache has always been part of unicode objects (although in 
Python 2 it was not a UTF-8 cache but a cache of the 8-bit 
representation). Initially it was needed for compatibility with 8-bit 
str, to implement the "s" and "z" format units in PyArg_Parse(). Now it 
is also used by PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().
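
The borrowed-pointer behaviour can be observed from Python itself. This 
is a minimal sketch, assuming CPython (where ctypes.pythonapi exposes the 
C API); the helper name utf8_pointer is hypothetical. It calls 
PyUnicode_AsUTF8AndSize() twice on the same object and keeps the raw 
addresses for comparison.

```python
import ctypes

# Assumption: running on CPython, so ctypes.pythonapi exposes the C API.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8.restype = ctypes.c_void_p  # keep the raw address instead of copying


def utf8_pointer(text):
    """Return (address, size) of the UTF-8 buffer owned by `text`."""
    size = ctypes.c_ssize_t()
    address = _as_utf8(text, ctypes.byref(size))
    return address, size.value


sample = "caf\u00e9"            # non-ASCII, so the UTF-8 form is a separate buffer
first = utf8_pointer(sample)    # first call fills the cache
second = utf8_pointer(sample)   # second call returns the cached buffer
```

Both calls should report the same address, because the buffer belongs to 
the string object and lives as long as it does. The caller never frees 
it, which is why the object has to keep it around.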

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the 
former Py_UNICODE representation. Py_UNICODE is now defined as an alias 
of wchar_t, and the C API which used to return a pointer to the 
Py_UNICODE content now returns a pointer to the cached wchar_t 
representation. This covers the "u" and "Z" format units in 
PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), 
PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE() 
and PyUnicode_AS_DATA().

All this increases the size of the unicode object. There is the constant 
overhead of the additional pointer and size fields, plus the overhead of 
the cached representation, which is proportional to the string length. 
The following table shows the number of bytes per character for the 
different kinds, with and without the specified caches filled.

        raw  +utf8     +wchar_t       +utf8+wchar_t
                    Windows  Linux   Windows  Linux
ASCII   1     1       3       5        3       5
UCS1    1    2-3      3       5       4-5     6-7
UCS2    2    3-5      2       6       3-5     7-9
UCS4    4    5-8     6-8      4       7-12    5-8
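
The numbers in the table can be reproduced with a little arithmetic. The 
sketch below is only an illustration under stated assumptions: wchar_t is 
2 bytes on Windows and 4 bytes on Linux, a cache costs nothing when it 
can share the string's own data (UTF-8 for ASCII strings, wchar_t when 
its width equals the kind's width), and the names KIND_WIDTH, UTF8_EXTRA, 
wchar_extra and bytes_per_char are hypothetical.

```python
# Bytes per character of the canonical PEP 393 representation.
KIND_WIDTH = {"ASCII": 1, "UCS1": 1, "UCS2": 2, "UCS4": 4}

# Extra bytes per character for the UTF-8 cache, as (min, max).
# ASCII strings share their data with the UTF-8 form, so the cost is 0.
UTF8_EXTRA = {"ASCII": (0, 0), "UCS1": (1, 2), "UCS2": (1, 3), "UCS4": (1, 4)}


def wchar_extra(kind, wchar_size):
    """Extra bytes per character for the wchar_t cache, as (min, max)."""
    width = KIND_WIDTH[kind]
    if width == wchar_size:
        return (0, 0)                       # data is shared, no separate buffer
    if width > wchar_size:
        return (wchar_size, 2 * wchar_size)  # UTF-16: one or two code units
    return (wchar_size, wchar_size)          # one wchar_t per character


def bytes_per_char(kind, wchar_size, utf8=False, wchar=False):
    """Total bytes per character, as (min, max), with the given caches filled."""
    low = high = KIND_WIDTH[kind]
    if utf8:
        low, high = low + UTF8_EXTRA[kind][0], high + UTF8_EXTRA[kind][1]
    if wchar:
        extra = wchar_extra(kind, wchar_size)
        low, high = low + extra[0], high + extra[1]
    return low, high


WINDOWS, LINUX = 2, 4  # assumed sizeof(wchar_t) on each platform

for kind in KIND_WIDTH:
    print(kind,
          bytes_per_char(kind, WINDOWS, utf8=True, wchar=True),
          bytes_per_char(kind, LINUX, utf8=True, wchar=True))
```

A pair with equal values corresponds to a single number in the table, 
and a pair with different values to a range.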

There is also a new C API, added in 3.3, for getting the wchar_t 
representation in a way that does not require the cache: 
PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Their current 
implementation nevertheless uses the cache, which has both benefits and 
disadvantages.
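
For contrast with the borrowed-pointer API, PyUnicode_AsWideCharString() 
hands the caller a freshly allocated buffer that must be released with 
PyMem_Free(). A minimal sketch, assuming CPython and ctypes; the helper 
name wide_copy is hypothetical.

```python
import ctypes

# Assumption: running on CPython, so ctypes.pythonapi exposes the C API.
_as_wide = ctypes.pythonapi.PyUnicode_AsWideCharString
_as_wide.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_wide.restype = ctypes.c_void_p  # raw address of the new wchar_t buffer

_mem_free = ctypes.pythonapi.PyMem_Free
_mem_free.argtypes = [ctypes.c_void_p]
_mem_free.restype = None


def wide_copy(text):
    """Return (string, length in wchar_t units) read back from a fresh copy."""
    size = ctypes.c_ssize_t()
    address = _as_wide(text, ctypes.byref(size))
    try:
        # Read the buffer back into a Python str before releasing it.
        return ctypes.wstring_at(address, size.value), size.value
    finally:
        _mem_free(address)  # the caller owns the buffer and must free it


result = wide_copy("h\u00e9llo")
```

Because ownership passes to the caller, nothing in this interface 
requires the string object to keep a wchar_t copy of itself.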

The old Py_UNICODE based API is deprecated and will eventually be 
removed. I want to ask about the future of the wchar_t cache. Is the 
benefit of caching the wchar_t representation larger than the 
disadvantage of spending more memory? The wchar_t representation is as 
natural for the Windows API as the UTF-8 representation is for the POSIX 
API. But in all other cases it is just a waste of memory. Are there 
reasons to keep the wchar_t cache after the deprecated API is removed?

I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). 
They were implemented via the old Py_UNICODE based API, and now they no 
longer use deprecated functions. They still use the wchar_t cache if it 
was created by a previous use of the deprecated API, but they do not 
create it themselves. Is this the correct decision?

https://bugs.python.org/issue30863