[Python-Dev] The future of the wchar_t cache
storchaka at gmail.com
Sat Oct 20 07:06:49 EDT 2018
Currently the PyUnicode object contains two caches: one for the UTF-8
representation and one for the wchar_t representation. They exist not as
an optimization but to support C API functions that return borrowed
references to these representations.
The UTF-8 cache has always been part of unicode objects (although in
Python 2 it was not a UTF-8 cache but an 8-bit representation cache).
Initially it was needed for compatibility with 8-bit str, to implement
the "s" and "z" format units in PyArg_Parse(). Now it is also used by
PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().
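To make the "borrowed reference" point concrete, here is a minimal sketch that calls PyUnicode_AsUTF8() through ctypes. Going through ctypes is only a convenience for the demonstration, and the variable names are made up. It assumes CPython, where the second call should hand back the same cached buffer rather than a fresh copy.

```python
import ctypes
import sys

# PyUnicode_AsUTF8() is a real CPython C API function; calling it through
# ctypes.pythonapi is only a convenience for this demonstration.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p   # keep the raw address, do not copy

# 1000 non-ASCII characters: stored as 1 byte each, but 2 bytes each in UTF-8.
s = "".join(["\u00e9"] * 1000)
size_before = sys.getsizeof(s)

p1 = as_utf8(s)                     # first call fills the UTF-8 cache
p2 = as_utf8(s)                     # second call returns the cached buffer
size_after = sys.getsizeof(s)

same_buffer = (p1 == p2)
cached_bytes = ctypes.string_at(p1, 2000)

# The size difference is whatever this interpreter reports for the cache.
print(same_buffer, size_after - size_before)
```

The pointer stays valid only as long as the string object is alive, which is why the object has to keep the buffer around.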
The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the
former Py_UNICODE representation. Py_UNICODE is now defined as an alias
of wchar_t, and the C API that used to return a pointer to the
Py_UNICODE content now returns a pointer to the cached wchar_t
representation. This covers the "u" and "Z" format units in
PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_GET_DATA_SIZE(),
PyUnicode_AS_UNICODE() and PyUnicode_AS_DATA().
All this increases the size of the unicode object: there is a constant
overhead for the additional pointer and size fields, plus the overhead
of each cached representation, which is proportional to the string
length. The following table shows the number of bytes per character for
the different kinds, with and without the specified caches filled.
        raw   +utf8      +wchar_t       +utf8+wchar_t
                      Windows  Linux   Windows  Linux
ASCII    1      1        3       5        3       5
UCS1     1     2-3       3       5       4-5     6-7
UCS2     2     3-5       2       6       3-5     7-9
UCS4     4     5-8      6-8      4       7-12    5-8
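The arithmetic behind the table can be sketched as a small cost model. This is my reading of the numbers, not code from CPython: it assumes sizeof(wchar_t) is 2 on Windows and 4 on Linux, that a cache costs nothing extra when it can share the raw data (UTF-8 for ASCII strings, wchar_t when its size matches the kind), and that non-BMP characters need a surrogate pair in a 2-byte wchar_t. All names below are hypothetical.

```python
# Bytes used per character by the raw PEP 393 storage of each kind.
KIND_SIZE = {"ASCII": 1, "UCS1": 1, "UCS2": 2, "UCS4": 4}
# Longest UTF-8 encoding of a single character that fits in each kind.
MAX_UTF8 = {"ASCII": 1, "UCS1": 2, "UCS2": 3, "UCS4": 4}


def utf8_extra(kind):
    """(min, max) extra bytes per character for a filled UTF-8 cache."""
    if kind == "ASCII":
        return (0, 0)               # ASCII data already is valid UTF-8
    return (1, MAX_UTF8[kind])


def wchar_extra(kind, wchar_size):
    """(min, max) extra bytes per character for a filled wchar_t cache."""
    if KIND_SIZE[kind] == wchar_size:
        return (0, 0)               # raw data doubles as the wchar_t array
    if kind == "UCS4" and wchar_size == 2:
        return (2, 4)               # non-BMP characters need a surrogate pair
    return (wchar_size, wchar_size)


def bytes_per_char(kind, utf8=False, wchar_size=None):
    """(min, max) total bytes per character with the given caches filled."""
    lo = hi = KIND_SIZE[kind]
    if utf8:
        a, b = utf8_extra(kind)
        lo, hi = lo + a, hi + b
    if wchar_size is not None:
        a, b = wchar_extra(kind, wchar_size)
        lo, hi = lo + a, hi + b
    return (lo, hi)


WINDOWS, LINUX = 2, 4               # assumed sizeof(wchar_t) per platform
for kind in KIND_SIZE:
    print(kind,
          bytes_per_char(kind),
          bytes_per_char(kind, utf8=True),
          bytes_per_char(kind, wchar_size=WINDOWS),
          bytes_per_char(kind, wchar_size=LINUX),
          bytes_per_char(kind, utf8=True, wchar_size=WINDOWS),
          bytes_per_char(kind, utf8=True, wchar_size=LINUX))
```

Each (min, max) pair corresponds to one cell of the table; a pair with equal values is a cell with a single number.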
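There is also a C API, added in 3.3, for getting the wchar_t
representation that does not need the cache, because it copies the
result into a caller-provided or newly allocated buffer:
PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Currently its
implementation still uses the cache, which has both benefits and
disadvantages.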
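For contrast with the borrowed-pointer functions, here is a minimal sketch of the copying API, again driven through ctypes purely for illustration (the variable names are made up). PyUnicode_AsWideCharString() returns a buffer owned by the caller, so two calls give two separate buffers, each of which must be released with PyMem_Free().

```python
import ctypes

api = ctypes.pythonapi
api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None

s = "caf\u00e9"
n1, n2 = ctypes.c_ssize_t(), ctypes.c_ssize_t()

# Each call allocates a new wchar_t buffer that the caller owns.
buf1 = api.PyUnicode_AsWideCharString(s, ctypes.byref(n1))
buf2 = api.PyUnicode_AsWideCharString(s, ctypes.byref(n2))

distinct = (buf1 != buf2)                    # both are alive at the same time
text = ctypes.wstring_at(buf1, n1.value)     # copy the contents back out
wchar_size = ctypes.sizeof(ctypes.c_wchar)   # typically 2 on Windows, 4 on Linux

# The caller is responsible for freeing both copies.
api.PyMem_Free(buf1)
api.PyMem_Free(buf2)

print(distinct, text, n1.value, wchar_size)
```

Because ownership passes to the caller, nothing in this contract requires the string object itself to keep a wchar_t copy around.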
The old Py_UNICODE-based API is deprecated and will eventually be removed.
I want to ask about the future of the wchar_t cache. Is the benefit of
caching the wchar_t representation larger than the disadvantage of
spending more memory? The wchar_t representation is as natural for the
Windows API as the UTF-8 representation is for the POSIX API. But in all
other cases it is just a waste of memory. Are there reasons for keeping
the wchar_t cache after the deprecated API is removed?
I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString().
They were implemented via the old Py_UNICODE based API, and now they
don't use deprecated functions. They still use the wchar_t cache if it
was created by previous use of the deprecated API, but don't create it
themselves. Is this the correct decision?