[capi-sig] Re: HPy API design for bytes and unicode builders

10 Jun 2021


On 10. 06. 21 17:04, Antoine Pitrou wrote:
...
> Separately from the builder API, there are cases where the data already
> exists somewhere as a full-blown UTF8 string (this is of course more and
> more common, since UTF8 is ubiquitous).  There should be a fast
> conversion method from a UTF8 memory area to a unicode object.

Indeed. To me it seems that the analysis misses the importance of UTF-8.
CPython strings currently always have the {1,2,4}-byte "raw buffer"
representation. The UTF-8 representation is computed when needed, and
*permanently stored* in the str object. This detail leaks to the API:
notice how PyUnicode_AsUTF8AndSize gives you a const char* (tied to
the lifetime of the string object), while other codecs can only give you
PyBytes.

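To make that concrete, here is a small sketch that calls the C API from
Python through ctypes.pythonapi, purely so it runs without compiling an
extension; real extension code would call the same functions from C. The
sample string and variable names are my own, and the sys.getsizeof
observation is a CPython implementation detail, not a guarantee.

```python
import ctypes
import sys

api = ctypes.pythonapi

# const char *PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsUTF8AndSize.argtypes = [ctypes.py_object,
                                        ctypes.POINTER(ctypes.c_ssize_t)]
api.PyUnicode_AsUTF8AndSize.restype = ctypes.c_void_p  # raw address, no copy

# PyObject *PyUnicode_AsUTF16String(PyObject *unicode)  (new reference)
api.PyUnicode_AsUTF16String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF16String.restype = ctypes.py_object

# Build a non-ASCII string at run time so its UTF-8 form is not cached yet.
s = "".join(["caf", chr(0xE9), " ", chr(0x2615)])

size_before = sys.getsizeof(s)
n = ctypes.c_ssize_t()
p1 = api.PyUnicode_AsUTF8AndSize(s, ctypes.byref(n))
p2 = api.PyUnicode_AsUTF8AndSize(s, ctypes.byref(n))
size_after = sys.getsizeof(s)

# The pointer is only valid while s is alive, so copy the bytes out now.
utf8_bytes = ctypes.string_at(p1, n.value)

# Any other codec hands back a freshly allocated bytes object each time.
utf16_a = api.PyUnicode_AsUTF16String(s)
utf16_b = api.PyUnicode_AsUTF16String(s)

print("same UTF-8 pointer on both calls:", p1 == p2)
print("UTF-8 bytes:", utf8_bytes)
print("UTF-16 results are distinct objects:", utf16_a is not utf16_b)
print("getsizeof before/after the UTF-8 call:", size_before, size_after)
```

The repeated call returns the same address because the buffer lives inside
the str object, which is exactly why the API can hand out a borrowed
const char* for UTF-8 and nothing else.
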
I can imagine a future where it could also go the other way: a string
could start out with only a UTF-8 representation, and the {1,2,4}-byte
one would be created only when needed (indexing/slicing, getting the
length, etc.).

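A toy model of that idea, in Python for brevity. This is only my
illustration of the lazy direction, not how CPython or any HPy backend
works; the class and method names are made up.

```python
class Utf8FirstStr:
    """Hypothetical string that stores UTF-8 first and decodes lazily."""

    def __init__(self, utf8: bytes):
        self._utf8 = utf8      # primary representation, always present
        self._fixed = None     # fixed-width form, built on first need

    def as_utf8(self) -> bytes:
        # Cheap: no decoding is required to hand out the UTF-8 bytes.
        return self._utf8

    def _ensure_fixed(self) -> str:
        # A built-in str stands in for the {1,2,4}-byte buffer here.
        if self._fixed is None:
            self._fixed = self._utf8.decode("utf-8")
        return self._fixed

    @property
    def has_fixed_width(self) -> bool:
        return self._fixed is not None

    def __len__(self) -> int:
        return len(self._ensure_fixed())

    def __getitem__(self, index):
        return self._ensure_fixed()[index]


text = Utf8FirstStr("h\u00e9llo".encode("utf-8"))
passed_through = text.as_utf8()           # no decode happened yet
decoded_before = text.has_fixed_width
length = len(text)                        # first operation that needs indexing
decoded_after = text.has_fixed_width
print(decoded_before, length, decoded_after)
```

A string that is only ever passed through as UTF-8 would never pay for the
fixed-width buffer in such a design.
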
The analysis says that PyUnicode_FromStringAndSize is deprecated. Please
check PEP 623 again: only the special case of calling it with NULL is
deprecated. The function, which decodes from UTF-8, is very useful.
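For reference, a sketch of the non-deprecated use: decoding an existing
UTF-8 memory area of known size. It goes through ctypes.pythonapi again
only so it can be run as-is; the bytes object stands in for whatever
buffer the extension already holds, and the names are mine.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromStringAndSize(const char *str, Py_ssize_t size)
api.PyUnicode_FromStringAndSize.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
api.PyUnicode_FromStringAndSize.restype = ctypes.py_object

# Stand-in for UTF-8 data that already exists somewhere in memory.
data = "h\u00e9llo w\u00f6rld".encode("utf-8")

# Non-NULL buffer plus explicit size: this form is not deprecated.
whole = api.PyUnicode_FromStringAndSize(data, len(data))

# The size is in bytes: 'h' (1) + 'é' (2) + 'llo' (3) = 6.
prefix = api.PyUnicode_FromStringAndSize(data, 6)

print(whole)
print(prefix)
```

The size must end on a character boundary; cutting a multi-byte sequence
in half makes the call fail with a UnicodeDecodeError.
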

On 10. 06. 21 16:48, Antonio Cuni wrote:
...
> Also, if you think that these kinds of mails are off-topic in this ML,
> please let me know and I'll stop.

I think it's perfectly on topic. Thanks for sharing!

Petr Viktorin