On 10. 06. 21 17:04, Antoine Pitrou wrote:
Separately from the builder API, there are cases where the data already exists somewhere as a full-blown UTF8 string (this is of course more and more common, since UTF8 is ubiquitous). There should be a fast conversion method from a UTF8 memory area to a unicode object.
Indeed. To me it seems that the analysis misses the importance of UTF-8.
CPython strings currently always have the {1,2,4}-byte "raw buffer"
representation. The UTF8 representation is computed when needed, and
*permanently stored* in the str object. This detail leaks to the API:
notice how PyUnicode_AsUTF8AndSize gives you a const char* (tied to the
lifetime of the string object), while other codecs can only give you
PyBytes.
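
To make the caching visible, here is a minimal sketch that calls PyUnicode_AsUTF8AndSize through ctypes.pythonapi instead of from a C extension (an assumption made only so it can be run from a plain interpreter). The helper name utf8_view is hypothetical, and the behaviour shown is a CPython implementation detail rather than a language guarantee.

```python
import ctypes

# Declare the C API signature. ctypes.pythonapi keeps the GIL held for the call.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_void_p  # raw address of the const char* buffer


def utf8_view(s):
    """Return (address, size) of the UTF-8 buffer stored inside `s`."""
    size = ctypes.c_ssize_t()
    addr = as_utf8(s, ctypes.byref(size))
    return addr, size.value


text = "naïve café"  # non-ASCII, so the UTF-8 form has to be computed

first = utf8_view(text)
second = utf8_view(text)

# Both calls hand back the same buffer, which is owned by the str object.
same_buffer = first == second

# Copy the bytes out while `text` is still alive; the pointer dies with it.
cached_bytes = ctypes.string_at(first[0], first[1])

# The generic codec path builds a separate bytes object on every call.
fresh_objects = text.encode("utf-8") is not text.encode("utf-8")
```

The address is only valid while text is alive, which is exactly the lifetime coupling described above; the encode() route has no such coupling because each result is an independent object.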
I can imagine a future where it could also go the other way: a string could only have a UTF8 representation stored at the start, and the {1,2,4}-byte one would only appear when needed (indexing/slicing, getting the length, etc.).
The analysis says that PyUnicode_FromStringAndSize is deprecated. Please check PEP 623 again: only the special case of calling it with NULL is deprecated. The function, which decodes from UTF-8, is very useful.
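
For reference, a small sketch of the non-deprecated use: passing a real UTF-8 buffer and a byte length. It again goes through ctypes.pythonapi rather than C (an assumption for ease of running), and the variable names are illustrative only.

```python
import ctypes

# Declare the C API signature: (const char *u, Py_ssize_t size) -> PyObject *
from_utf8 = ctypes.pythonapi.PyUnicode_FromStringAndSize
from_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
from_utf8.restype = ctypes.py_object  # ctypes takes over the new reference

raw = "héllo wörld".encode("utf-8")  # stands in for an existing UTF-8 memory area

# Decode the whole buffer. The size is a byte count, not a character count.
decoded = from_utf8(raw, len(raw))

# Decode only a prefix: "héllo" occupies 6 bytes because "é" takes two.
prefix = from_utf8(raw, 6)
```

Because the size is given explicitly, the buffer does not need to be NUL-terminated, which is what makes the function convenient for data that already lives somewhere as UTF-8.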
On 10. 06. 21 16:48, Antonio Cuni wrote:
Also, if you think that this kind of mail is off-topic in this ML, please let me know and I'll stop.
I think it's perfectly on topic. Thanks for sharing!