2018-01-26 14:43 GMT+01:00 M.-A. Lemburg
If that's indeed being used as an assumption, the docs must be fixed and PyUnicode_New() should verify this assumption as well - not only in debug builds using C assert() :-)
Like PyUnicode_FromStringAndSize(NULL, size), PyUnicode_New(size, maxchar) only allocates memory; the characters are left uninitialized. I don't see how PyUnicode_New() could check the string content, since the content is not known yet...

The new public C APIs added by PEP 393 are hard to use correctly, but they are the most efficient. Functions like PyUnicode_FromString() are simple to use and very hard to misuse :-)

PyPy developers asked me to simply drop all these new public C APIs and make them private, or at least deprecate them. But I never looked in depth at the new API. I don't know if Cython uses it, for example.

Some APIs are still private, like _PyUnicodeWriter, which allows creating a string in multiple steps with a smart strategy to reduce or even avoid realloc() calls and conversions between the different storage types (UCS1, UCS2, UCS4). This API is very efficient, but also hard to use.
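To illustrate why the caller has to know the content before calling PyUnicode_New(), here is a minimal standalone sketch (it does not include Python.h). It assumes the code points are already available in a buffer. The helper names find_max_char() and storage_width() are hypothetical and not part of the CPython API; the width thresholds follow the 1-, 2- and 4-byte storage kinds described in PEP 393.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: first pass over the data, returning the largest
   code point. A real extension would pass this value as the maxchar
   argument of PyUnicode_New(len, maxchar) before writing anything. */
static uint32_t find_max_char(const uint32_t *cps, size_t len)
{
    assert(cps != NULL || len == 0);
    uint32_t maxchar = 0;
    for (size_t i = 0; i < len; i++) {
        if (cps[i] > maxchar) {
            maxchar = cps[i];
        }
    }
    return maxchar;
}

/* Hypothetical helper: bytes per character that a PEP 393 string
   would use for a given maximum code point (UCS1, UCS2 or UCS4). */
static int storage_width(uint32_t maxchar)
{
    if (maxchar < 0x100) {
        return 1;   /* ASCII and Latin-1 */
    }
    if (maxchar < 0x10000) {
        return 2;   /* rest of the Basic Multilingual Plane */
    }
    return 4;       /* everything up to U+10FFFF */
}
```

The point of the two-pass approach is that the allocation only becomes safe once the maximum is known: scan first, allocate with that maxchar, then fill in the characters.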
C extensions can easily create strings using PyUnicode_New() which do not adhere to such a requirement and then write arbitrary content using PyUnicode_WRITE(). In some cases, this may even be necessary, say in case the extension doesn't know what data is being written, reading it from some external source.
It would be a bug in the C extension.
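For an extension that reads data from an external source, one way to avoid this kind of bug is to validate the data against the declared maxchar before writing it. The following is only a standalone sketch of such a check under my reading of the requirement discussed here: content_matches_maxchar() and width_for() are hypothetical names, not CPython functions, and the sketch ignores the separate ASCII versus Latin-1 distinction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes per character for a maximum code point. */
static int width_for(uint32_t ch)
{
    return ch < 0x100 ? 1 : (ch < 0x10000 ? 2 : 4);
}

/* Hypothetical check. Returns 1 only if:
   - every code point is a valid Unicode code point (<= U+10FFFF),
   - no code point exceeds the declared maxchar, and
   - the declared maxchar does not select wider storage than the
     content actually needs.
   Returns 0 otherwise. */
static int content_matches_maxchar(const uint32_t *cps, size_t len,
                                   uint32_t declared_maxchar)
{
    assert(cps != NULL || len == 0);
    uint32_t actual = 0;
    for (size_t i = 0; i < len; i++) {
        if (cps[i] > 0x10FFFF) {
            return 0;
        }
        if (cps[i] > actual) {
            actual = cps[i];
        }
    }
    if (actual > declared_maxchar) {
        return 0;
    }
    return width_for(actual) == width_for(declared_maxchar);
}
```

If the check fails, the extension would have to allocate a new string with the right maxchar rather than write into the existing one.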
I'm not too familiar with the new Unicode code, but it seems that this requirement is not checked everywhere, e.g. the resize code doesn't seem to have such checks either (only in debug versions).
It must be checked everywhere. If that's not the case, it's an obvious bug in CPython. If you spotted a bug, please report it ;-)

Victor