New GitHub issue #111089 from vstinner:<br>
<hr>
<pre>
I propose to change the `PyUnicode_AsUTF8()` API to raise an exception and return NULL if the string contains embedded null characters.
If the string contains an embedded null character, the UTF-8 encoded string is truncated when it is passed to C functions that take a `char*`, since they treat the first null byte as the end of the string. Truncating a string **silently** is bad practice and can lead to bugs, including security vulnerabilities.
In practice, the minority of affected C extensions and users should **benefit** from this backward-incompatible change, because they would get an exception instead of a silently truncated string. Users who want to truncate **on purpose** can call `PyUnicode_AsUTF8AndSize(obj, NULL)` and just ignore the size.
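
To illustrate the truncation, here is a minimal sketch in Python that calls `PyUnicode_AsUTF8AndSize()` through `ctypes.pythonapi`. It assumes CPython with the C API symbols exported; `utf8_view()` and `has_embedded_null()` are hypothetical helper names used only for this example, not part of any API.

```python
import ctypes

# Bind the real C function through ctypes (CPython only).
as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8_and_size.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
# c_char_p makes ctypes read the result the way a char* consumer would:
# it stops at the first null byte.
as_utf8_and_size.restype = ctypes.c_char_p


def utf8_view(text):
    """Return (bytes seen by a char* consumer, real UTF-8 size in bytes)."""
    size = ctypes.c_ssize_t()
    c_string = as_utf8_and_size(text, ctypes.byref(size))
    return c_string, size.value


def has_embedded_null(text):
    """Hypothetical helper: True if a char* consumer would see a truncated string."""
    c_string, size = utf8_view(text)
    return len(c_string) != size


print(utf8_view("abc\0def"))          # the char* view stops at "abc", the size is 7
print(has_embedded_null("abc\0def"))  # True
print(has_embedded_null("abcdef"))    # False
```

Comparing the length of the C string with the reported size is the kind of check that callers of `PyUnicode_AsUTF8()` cannot do today, since that function does not report the size.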
The change would address the following "hidden" comment on PyUnicode_AsUTF8():
> Use of this API is **DEPRECATED** since no size information can be
> extracted from the returned data.
PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.
In the recently added PyUnicode_EqualToUTF8(obj, str), *str* is treated as not equal if *obj* contains embedded null characters.
The following functions already raise an exception if the string contains embedded null characters or bytes:
* PyUnicode_AsWideCharString()
* PyUnicode_EncodeLocale()
* PyUnicode_EncodeFSDefault()
* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
* PyUnicode_DecodeFSDefaultAndSize()
* PyUnicode_FSConverter()
* PyUnicode_FSDecoder()
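
The effect of these checks is visible from Python code. The sketch below relies on my understanding that CPython's `open()` converts a `str` filename with `PyUnicode_FSConverter()` (or `PyUnicode_AsWideCharString()` on Windows); that routing is an assumption about the implementation, and `open_rejects_embedded_null()` is a hypothetical helper name.

```python
def open_rejects_embedded_null(path):
    """Return True if open() refuses the path before it reaches the OS."""
    try:
        with open(path):
            pass
    except ValueError:
        # Raised while converting the filename, e.g. "embedded null byte".
        return True
    except OSError:
        # The filename was converted and the OS rejected it (e.g. not found).
        return False
    return False


print(open_rejects_embedded_null("spam\0eggs.txt"))          # True
print(open_rejects_embedded_null("no-such-dir/no-such-file"))  # False
```

The first call fails with `ValueError` instead of silently opening a file named `spam`, which is the behavior proposed here for `PyUnicode_AsUTF8()`.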
PyUnicode_AsUTF8String() returns a bytes object, which carries its own length, so it doesn't raise the exception.
PyUnicode_AsUTF8AndSize() also returns the size, so it doesn't raise on embedded null characters either.
</pre>
<hr>
<a href="https://github.com/python/cpython/issues/111089">View on GitHub</a>
<p>Labels: topic-C-API</p>
<p>Assignee: </p>