New GitHub issue #111089 from vstinner:<br>

<hr>

<pre>

I propose to change the `PyUnicode_AsUTF8()` API to raise an exception and return NULL if the string contains embedded null characters.

If the string contains an embedded null character, the UTF-8 encoded string can be truncated when used with C functions that take `char*`, since a null byte is treated as the terminator marking the end of the string. Truncating a string **silently** is bad practice and can lead to various bugs, including security vulnerabilities.
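
A minimal sketch of that silent truncation in plain C, independent of the Python C API. The helper name `copy_c_string()` is hypothetical and only stands in for a typical `char*` consumer:

```c
#include "assert.h"
#include "string.h"

/* Hypothetical helper (not part of the C API): copy `src` into `dst` the way
   a typical char* consumer does, stopping at the first null byte.
   Returns the number of bytes copied, excluding the terminator. */
size_t copy_c_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size != 0);
    size_t n = strlen(src);      /* stops at the first null byte */
    if (n >= dst_size) {
        n = dst_size - 1;        /* keep room for the terminator */
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

For example, copying the 10-byte buffer `"user\0admin"` with this helper yields `"user"` and a length of 4: the bytes after the null are dropped without any error being reported.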

In practice, the minority of impacted C extensions and impacted users should **benefit** from such a backward-incompatible change, since truncating a string **silently** is bad practice. Impacted users can use `PyUnicode_AsUTF8AndSize(obj, NULL)` and just ignore the size if they want to truncate **on purpose**.

It would address the following "hidden" comment on PyUnicode_AsUTF8():

> Use of this API is **DEPRECATED** since no size information can be extracted from the returned data.

PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.

In the recently added PyUnicode_EqualToUTF8(obj, str), *str* is treated as not equal if *obj* contains embedded null characters.

The following functions already raise an exception if the string contains embedded null characters or bytes:

* PyUnicode_AsWideCharString()

* PyUnicode_EncodeLocale()

* PyUnicode_EncodeFSDefault()

* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()

* PyUnicode_DecodeFSDefaultAndSize()

* PyUnicode_FSConverter()

* PyUnicode_FSDecoder()

PyUnicode_AsUTF8String() returns a bytes object, which carries its own length, so it doesn't raise an exception.

PyUnicode_AsUTF8AndSize() also returns the size, so it doesn't raise on embedded null characters.
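
When the size is known, a caller can detect embedded null bytes itself before handing the buffer to a `char*` API. A minimal sketch in plain C, assuming a hypothetical helper `has_embedded_null()` that receives a pointer and a size such as those returned by PyUnicode_AsUTF8AndSize():

```c
#include "assert.h"
#include "string.h"

/* Hypothetical helper (not part of the C API): `utf8` points to `size`
   bytes of encoded data. Returns 1 if a null byte appears within those
   bytes, 0 otherwise. */
int has_embedded_null(const char *utf8, size_t size)
{
    assert(utf8 != NULL);
    /* memchr() scans exactly `size` bytes, so it does not stop early
       the way strlen() would. */
    return memchr(utf8, '\0', size) != NULL;
}
```

With this helper, the 7-byte buffer `"abc\0def"` is reported as containing an embedded null, while the 6-byte buffer `"abcdef"` is not.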

</pre>

<hr>

<a href="https://github.com/python/cpython/issues/111089">View on GitHub</a>

<p>Labels: topic-C-API</p>

<p>Assignee: </p>