[docs] Document that PyUnicode_AsUTF8() returns a null-terminated string (issue 23088)
vadmium+py at gmail.com
Sun Mar 22 05:50:13 CET 2015
Reviewers: storchaka,
https://bugs.python.org/review/23088/diff/14165/Doc/c-api/bytes.rst
File Doc/c-api/bytes.rst (right):
https://bugs.python.org/review/23088/diff/14165/Doc/c-api/bytes.rst#newcode158
Doc/c-api/bytes.rst:158: If *length* is *NULL*, the string may not
contain embedded null characters;
On 2015/03/21 09:12:16, storchaka wrote:
> I think it is better to avoid words "string" (except may be a combination
> "byte string") and "character" when say about the content of a bytes object.
Agreed.
> This documentation is mainly a copy of the documentation of Python 2 strings,
> and contains outdated and incorrect wording. It would be good to fix this, but
> may be in other issue.
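For anyone following the embedded-null point above, here is a minimal standalone C sketch of why the extra terminating null byte and embedded null bytes are separate concerns. It deliberately uses no Python headers: `payload`, `payload_len`, `c_string_length` and `has_embedded_null` are hypothetical names made up for illustration, and `payload` is only a stand-in for the kind of buffer a function such as PyBytes_AsString() points to.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bytes object's internal buffer: three payload
 * bytes ('a', 0, 'b') followed by the extra terminating null byte. */
static const char payload[] = { 'a', '\0', 'b', '\0' };
static const size_t payload_len = 3;  /* excludes the extra terminator */

/* Length as seen by ordinary C string functions: counting stops at the
 * first null byte, whether it is embedded or the terminator. */
size_t c_string_length(const char *buf)
{
    return strlen(buf);
}

/* Return 1 if the first len bytes contain an embedded null byte, else 0.
 * A caller that does not receive the real length cannot detect this case. */
int has_embedded_null(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

With this buffer, `c_string_length(payload)` sees only the first byte even though `payload_len` is 3, which is the truncation the documentation warns about. The terminator at `payload[payload_len]` is still present regardless.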
https://bugs.python.org/review/23088/diff/14165/Doc/c-api/unicode.rst
File Doc/c-api/unicode.rst (right):
https://bugs.python.org/review/23088/diff/14165/Doc/c-api/unicode.rst#newcode233
Doc/c-api/unicode.rst:233: The *o* argument has to be a Unicode object
(not checked).
On 2015/03/21 09:12:16, storchaka wrote:
> argument or parameter? What is correct?
I _think_ “argument” is slightly more correct, unless this is
inconsistent with the surrounding documentation. It is subtle but I
would tend to think of “the o parameter” as meaning the variable name or
a generic placeholder that holds the argument.
Please review this at https://bugs.python.org/review/23088/
Affected files:
Doc/c-api/bytearray.rst
Doc/c-api/bytes.rst
Doc/c-api/unicode.rst
# HG changeset patch
# Parent 3de678cd184d943f53e9bc0e74feefaa07cc7f55
Document that the UTF-8 representation is null-terminated
diff -r 3de678cd184d Doc/c-api/bytearray.rst
--- a/Doc/c-api/bytearray.rst Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/bytearray.rst Thu Mar 12 00:39:46 2015 +0000
@@ -64,7 +64,8 @@
.. c:function:: char* PyByteArray_AsString(PyObject *bytearray)
Return the contents of *bytearray* as a char array after checking for a
- *NULL* pointer.
+ *NULL* pointer. The returned array always has an extra
+ null byte appended, even when the array already contains null bytes.
.. c:function:: int PyByteArray_Resize(PyObject *bytearray, Py_ssize_t len)
diff -r 3de678cd184d Doc/c-api/bytes.rst
--- a/Doc/c-api/bytes.rst Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/bytes.rst Thu Mar 12 00:39:46 2015 +0000
@@ -136,8 +136,9 @@
.. c:function:: char* PyBytes_AsString(PyObject *o)
- Return a NUL-terminated representation of the contents of *o*. The pointer
- refers to the internal buffer of *o*, not a copy. The data must not be
+ Return the contents of *o*. The pointer refers to the internal
+ buffer of *o*, which is always terminated with an extra null byte,
+ even when the string already contains null bytes. The data must not be
modified in any way, unless the string was just created using
``PyBytes_FromStringAndSize(NULL, size)``. It must not be deallocated. If
*o* is not a string object at all, :c:func:`PyBytes_AsString` returns *NULL*
@@ -151,10 +152,10 @@
.. c:function:: int PyBytes_AsStringAndSize(PyObject *obj, char **buffer, Py_ssize_t *length)
- Return a NUL-terminated representation of the contents of the object *obj*
+ Return a null-terminated representation of the contents of the object *obj*
through the output variables *buffer* and *length*.
- If *length* is *NULL*, the resulting buffer may not contain NUL characters;
+ If *length* is *NULL*, the string may not contain embedded null characters;
if it does, the function returns ``-1`` and a :exc:`TypeError` is raised.
The buffer refers to an internal string buffer of *obj*, not a copy. The data
diff -r 3de678cd184d Doc/c-api/unicode.rst
--- a/Doc/c-api/unicode.rst Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/unicode.rst Thu Mar 12 00:39:46 2015 +0000
@@ -226,9 +226,11 @@
.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
const char* PyUnicode_AS_DATA(PyObject *o)
- Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
- ``AS_DATA`` form casts the pointer to :c:type:`const char *`. *o* has to be
- a Unicode object (not checked).
+ Return a pointer to a :c:type:`Py_UNICODE` representation of the object.
+ The returned buffer is always terminated with an extra null character,
+ even when the string already contains null characters.
+ The ``AS_DATA`` form casts the pointer to :c:type:`const char *`.
+ The *o* argument has to be a Unicode object (not checked).
.. versionchanged:: 3.3
This macro is now inefficient -- because in many cases the
@@ -650,7 +652,9 @@
Copy the string *u* into a new UCS4 buffer that is allocated using
:c:func:`PyMem_Malloc`. If this fails, *NULL* is returned with a
- :exc:`MemoryError` set.
+ :exc:`MemoryError` set. The returned buffer always has an extra
+ null character appended, even if the string already contains
+ null characters.
.. versionadded:: 3.3
@@ -689,7 +693,8 @@
Return a read-only pointer to the Unicode object's internal
:c:type:`Py_UNICODE` buffer, or *NULL* on error. This will create the
:c:type:`Py_UNICODE*` representation of the object if it is not yet
- available. Note that the resulting :c:type:`Py_UNICODE` string may contain
+ available. The buffer is always terminated with an extra null character.
+ Note that the resulting :c:type:`Py_UNICODE` string may also contain
embedded null characters, which would cause the string to be truncated when
used in most C functions.
@@ -708,7 +713,8 @@
.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE`
- array length in *size*. Note that the resulting :c:type:`Py_UNICODE*` string
+ array length (excluding the extra null terminator) in *size*.
+ Note that the resulting :c:type:`Py_UNICODE*` string
may contain embedded null characters, which would cause the string to be
truncated when used in most C functions.
@@ -717,7 +723,7 @@
.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
- Create a copy of a Unicode string ending with a nul character. Return *NULL*
+ Create a copy of a Unicode string ending with a null character. Return *NULL*
and raise a :exc:`MemoryError` exception on memory allocation failure,
otherwise return a new allocated buffer (use :c:func:`PyMem_Free` to free
the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
@@ -902,10 +908,10 @@
Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
*size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
- 0-termination character). Return the number of :c:type:`wchar_t` characters
+ null termination character). Return the number of :c:type:`wchar_t` characters
copied or -1 in case of an error. Note that the resulting :c:type:`wchar_t*`
- string may or may not be 0-terminated. It is the responsibility of the caller
- to make sure that the :c:type:`wchar_t*` string is 0-terminated in case this is
+ string may or may not be null-terminated. It is the responsibility of the caller
+ to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
required by the application. Also, note that the :c:type:`wchar_t*` string
might contain null characters, which would cause the string to be truncated
when used with most C functions.
@@ -914,9 +920,9 @@
.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
Convert the Unicode object to a wide character string. The output string
- always ends with a nul character. If *size* is not *NULL*, write the number
- of wide characters (excluding the trailing 0-termination character) into
- *\*size*.
+ always ends with a null character. If *size* is not *NULL*, write the number
+ of wide characters (excluding the trailing null termination character)
+ into *\*size*.
Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
:c:func:`PyMem_Free` to free it) on success. On error, returns *NULL*,
@@ -1045,9 +1051,11 @@
.. c:function:: char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
- Return a pointer to the default encoding (UTF-8) of the Unicode object, and
- store the size of the encoded representation (in bytes) in *size*. *size*
- can be *NULL*, in this case no size will be stored.
+ Return a pointer to the UTF-8 encoding of the Unicode object, and
+ store the size of the encoded representation (in bytes) in *size*. The
+ *size* argument can be *NULL*; in this case no size will be stored. The
+ returned buffer always has an extra null byte appended (not included in
+ *size*), even if the string already contains null characters.
In the case of an error, *NULL* is returned with an exception set and no
*size* is stored.