[docs] Document that PyUnicode_AsUTF8() returns a null-terminated string (issue 23088)

Sun Mar 22 05:50:13 CET 2015

Reviewers: storchaka,


https://bugs.python.org/review/23088/diff/14165/Doc/c-api/bytes.rst
File Doc/c-api/bytes.rst (right):

https://bugs.python.org/review/23088/diff/14165/Doc/c-api/bytes.rst#newcode158
Doc/c-api/bytes.rst:158: If *length* is *NULL*, the string may not contain embedded null characters;
On 2015/03/21 09:12:16, storchaka wrote:
> I think it is better to avoid the words "string" (except maybe in the
> combination "byte string") and "character" when talking about the content
> of a bytes object.

Agreed.

> This documentation is mainly a copy of the documentation of Python 2 strings,
> and contains outdated and incorrect wording. It would be good to fix this, but
> maybe in another issue.

https://bugs.python.org/review/23088/diff/14165/Doc/c-api/unicode.rst
File Doc/c-api/unicode.rst (right):

https://bugs.python.org/review/23088/diff/14165/Doc/c-api/unicode.rst#newcode233
Doc/c-api/unicode.rst:233: The *o* argument has to be a Unicode object (not checked).
On 2015/03/21 09:12:16, storchaka wrote:
> argument or parameter? What is correct?

I _think_ “argument” is slightly more correct, unless this is
inconsistent with the surrounding documentation. It is subtle but I
would tend to think of “the o parameter” as meaning the variable name or
a generic placeholder that holds the argument.
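
To make the distinction concrete, here is a minimal Python sketch. The
function name describe is made up for this illustration and is not part of
the patch or of the C API.

```python
def describe(o):
    # "o" is the parameter: the placeholder name declared in the signature.
    return type(o).__name__


# "hello" is the argument: the actual value supplied at the call site.
result = describe("hello")
```

Read this way, "the *o* argument has to be a Unicode object" constrains the
value the caller passes in, which seems to be what the sentence in the docs
intends.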



Please review this at https://bugs.python.org/review/23088/

Affected files:
  Doc/c-api/bytearray.rst
  Doc/c-api/bytes.rst
  Doc/c-api/unicode.rst


# HG changeset patch
# Parent 3de678cd184d943f53e9bc0e74feefaa07cc7f55
Document that the UTF-8 representation is null-terminated

diff -r 3de678cd184d Doc/c-api/bytearray.rst

--- a/Doc/c-api/bytearray.rst	Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/bytearray.rst	Thu Mar 12 00:39:46 2015 +0000
@@ -64,7 +64,8 @@
 .. c:function:: char* PyByteArray_AsString(PyObject *bytearray)
 
    Return the contents of *bytearray* as a char array after checking for a
-   *NULL* pointer.
+   *NULL* pointer.  The returned array always has an extra
+   null byte appended, even when the array already contains null bytes.
 
 
 .. c:function:: int PyByteArray_Resize(PyObject *bytearray, Py_ssize_t len)
diff -r 3de678cd184d Doc/c-api/bytes.rst
--- a/Doc/c-api/bytes.rst	Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/bytes.rst	Thu Mar 12 00:39:46 2015 +0000
@@ -136,8 +136,9 @@
 
 .. c:function:: char* PyBytes_AsString(PyObject *o)
 
-   Return a NUL-terminated representation of the contents of *o*.  The pointer
-   refers to the internal buffer of *o*, not a copy.  The data must not be
+   Return the contents of *o*.  The pointer refers to the internal
+   buffer of *o*, which is always terminated with an extra null byte,
+   even when the string already contains null bytes.  The data must not be
    modified in any way, unless the string was just created using
    ``PyBytes_FromStringAndSize(NULL, size)``. It must not be deallocated.  If
    *o* is not a string object at all, :c:func:`PyBytes_AsString` returns *NULL*
@@ -151,10 +152,10 @@
 
 .. c:function:: int PyBytes_AsStringAndSize(PyObject *obj, char **buffer, Py_ssize_t *length)
 
-   Return a NUL-terminated representation of the contents of the object *obj*
+   Return a null-terminated representation of the contents of the object *obj*
    through the output variables *buffer* and *length*.
 
-   If *length* is *NULL*, the resulting buffer may not contain NUL characters;
+   If *length* is *NULL*, the string may not contain embedded null characters;
    if it does, the function returns ``-1`` and a :exc:`TypeError` is raised.
 
    The buffer refers to an internal string buffer of *obj*, not a copy. The data
diff -r 3de678cd184d Doc/c-api/unicode.rst
--- a/Doc/c-api/unicode.rst	Thu Dec 18 23:47:55 2014 +0100
+++ b/Doc/c-api/unicode.rst	Thu Mar 12 00:39:46 2015 +0000
@@ -226,9 +226,11 @@
 .. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                 const char* PyUnicode_AS_DATA(PyObject *o)
 
-   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
-   ``AS_DATA`` form casts the pointer to :c:type:`const char *`.  *o* has to be
-   a Unicode object (not checked).
+   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.
+   The returned buffer is always terminated with an extra null character,
+   even when the string already contains null characters.
+   The ``AS_DATA`` form casts the pointer to :c:type:`const char *`.
+   The *o* argument has to be a Unicode object (not checked).
 
    .. versionchanged:: 3.3
       This macro is now inefficient -- because in many cases the
@@ -650,7 +652,9 @@
 
    Copy the string *u* into a new UCS4 buffer that is allocated using
    :c:func:`PyMem_Malloc`.  If this fails, *NULL* is returned with a
-   :exc:`MemoryError` set.
+   :exc:`MemoryError` set.  The returned buffer always has an extra
+   null character appended, even if the string already contains
+   null characters.
 
    .. versionadded:: 3.3
 
@@ -689,7 +693,8 @@
    Return a read-only pointer to the Unicode object's internal
    :c:type:`Py_UNICODE` buffer, or *NULL* on error. This will create the
    :c:type:`Py_UNICODE*` representation of the object if it is not yet
-   available. Note that the resulting :c:type:`Py_UNICODE` string may contain
+   available. The buffer is always terminated with an extra null character.
+   Note that the resulting :c:type:`Py_UNICODE` string may also contain
    embedded null characters, which would cause the string to be truncated when
    used in most C functions.
 
@@ -708,7 +713,8 @@
 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
 
    Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE`
-   array length in *size*. Note that the resulting :c:type:`Py_UNICODE*` string
+   array length (excluding the extra null terminator) in *size*.
+   Note that the resulting :c:type:`Py_UNICODE*` string
    may contain embedded null characters, which would cause the string to be
    truncated when used in most C functions.
 
@@ -717,7 +723,7 @@
 
 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
 
-   Create a copy of a Unicode string ending with a nul character. Return *NULL*
+   Create a copy of a Unicode string ending with a null character. Return *NULL*
    and raise a :exc:`MemoryError` exception on memory allocation failure,
    otherwise return a new allocated buffer (use :c:func:`PyMem_Free` to free
    the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
@@ -902,10 +908,10 @@
 
    Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
    *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
-   0-termination character).  Return the number of :c:type:`wchar_t` characters
+   null termination character).  Return the number of :c:type:`wchar_t` characters
    copied or -1 in case of an error.  Note that the resulting :c:type:`wchar_t*`
-   string may or may not be 0-terminated.  It is the responsibility of the caller
-   to make sure that the :c:type:`wchar_t*` string is 0-terminated in case this is
+   string may or may not be null-terminated.  It is the responsibility of the caller
+   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
    required by the application. Also, note that the :c:type:`wchar_t*` string
    might contain null characters, which would cause the string to be truncated
    when used with most C functions.
@@ -914,9 +920,9 @@
 .. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
 
    Convert the Unicode object to a wide character string. The output string
-   always ends with a nul character. If *size* is not *NULL*, write the number
-   of wide characters (excluding the trailing 0-termination character) into
-   *\*size*.
+   always ends with a null character. If *size* is not *NULL*, write the number
+   of wide characters (excluding the trailing null termination character)
+   into *\*size*.
 
    Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
    :c:func:`PyMem_Free` to free it) on success. On error, returns *NULL*,
@@ -1045,9 +1051,11 @@
 
 .. c:function:: char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
 
-   Return a pointer to the default encoding (UTF-8) of the Unicode object, and
-   store the size of the encoded representation (in bytes) in *size*.  *size*
-   can be *NULL*, in this case no size will be stored.
+   Return a pointer to the UTF-8 encoding of the Unicode object, and
+   store the size of the encoded representation (in bytes) in *size*.  The
+   *size* argument can be *NULL*; in this case no size will be stored.  The
+   returned buffer always has an extra null byte appended (not included in
+   *size*), even if the string already contains null characters.
 
    In the case of an error, *NULL* is returned with an exception set and no
    *size* is stored.
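
The terminator behaviour described for PyUnicode_AsUTF8AndSize() above can be
checked from Python. This is only a sketch, assuming CPython 3.3 or later with
ctypes.pythonapi available; the helper name utf8_with_terminator is
hypothetical and not part of the patch.

```python
import ctypes

# Bind the C function through ctypes.pythonapi (CPython only).
as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8_and_size.argtypes = [ctypes.py_object,
                             ctypes.POINTER(ctypes.c_ssize_t)]
# POINTER(c_char) instead of c_char_p, so that ctypes does not stop
# reading at the first null byte.
as_utf8_and_size.restype = ctypes.POINTER(ctypes.c_char)


def utf8_with_terminator(text):
    """Return (size, raw), where raw is size + 1 bytes read from the buffer."""
    size = ctypes.c_ssize_t()
    buf = as_utf8_and_size(text, ctypes.byref(size))
    # Reading one byte past *size* relies on the documented extra null byte.
    raw = buf[:size.value + 1]
    return size.value, raw


# The sample contains an embedded null character between "a" and "b".
size, raw = utf8_with_terminator("a\0b")
```

With this sample, size counts the embedded null but not the terminator, and
raw ends with one additional null byte. A C caller that uses strlen() on the
same buffer would only see the first byte, which is why the stored size
matters when embedded nulls are possible.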