[issue9738] Document the encoding of functions bytes arguments of the C API
New submission from STINNER Victor <victor.stinner@haypocalc.com>:

Many C functions have a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding were always the same, but it is not. Examples:

- format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
- filename of PyParser_ASTFromString() should be encoded as utf-8
- filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with strict error handler, and not surrogateescape)
- 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if PyPARSE_IGNORE_COOKIE flag is set, otherwise the parser checks for #coding:xxx cookie (if there is no cookie, utf-8 is used)

The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.

I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?

----------
assignee: docs@python
components: Documentation, Interpreter Core, Unicode
files: encodings.patch
keywords: patch
messages: 115339
nosy: docs@python, haypo
priority: normal
severity: normal
status: open
title: Document the encoding of functions bytes arguments of the C API
versions: Python 3.2

Added file: http://bugs.python.org/file18705/encodings.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue9738>
_______________________________________
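To illustrate why an undocumented encoding is a problem, here is a minimal sketch in plain C (not taken from the patch, and not using the Python C API). The helper names count_chars_latin1() and count_chars_utf8() are hypothetical. The same char* buffer yields a different number of characters depending on which encoding the callee assumes, so a caller who guesses wrong gets silently different text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of characters if the NUL-terminated buffer is read as ISO-8859-1:
   every byte is exactly one character. */
size_t count_chars_latin1(const char *buf)
{
    assert(buf != NULL);
    size_t n = 0;
    while (buf[n] != '\0')
        n++;
    return n;
}

/* Number of characters if the same buffer is read as UTF-8: continuation
   bytes (bit pattern 10xxxxxx) do not start a new character.
   Assumption: the input is well-formed UTF-8; nothing is validated here. */
size_t count_chars_utf8(const char *buf)
{
    assert(buf != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)buf; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the five bytes "caf\xc3\xa9", the ISO-8859-1 reading gives five characters while the UTF-8 reading gives four, which is the ambiguity the issue asks the documentation to remove.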
Éric Araujo <merwok@netwok.org> added the comment:

I think either of these is correct:
- a UTF-8-encoded string
- a string encoded in UTF-8

----------
nosy: +eric.araujo
Dave Malcolm <dmalcolm@redhat.com> added the comment:
I think either of these is correct: - a UTF-8-encoded string - a string encoded in UTF-8
Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type.

Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

(sorry for bikeshedding)

----------
nosy: +dmalcolm
Terry J. Reedy <tjreedy@udel.edu> added the comment:

Better specifying requirements is good. A few comments:

- The second argument is an error message; it is converted to a string object.
+ The second argument is an error message; it is decoded to a string object
+ with ``'utf-8'`` encoding.

I would write the change as

+ The second argument is a utf-8 encoded error message; it is decoded to a string object.

Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.

-------------------------------
+ a Python exception (class, not an instance). *format* should be a string
+ encoded to ISO-8859-1, containing format codes,

*format* should be ISO-8859-1 encoded bytes containing format codes,

although I am not clear about the implications of that. Are not all format codes ascii chars?

--------------------------------
I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.

----------
nosy: +terry.reedy
STINNER Victor <victor.stinner@haypocalc.com> added the comment:

About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue #9769.
STINNER Victor <victor.stinner@haypocalc.com> added the comment:

#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and c_filename attribute of the compiler (private) structure in Python 3.1.3: use utf-8 in strict mode instead of filesystem encoding with surrogateescape.
Changes by Alexander Belopolsky <belopolsky@users.sourceforge.net>:

----------
nosy: +belopolsky
Dave Malcolm <dmalcolm@redhat.com> added the comment:

A (probably crazy) idea that just occurred to me:

  typedef char utf8_bytes;
  typedef char iso8859_1_bytes;
  typedef char fsenc_bytes;

then specify the encoding in the type signature of the API e.g.:

- int PyRun_SimpleFile(FILE *fp, const char *filename)
+ int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
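As a rough sketch of what the typedef idea gives in practice (plain C, no Python headers): filename_length() and source_length() are hypothetical functions invented for this illustration, while the two typedefs are taken from the proposal. Because a C typedef is only an alias, the names document the expected encoding in the signature, but the compiler does not enforce them.

```c
#include <assert.h>
#include <string.h>

/* Aliases from the proposal: each one only documents the expected encoding. */
typedef char utf8_bytes;
typedef char fsenc_bytes;

/* Hypothetical function that expects a filename in the filesystem encoding. */
size_t filename_length(const fsenc_bytes *filename)
{
    assert(filename != NULL);
    return strlen(filename);
}

/* Hypothetical function that expects UTF-8 encoded source text. */
size_t source_length(const utf8_bytes *source)
{
    assert(source != NULL);
    return strlen(source);
}
```

Passing a `const utf8_bytes *` to filename_length() compiles without any diagnostic, since both aliases are plain char. Compiler-checked separation would need distinct types (for example, one struct per encoding), which is a different and more invasive design than the one proposed here.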
Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
A (probably crazy) idea that just occurred to me: typedef char utf8_bytes; typedef char iso8859_1_bytes; typedef char fsenc_bytes;
I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyways.) There are a few places where we'll need ascii_bytes.

The added benefit is that we can make these typedefs unsigned char and avoid char signedness being ambiguous.

We will also need to give the typedefs the Py_ prefix.

And an obligatory bikeshedding comment: if we typedef char, we should use singular form. Or we can typedef char* Py_utf8_bytes.
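The signedness point can be sketched as follows (plain C again). The name Py_utf8_byte is hypothetical: it follows the naming suggestions in this thread and is not an existing CPython type, and is_ascii() is likewise invented for this illustration. With an unsigned element type, a comparison against 0x80 behaves the same on every platform, whereas with plain char it would never be true where char is signed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical alias (not a real CPython type): unsigned, singular form. */
typedef unsigned char Py_utf8_byte;

/* Return 1 if every byte of the NUL-terminated buffer is ASCII, else 0.
   With plain char, (*buf >= 0x80) is always false on platforms where
   char is signed, because such bytes are read as negative values. */
int is_ascii(const Py_utf8_byte *buf)
{
    assert(buf != NULL);
    for (; *buf != '\0'; buf++) {
        if (*buf >= 0x80)
            return 0;
    }
    return 1;
}
```

One trade-off of this choice: string literals have type char[], so callers would need a cast such as `is_ascii((const Py_utf8_byte *)"abc")`.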
STINNER Victor <victor.stinner@haypocalc.com> added the comment:

r87504 documents encodings of error functions.
r87505 documents encodings of unicode functions.
r87506 documents encodings of AST, compiler, parser and PyRun functions.
STINNER Victor <victor.stinner@haypocalc.com> added the comment:

While documenting encodings, I found two issues: #10778 and #10779.
Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

Victor,

Here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if encoding fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
STINNER Victor <victor.stinner@haypocalc.com> added the comment:
Here is an interesting case for your collection: PyDict_GetItemString.
It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.

I documented many functions, directly in the header files, and sometimes also in the reST documentation. I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.

----------
resolution:  -> fixed
status: open -> closed
participants (5)
- Alexander Belopolsky
- Dave Malcolm
- STINNER Victor
- Terry J. Reedy
- Éric Araujo