[issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes
New submission from mbiggs <pythonbugs@doubleplum.net>:

In the Unicode HOWTO: http://docs.python.org/3.3/howto/unicode.html

It says the following:

"UTF-8 has several convenient properties:
(...)
2. A Unicode string is turned into a sequence of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes."

This is not right. UTF-8 uses the zero byte to represent the Unicode codepoint U+0000 (the ASCII NULL character). This is a valid character in UTF-8 and is handled just fine by Python's UTF-8 string encoding/decoding.

----------
assignee: docs@python
components: Documentation
messages: 341363
nosy: docs@python, mbiggs
priority: normal
severity: normal
status: open
title: Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue36789>
_______________________________________
Andrew Svetlov <andrew.svetlov@gmail.com> added the comment:

This is right in 99.99% of cases: UTF-8 doesn't use zero bytes to encode any character except an explicit U+0000. UTF-16, for example, encodes 'a' as b'\xff\xfea\x00'.

----------
nosy: +asvetlov
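The distinction can be checked directly in Python. This is a minimal sketch, assuming Python 3; the sample strings are chosen only for illustration, and "utf-16-le" is used instead of "utf-16" so that the output does not depend on the platform's byte order or include a BOM.

```python
# Encode a few sample strings and report whether the result contains zero bytes.
samples = ["a", "héllo", "a\x00b"]

for text in samples:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # explicit byte order, no BOM
    print(repr(text), utf8, b"\x00" in utf8, utf16, b"\x00" in utf16)

# U+0000 is the only code point whose UTF-8 encoding contains a zero byte.
nul_only = "\x00".encode("utf-8")

# The zero byte survives a round trip through Python's UTF-8 codec.
roundtrip = "a\x00b".encode("utf-8").decode("utf-8")
```

Only the third sample produces a zero byte under UTF-8, while every sample does under UTF-16.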
mbiggs <pythonbugs@doubleplum.net> added the comment:

So a correct statement would be "A UTF-8 string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the NULL character (U+0000)."

I think it's important to correct this because the part about processing UTF-8 with C functions like strcpy() was wrong and could cause bugs.

----------
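The kind of bug in question can be sketched from Python with ctypes, which exposes both the raw buffer and the NUL-terminated view that C string functions such as strcpy() work with. This is an illustration only, assuming Python 3; the sample string is made up.

```python
import ctypes

# A string with an embedded U+0000 encodes to 7 bytes, one of them zero.
encoded = "abc\x00def".encode("utf-8")

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(encoded)

full = buf.raw      # every byte in the buffer, including the terminator
c_view = buf.value  # the bytes up to the first zero byte, as C would read them

print(full)
print(c_view)
```

Everything after the embedded zero byte is invisible in the NUL-terminated view, so code that copies such data with C string functions would silently drop it.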
Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

I agree that the documentation should be updated. Would you mind creating a pull request, mbiggs?

There are UTF-8 variants which guarantee that the encoded text has no zero bytes (see Modified UTF-8), but Python only provides the standard UTF-8 and UTF-8 with BOM.

----------
keywords: +easy
nosy: +serhiy.storchaka
stage: -> needs patch
versions: -Python 3.5, Python 3.6
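For illustration, the zero-byte part of Modified UTF-8 can be approximated on top of the standard codec. This is a rough sketch, not a complete Modified UTF-8 implementation: it only rewrites U+0000 as the two-byte sequence 0xC0 0x80 and does not apply the variant's different handling of supplementary characters. The helper names encode_nul_free and decode_nul_free are hypothetical.

```python
def encode_nul_free(text: str) -> bytes:
    """Hypothetical helper: standard UTF-8, but U+0000 is written as C0 80."""
    # In standard UTF-8 a zero byte can only come from U+0000,
    # so a plain byte replacement is sufficient here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Hypothetical helper: inverse of encode_nul_free."""
    # The byte 0xC0 never occurs in valid standard UTF-8,
    # so the replacement cannot hit a legitimate sequence.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_free("a\x00b")
decoded = decode_nul_free(encoded)

# Python's standard UTF-8 decoder rejects the two-byte form of U+0000.
try:
    b"\xc0\x80".decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

The result contains no zero bytes, but it is no longer valid standard UTF-8, which is why it cannot be read back with Python's built-in "utf-8" codec.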
Josh Rosenberg <shadowranger+python@gmail.com> added the comment:

Minor bikeshed: If updating the documentation, refer to U+0000 as "the null character" or "NUL", not "NULL". Using "NULL" allows for confusion with NULL pointers; "the null character" (the name used in the Unicode standard) or "NUL" (the official three-letter abbreviation in ASCII, and in Unicode too, I think) has no such opportunity for confusion.

----------
nosy: +josh.r
Change by Ezio Melotti <ezio.melotti@gmail.com>:

----------
nosy: +ezio.melotti
type: -> enhancement
Change by redshiftzero <jen@redshiftzero.com>:

----------
keywords: +patch
pull_requests: +13026
stage: needs patch -> patch review
Change by mbiggs <pythonbugs@doubleplum.net>:

----------
pull_requests: +13102
mbiggs <pythonbugs@doubleplum.net> added the comment:

Ah, I sent a pull request but didn't realize that redshiftzero already had. Their PR looks good to me. Thanks for fixing this!

----------
Change by miss-islington <mariatta.wijaya+miss-islington@gmail.com>:

----------
pull_requests: +13294
Change by Cheryl Sabella <cheryl.sabella@gmail.com>:

----------
resolution: -> fixed
stage: patch review -> resolved
status: open -> closed
participants (8)
- Andrew Svetlov
- Cheryl Sabella
- Ezio Melotti
- Josh Rosenberg
- mbiggs
- miss-islington
- redshiftzero
- Serhiy Storchaka