[docs] [issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

mbiggs report at bugs.python.org
Fri May 3 20:00:17 EDT 2019


New submission from mbiggs <pythonbugs at doubleplum.net>:

The Unicode HOWTO at http://docs.python.org/3.3/howto/unicode.html says the following:


"UTF-8 has several convenient properties:
(...)
2. A Unicode string is turned into a sequence of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes."

This is not right. UTF-8 uses the zero byte to represent the Unicode code point U+0000 (the ASCII NUL character). This is a valid character in UTF-8, and Python's UTF-8 string encoding and decoding handle it just fine.
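
A minimal sketch of the behavior described above, assuming Python 3. The variable names and sample characters are illustrative only and do not come from the HOWTO:

```python
# A string with an embedded U+0000 between two ASCII letters.
text = "a\x00b"

# Encoding to UTF-8 keeps U+0000 as a single zero byte.
encoded = text.encode("utf-8")
contains_zero_byte = 0 in encoded

# Decoding restores the original string, including U+0000.
round_tripped = encoded.decode("utf-8")

# For contrast, a spot check of a few code points other than U+0000:
# none of them produces a zero byte when encoded.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]
others_have_no_zero = all(0 not in ch.encode("utf-8") for ch in samples)

print(encoded, contains_zero_byte, round_tripped == text, others_have_no_zero)
```

So the property quoted from the HOWTO appears to hold only for strings that do not themselves contain U+0000.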

----------
assignee: docs at python
components: Documentation
messages: 341363
nosy: docs at python, mbiggs
priority: normal
severity: normal
status: open
title: Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue36789>
_______________________________________

