[Python-checkins] cpython (2.7): From RFC 3629 5- and 6-bytes UTF-8 sequences are invalid, so remove them from

ezio.melotti python-checkins at python.org
Thu Sep 1 07:19:25 CEST 2011


http://hg.python.org/cpython/rev/4d584ebbfa77
changeset:   72162:4d584ebbfa77
branch:      2.7
parent:      72153:4dcbae65df3f
user:        Ezio Melotti <ezio.melotti at gmail.com>
date:        Thu Sep 01 08:19:01 2011 +0300
summary:
  Per RFC 3629, 5- and 6-byte UTF-8 sequences are invalid, so remove them from the doc.

files:
  Doc/library/codecs.rst |  9 ++-------
  1 files changed, 2 insertions(+), 7 deletions(-)


diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -811,7 +811,7 @@
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
 parts: Marker bits (the most significant bits) and payload bits. The marker bits
-are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
+are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
 
@@ -824,12 +824,7 @@
 +-----------------------------------+----------------------------------------------+
 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
 +-----------------------------------+----------------------------------------------+
-| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
-+-----------------------------------+----------------------------------------------+
-| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-+-----------------------------------+----------------------------------------------+
-| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-|                                   | 10xxxxxx                                     |
+| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
 +-----------------------------------+----------------------------------------------+
 
 The least significant bit of the Unicode character is the rightmost x bit.
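
Note: the bit layout in the patched table can be checked with a short script.
What follows is only a sketch, not code from this changeset or from the
codecs module. The helper name utf8_encode is hypothetical, and the script
assumes Python 3 (it uses bytes([...])), although the changeset itself targets
the 2.7 branch. It rejects surrogates, as RFC 3629 requires; that goes beyond
what the table above shows.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        # RFC 3629 caps UTF-8 at U+10FFFF, so four bytes is the maximum.
        raise ValueError("code point out of range: %#x" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are not valid scalar values (assumption: strict RFC 3629).
        raise ValueError("surrogate code point: %#x" % cp)
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    # Compare against the built-in codec at the boundaries of each table row.
    for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(utf8_encode(0x20AC))
```

The least significant payload bit ends up in the last byte, which matches the
"rightmost x bit" wording of the documentation. A lead byte with five or six
marker bits (0xF8 through 0xFD) is never produced, and the built-in decoder
refuses such input with UnicodeDecodeError.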

-- 
Repository URL: http://hg.python.org/cpython

