[Python-ideas] Add "has_surrogates" flags to string object

Stephen J. Turnbull stephen at xemacs.org
Wed Oct 9 04:03:46 CEST 2013


Greg Ewing writes:
 > Bruce Leban wrote:
 > > The FAQ is explicit that this is wrong: "The definition of UTF-8 
 > > requires that supplementary characters (those using surrogate pairs in 
 > > UTF-16) be encoded with a single four byte 
 > > sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4
 > 
 > Python's internal string representation is not UTF-16, though,
 > so this doesn't apply directly.

It applies directly to Steven's examples, since they use .encode() and
.decode().

 > Seems to me it hinges on whether a pair of surrogate code
 > points appearing in a Python string are meant to represent
 > a single character or not.

Only (a subset of) low surrogates is valid in a Python string, so a
pair can't possibly respresent a supplementary character in UTF-16
encoding.



More information about the Python-ideas mailing list