[Python-ideas] Add "has_surrogates" flags to string object
Stephen J. Turnbull
stephen at xemacs.org
Wed Oct 9 04:03:46 CEST 2013
Greg Ewing writes:
> Bruce Leban wrote:
> > The FAQ is explicit that this is wrong: "The definition of UTF-8
> > requires that supplementary characters (those using surrogate pairs in
> > UTF-16) be encoded with a single four byte
> > sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> Python's internal string representation is not UTF-16, though,
> so this doesn't apply directly.
It applies directly to Steven's examples, since they use .encode() and
.decode().
> Seems to me it hinges on whether a pair of surrogate code
> points appearing in a Python string are meant to represent
> a single character or not.
Only (a subset of) low surrogates is valid in a Python string, so a
pair can't possibly respresent a supplementary character in UTF-16
encoding.
More information about the Python-ideas
mailing list