[Python-ideas] Add "has_surrogates" flags to string object
Greg Ewing
greg.ewing at canterbury.ac.nz
Wed Oct 9 00:49:29 CEST 2013
Bruce Leban wrote:
> The FAQ is explicit that this is wrong: "The definition of UTF-8
> requires that supplementary characters (those using surrogate pairs in
> UTF-16) be encoded with a single four byte
> sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4
Python's internal string representation is not UTF-16, though,
so this doesn't apply directly.
Seems to me it hinges on whether a pair of surrogate code
points appearing in a Python string is meant to represent
a single character or not. I would say not, because otherwise
they would have been stored as a single code unit.
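
To make the distinction concrete, here is a rough sketch of how this
looks from the interpreter. It assumes CPython 3.4 or later (PEP 393
strings, plus "surrogatepass" support in the UTF-16 codec); the
variable names and the choice of U+1F600 are only for illustration.

```python
# Assumption: CPython 3.4+; names below are illustrative only.
single = "\U0001F600"      # one supplementary code point
pair = "\ud83d\ude00"      # the two surrogates UTF-16 would use for it

# The pair is stored as two separate code points, not merged into one.
print(len(single), len(pair))    # 1 2
print(single == pair)            # False

# The real character encodes to a single four-byte UTF-8 sequence.
utf8_single = single.encode("utf-8")
print(utf8_single)               # b'\xf0\x9f\x98\x80'

# Strict UTF-8 refuses to encode the surrogate code points at all.
try:
    pair.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                 # False

# With surrogatepass, each surrogate becomes its own three-byte sequence.
utf8_pair = pair.encode("utf-8", "surrogatepass")
print(utf8_pair)                 # b'\xed\xa0\xbd\xed\xb8\x80'

# Only a trip through UTF-16 reinterprets the pair as one character.
merged = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(merged == single)          # True
```

So the string itself treats the pair as two code points; it only
becomes one character if something explicitly pushes it through a
UTF-16 encode/decode step.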
--
Greg