[Python-ideas] Add "has_surrogates" flags to string object

Greg Ewing greg.ewing at canterbury.ac.nz
Wed Oct 9 00:49:29 CEST 2013


Bruce Leban wrote:
> The FAQ is explicit that this is wrong: "The definition of UTF-8 
> requires that supplementary characters (those using surrogate pairs in 
> UTF-16) be encoded with a single four byte 
> sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4

Python's internal string representation is not UTF-16, though,
so this doesn't apply directly.

Seems to me it hinges on whether a pair of surrogate code
points appearing in a Python string is meant to represent
a single character or not. I would say not, because otherwise
it would have been stored as a single code point.
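
To make the distinction concrete, here is a small sketch, assuming
CPython 3.3 or later (the PEP 393 string representation). The variable
names and the choice of U+1F600 as the example character are mine, purely
for illustration. It compares a string holding one supplementary code
point with a string holding the two surrogate code points that UTF-16
would use for it:

```python
# Assumption: CPython 3.3+ where str is a sequence of code points.
# U+1F600 is an arbitrary supplementary (non-BMP) character chosen
# for illustration; its UTF-16 surrogate pair is D83D DE00.

astral = "\U0001F600"   # one supplementary code point
pair = "\ud83d\ude00"   # two separate surrogate code points

len_astral = len(astral)   # one code point
len_pair = len(pair)       # two code points
same_string = (astral == pair)

# The supplementary character encodes to a single four-byte sequence.
utf8_astral = astral.encode("utf-8")

# The strict UTF-8 codec refuses lone surrogate code points.
try:
    pair.encode("utf-8")
    pair_encodes_strictly = True
except UnicodeEncodeError:
    pair_encodes_strictly = False

# With the "surrogatepass" error handler each surrogate is written
# on its own as a three-byte sequence, six bytes in total.
utf8_pair = pair.encode("utf-8", "surrogatepass")

print(len_astral, len_pair, same_string)
print(utf8_astral, pair_encodes_strictly, len(utf8_pair))
```

Under those assumptions the two strings compare unequal and have
different lengths, which is consistent with treating the surrogates as
two independent code points rather than as one character.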

-- 
Greg

