[Python-ideas] Add "has_surrogates" flags to string object

Steven D'Aprano steve at pearwood.info
Wed Oct 9 02:55:07 CEST 2013


On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote:
> On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> 
> > Given:
> >
> > c = '\N{LINEAR B SYLLABLE B038 E}'  # \U00010001
> > c.encode('utf-8')
> > => b'\xf0\x90\x80\x81'
> >
> > and:
> >
> > c.encode('utf-16BE')  # encodes as a surrogate pair
> > => b'\xd8\x00\xdc\x01'
> >
> > then those same surrogates, taken as codepoints, should be encodable as
> > UTF-8:
> >
> > '\ud800\udc01'.encode('utf-8')
> > => b'\xf0\x90\x80\x81'
> >
> >
> > I'd actually be disappointed if that were the case; I think that would
> > be a poor design. But if that's what the Unicode standard demands,
> > Python ought to support it.
> >
> 
> The FAQ is explicit that this is wrong: "The definition of UTF-8 requires
> that supplementary characters (those using surrogate pairs in UTF-16) be
> encoded with a single four byte sequence."
> http://www.unicode.org/faq/utf_bom.html#utf8-4

And if you count the number of bytes, you will see four of them:

'\ud800\udc01'.encode('utf-8')
=> b'\xf0' b'\x90' b'\x80' b'\x81'

I stress that Python 3.3 doesn't actually do this, but my reading of the 
FAQ suggests that it should.
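
For reference, what 3.3 actually does with that string is reject it 
outright (I'm paraphrasing the error message here):

'\ud800\udc01'.encode('utf-8')
=> UnicodeEncodeError: ... surrogates not allowed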

The question isn't what UTF-8 should do with supplementary characters 
(those outside the BMP). That is well-defined, and Python 3.3 gets it 
right. The question is what it should do with pairs of surrogates. 
Ill-formed surrogates are rightly illegal when encoding to UTF-8:

# a lone surrogate is illegal
'\ud800'.encode('utf-8') must be treated as an error

# two high surrogates, or two low surrogates
'\udc01\udc01'.encode('utf-8') must be treated as an error
'\ud800\ud800'.encode('utf-8') must be treated as an error

# if they're in the wrong order
'\udc01\ud800'.encode('utf-8') must be treated as an error
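
As a sanity check (just a throwaway snippet of mine, nothing 
authoritative), Python 3.3 does reject all of these today with the 
default strict error handler:

# each of the ill-formed sequences above raises UnicodeEncodeError
for bad in ('\ud800', '\udc01\udc01', '\ud800\ud800', '\udc01\ud800'):
    try:
        bad.encode('utf-8')
    except UnicodeEncodeError:
        print(ascii(bad), 'rejected')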


The only thing I'm not sure about is how to deal with *valid* 
pairs of surrogates:

'\ud800\udc01'.encode('utf-8') should do what?

I would personally hope that this too raises an exception, which is 
Python's current behaviour, but my reading of the FAQ is that it should 
be treated as if there were an implicit UTF-16 conversion. (I hope I'm 
wrong!) That is:

1) treat the sequence of code points as if it were a sequence of two 
16-bit values b'\xd8\x00' b'\xdc\x01'

2) implicitly decode it using UTF-16 to get U+10001

3) encode U+10001 using UTF-8 to get b'\xf0\x90\x80\x81'
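
For the avoidance of doubt, here is a rough sketch of what that reading 
would mean in code. This is purely my own illustration, the function 
name is made up, and it assumes every code point in the string fits in a 
single 16-bit code unit (which step 1 requires anyway):

import struct

def utf8_via_implicit_utf16(s):
    # Step 1: lay the code points out as 16-bit big-endian code units.
    units = struct.pack('>%dH' % len(s), *(ord(c) for c in s))
    # Steps 2 and 3: decode those units as UTF-16, then encode as UTF-8.
    return units.decode('utf-16-be').encode('utf-8')

utf8_via_implicit_utf16('\ud800\udc01')
=> b'\xf0\x90\x80\x81'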

That would be (in my opinion) *horrible*, but that's my reading of the 
Unicode FAQ. The question asks:

"How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8?"

and the answer seems to be:

"The definition of UTF-8 requires that supplementary characters (those 
using surrogate pairs in UTF-16) be encoded with a single four byte 
sequence."

which doesn't actually answer the question (the question is about 
SURROGATE PAIRS, the answer is about SUPPLEMENTARY CHARACTERS) but 
suggests the above horrible interpretation.

What I'm hoping for is a definitive source that explains what the UTF-8 
encoder is supposed to do with a Unicode string containing surrogates.

(And presumably the other UTF encoders as well, although I haven't tried 
thinking about them yet.)



> It goes on to say that there is a widespread practice of doing it anyway in
> older software. Therefore, it might be acceptable to accept these
> mis-encoded characters when *decoding* but they should never be generated
> when *encoding*. 

They are talking about the practice of generating six bytes, that is, 
two three-byte sequences, one for each surrogate. Note that I'm not 
generating six bytes anywhere.
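
(For what it's worth, that six-byte form is easy to produce in 3.3 if 
you explicitly ask for it with the 'surrogatepass' error handler, which 
emits one three-byte sequence per surrogate:

'\ud800\udc01'.encode('utf-8', 'surrogatepass')
=> b'\xed\xa0\x80\xed\xb0\x81'

but that is opt-in, not the default behaviour.)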



-- 
Steven

