[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 16:20:09 CEST 2013

On Tue, Oct 08, 2013 at 03:48:18PM +0200, Masklinn wrote:
> On 2013-10-08, at 15:02 , Steven D'Aprano wrote:

> > py> surr.encode('utf-8')
> > Traceback (most recent call last):
> >  File "<stdin>", line 1, in <module>
> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in 
> > position 0: surrogates not allowed
> > 
> > as per the standard:
> > 
> > http://www.unicode.org/faq/utf_bom.html#utf8-5
> > 
> > I *think* you are supposed to be able to encode surrogate *pairs* to 
> > UTF-8, if I'm reading the FAQ correctly
> 
> I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4:
> 
> > there is a widespread practice of generating pairs of three byte
> > sequences in older software, especially software which pre-dates the
> > introduction of UTF-16 or that is interoperating with UTF-16
> > environments under particular constraints. Such an encoding is not
> > conformant to UTF-8 as defined.
> 
> Pairs of 3-byte sequences would be encoding each surrogate directly to
> UTF-8, whereas a single 4-byte sequence would be decoding the surrogate
> pair to a codepoint and encoding that codepoint to UTF-8. My reading
> of the FAQ makes the second interpretation the only valid one.

It's not that clear to me. I fear the Unicode FAQs don't distinguish 
between Unicode strings and bytes well enough for my liking :(

But for the record, my interpretion is that if you have a pair of code 
points constisting of the same values as a valid surrogate pair, you 
should be able to encode to UTF-8. To give a concrete example:

Given:

c = '\N{LINEAR B SYLLABLE B038 E}'  # \U00010001
c.encode('utf-8')
=> b'\xf0\x90\x80\x81'

and:

c.encode('utf-16BE')  # encodes as a surrogate pair
=> b'\xd8\x00\xdc\x01'

then those same surrogates, taken as codepoints, should be encodable as 
UTF-8:

'\ud800\udc01'.encode('utf-8')
=> b'\xf0\x90\x80\x81'

I'd actually be disappointed if that were the case; I think that would 
be a poor design. But if that's what the Unicode standard demands, 
Python ought to support it.

But hopefully somebody will explain to me why my interpretation is wrong 
:-)

[...]
> The FAQ reads a bit strangely, I think because it's written from the
> viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
> UTF-32 are transcoding from that. Which does not apply to CPython and
> the FSR.

Hmmm... well, that might explain it. If it's written by Java programmers 
for Java programmers, they may very well decide that having spent 20 
years trying to convince people that string != ASCII, they're now 
going to convince them that string == UTF-16 instead :/

> Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
> must not contain surrogate codes: a surrogate pair should have been
> decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a
> single surrogate should have been caught by the UTF-16 decoder and
> should have triggered the error handler at that point. A surrogate code
> in a CPython string means the string is corrupted[1].

I think that interpretation is a bit strong. I think it would be fair to 
say that CPython strings may contain surrogates, but you can't encode 
them to bytes using the UTFs. Nor are there any byte sequences that can 
be decoded to surrogates using the UTFs.

This essentially means that you can only get surrogates in a string 
using (e.g.) chr() or \u escapes, and you can't then encode them to 
bytes using UTF encodings.

> Surrogates *may* appear in binary data, while building a UTF-16
> bytestream by hand.

But there you're talking about bytes, not byte strings. Byte strings can 
contain any bytes you like :-)

-- 
Steven