[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 15:48:18 CEST 2013

On 2013-10-08, at 15:02 , Steven D'Aprano wrote:

[snipped early part as any response would be superseded by or redundant
with the stuff below]

> However, you cannot encode single surrogates to UTF-8:
> 
> py> surr.encode('utf-8')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in 
> position 0: surrogates not allowed
> 
> as per the standard:
> 
> http://www.unicode.org/faq/utf_bom.html#utf8-5
> 
> I *think* you are supposed to be able to encode surrogate *pairs* to 
> UTF-8, if I'm reading the FAQ correctly

I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4:

> there is a widespread practice of generating pairs of three byte
> sequences in older software, especially software which pre-dates the
> introduction of UTF-16 or that is interoperating with UTF-16
> environments under particular constraints. Such an encoding is not
> conformant to UTF-8 as defined.

Pairs of 3-byte sequences would be encoding each surrogate directly to
UTF-8, whereas a single 4-byte sequence would be decoding the surrogate
pair to a codepoint and encoding that codepoint to UTF-8. My reading
of the FAQ makes the second interpretation the only valid one.

So you can't encode surrogates (either lone or paired) to UTF-8; you
can only encode the codepoint that a surrogate pair represents.
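
To make the distinction concrete, here is a minimal Python 3 sketch (my
own illustration, not something from the FAQ). U+1F600 is an arbitrary
non-BMP codepoint picked for the example, and "surrogatepass" is the
CPython error handler that deliberately emits the non-conformant form:

```python
# U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
ch = "\U0001F600"
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")   # 0xD83D
low = int.from_bytes(utf16[2:], "big")    # 0xDE00

# Conformant: encode the codepoint itself, giving one 4-byte sequence.
utf8 = ch.encode("utf-8")

# Non-conformant: encode each surrogate on its own, giving two 3-byte
# sequences. The strict codec refuses; "surrogatepass" forces it through.
pair = chr(high) + chr(low)
try:
    pair.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
six_bytes = pair.encode("utf-8", "surrogatepass")

print(len(utf8), len(six_bytes), strict_ok)
```

The strict codec rejects the pair just as it rejects the lone surrogate
in the traceback quoted above.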

> In any case, it is certainly legal to have Unicode strings 
> containing non-characters, including surrogates, and you can encode them 
> to UTF-16 and -32.

The UTF-32 section has a similar note to UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf32-7

> A: If an unpaired surrogate is encountered when converting ill-formed
> UTF-16 data, any conformant converter must treat this as an error. By
> representing such an unpaired surrogate on its own, the resulting UTF-32
> data stream would become ill-formed. While it faithfully reflects the
> nature of the input, Unicode conformance requires that encoding form
> conversion always results in valid data stream.

and the UTF-16 section points out:
http://www.unicode.org/faq/utf_bom.html#utf16-7

> Q: Are there any 16-bit values that are invalid?

> A: Unpaired surrogates are invalid in UTFs. These include any value in
> the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆
> to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a
> value in the range D800₁₆ to DBFF₁₆.

As far as I can read the FAQ, it is always invalid to encode a
surrogate. Surrogates are not to be considered codepoints (they're not
just noncharacters[0]; noncharacters are codepoints), and a lone
surrogate in a UTF-16 stream means the stream is corrupted. That should
result in an error during transcoding to anything (unless, I guess, some
recovery mode is used to replace corrupted characters with some mark
during decoding).
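
As a rough illustration of that behaviour in CPython 3 (the byte string
is made up for the example, and "replace" stands in for the kind of
recovery mode I mean):

```python
# UTF-16-BE for 'A', then a lone high surrogate 0xD800, then 'B'.
corrupt = b"\x00A" + b"\xd8\x00" + b"\x00B"

# The strict decoder treats the unpaired surrogate as an error.
try:
    corrupt.decode("utf-16-be")
    raised = False
except UnicodeDecodeError:
    raised = True

# A recovery mode substitutes U+FFFD REPLACEMENT CHARACTER for the
# corrupted unit instead of aborting.
recovered = corrupt.decode("utf-16-be", errors="replace")

print(raised, ascii(recovered))
```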

> So... I'm not sure why this will be useful. Presumably Unicode strings 
> containing surrogate code points will be rare

And they're a sign of a corrupted stream.

The FAQ reads a bit strangely, I think because it's written from the
viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
UTF-32 are transcoding from that. Which does not apply to CPython and
the FSR.

Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
must not contain surrogate codes: a surrogate pair should have been
decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a
single surrogate should have been caught by the UTF-16 decoder and
should have triggered the error handler at that point. A surrogate code
in a CPython string means the string is corrupted[1].
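
For what it's worth, checking a str for surrogate codes is a simple
range test. A minimal sketch: has_surrogates here is a hypothetical
helper named after the subject line, not an existing str method and not
the proposed flag itself:

```python
def has_surrogates(s):
    """Return True if s contains any code in the range D800..DFFF."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

# A non-BMP codepoint is held as a single codepoint, not as a pair.
clean = "caf\u00e9 \U0001F600"

# A lone low surrogate, the same value as in the traceback quoted above.
broken = "abc\udc80"

print(has_surrogates(clean), has_surrogates(broken))
```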

Surrogates *may* appear in binary data, while building a UTF-16
bytestream by hand.

[0] since "noncharacter" has a well-defined meaning in unicode, and only
    applies to 66 codepoints, a much smaller range than surrogates:
    http://www.unicode.org/faq/private_use.html#noncharacters

[1] note that this hinges on my understanding of "UCS2" in FSR being
    actual UCS2; if it's UCS2-with-surrogates, with a heuristic for
    switching between UCS2 and UCS4 depending on the number of
    surrogate pairs in the string, it does not apply