[Python-ideas] Add "has_surrogates" flags to string object

Masklinn masklinn at masklinn.net
Tue Oct 8 16:40:58 CEST 2013


On 2013-10-08, at 16:20 , Steven D'Aprano wrote:
> I'd actually be disappointed if that were the case; I think that would 
> be a poor design. But if that's what the Unicode standard demands, 
> Python ought to support it.

That would be really weird: it would mean an *encoder* has to translate
a surrogate pair into the actual codepoint in some sort of UTF-specific
normalization pass.
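
For illustration, a minimal sketch of what happens today, assuming
CPython 3.3 or later (the variable names are mine, not from any API):
a surrogate pair written into a str stays as two code points, and the
strict UTF-8 encoder rejects it rather than merging it.

```python
# Assumes CPython 3.3+ (flexible string representation).
# Surrogate pair for U+1F600, written as two separate escapes.
pair = "\ud83d\ude00"

# The string holds two distinct code points; nothing merges them.
pair_length = len(pair)

# The strict UTF-8 encoder refuses surrogates instead of combining them.
try:
    pair.encode("utf-8")
    encoder_rejected = False
except UnicodeEncodeError:
    encoder_rejected = True

print(pair_length, encoder_rejected)
```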

> But hopefully somebody will explain to me why my interpretation is wrong 
> :-)
> 
> [...]
>> The FAQ reads a bit strangely, I think because it's written from the
>> viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
>> UTF-32 are transcoding from that. Which does not apply to CPython and
>> the FSR.
> 
> Hmmm... well, that might explain it. If it's written by Java programmers 
> for Java programmers, they may very well decide that having spent 20 
> years trying to convince people that string != ASCII, they're now 
> going to convince them that string == UTF-16 instead :/

To be fair, it's not just Java programmers: IIRC ICU uses UTF-16 as its
internal encoding.

>> Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
>> must not contain surrogate codes: a surrogate pair should have been
>> decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a
>> single surrogate should have been caught by the UTF-16 decoder and
>> should have triggered the error handler at that point. A surrogate code
>> in a CPython string means the string is corrupted[1].
> 
> I think that interpretation is a bit strong. I think it would be fair to 
> say that CPython strings may contain surrogates, but you can't encode 
> them to bytes using the UTFs. Nor are there any byte sequences that can 
> be decoded to surrogates using the UTFs.
> 
> This essentially means that you can only get surrogates in a string 
> using (e.g.) chr() or \u escapes, and you can't then encode them to 
> bytes using UTF encodings.
> 
>> Surrogates *may* appear in binary data, while building a UTF-16
>> bytestream by hand.
> 
> But there you're talking about bytes, not byte strings. Byte strings can 
> contain any bytes you like :-)

Yes, that's basically what I mean: I think surrogates only make sense
in a bytestream, not in a unicode stream.
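
A small sketch of that distinction, assuming CPython 3.3 or later and
using the standard struct module to build the UTF-16 bytes by hand (the
variable names are illustrative only): the pair exists in the bytes,
but the decoder hands back a single code point, and a lone surrogate is
rejected at decode time.

```python
import struct

# Hand-built little-endian UTF-16: the surrogate pair for U+1F600.
raw_pair = struct.pack("<HH", 0xD83D, 0xDE00)

# The decoder combines the pair into one code point.
decoded = raw_pair.decode("utf-16-le")

# A high surrogate on its own is not valid UTF-16.
raw_lone = struct.pack("<H", 0xD83D)
try:
    raw_lone.decode("utf-16-le")
    lone_rejected = False
except UnicodeDecodeError:
    lone_rejected = True

print(len(decoded), hex(ord(decoded)), lone_rejected)
```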

That said, I did not remember (or was not aware of) PEP 383 (thank you,
Stephen), which makes the Unicode spec irrelevant to what a Python
string contains.
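
As I understand PEP 383, the "surrogateescape" error handler is how
lone surrogates legitimately end up in a str. A minimal sketch, assuming
Python 3.1 or later; the sample bytes and names are made up for the
example:

```python
# 0xFF can never appear in valid UTF-8, so a strict decode would fail.
undecodable = b"caf\xff"

# surrogateescape maps the bad byte 0xFF to the lone surrogate U+DCFF.
text = undecodable.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

# Without the handler, the escaped surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

print(ascii(text), round_tripped == undecodable, strict_rejected)
```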


On 2013-10-08, at 16:31 , Stephen J. Turnbull wrote:
> Furthermore, there is nothing to stop a Python unicode from containing
> any code unit (including both surrogates and other non-characters like
> 0xFFFF).  Checking of the rules you cite is done by codecs, at
> encoding and decoding time.

Noncharacters are a very different case, for what it's worth: their own
FAQ clearly notes that they are valid, full-fledged codepoints and must
be encoded and preserved by UTFs:
 http://www.unicode.org/faq/private_use.html#nonchar7
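
A quick check of how CPython's codecs treat them, as a sketch assuming
Python 3 (U+FFFF and U+FDD0 are just two sample noncharacters): unlike
surrogates, they encode and round-trip under strict error handling.

```python
# Two noncharacters: U+FFFF and U+FDD0.
nonchars = "\uffff\ufdd0"

# Each UTF round-trips them unchanged with the default strict handler.
round_trips = {
    codec: nonchars.encode(codec).decode(codec) == nonchars
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}

# U+FFFF has an ordinary three-byte UTF-8 form.
utf8_bytes = "\uffff".encode("utf-8")

print(round_trips, utf8_bytes)
```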

