[Python-ideas] Add "has_surrogates" flags to string object
Masklinn
masklinn at masklinn.net
Tue Oct 8 16:40:58 CEST 2013
On 2013-10-08, at 16:20 , Steven D'Aprano wrote:
> I'd actually be disappointed if that were the case; I think that would
> be a poor design. But if that's what the Unicode standard demands,
> Python ought to support it.
That would be really weird: it'd mean an *encoder* has to translate a
surrogate pair into the actual codepoint in some sort of weird
UTF-specific normalization pass.
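
A quick sketch of how this looks in practice, assuming CPython 3.3 or
later (the FSR); the helper name strict_utf8_accepts is made up for
illustration. The strict UTF-8 encoder does not merge two surrogate
code points in a str, it rejects them. Combining a pair is the UTF-16
*decoder's* job:

```python
# Two surrogate code points sitting in a str; they stay separate.
pair = "\ud83d\ude00"
pair_length = len(pair)

def strict_utf8_accepts(text):
    """Return True if the strict UTF-8 encoder accepts `text` (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# The encoder does no "normalization pass": it simply refuses.
pair_accepted = strict_utf8_accepts(pair)

# The same two code units as UTF-16-LE bytes (D83D DE00) are combined
# by the decoder into the single code point U+1F600.
combined = b"\x3d\xd8\x00\xde".decode("utf-16-le")
combined_utf8 = combined.encode("utf-8")
```

Here pair_accepted is False, while combined is a one-character string
that encodes to UTF-8 without complaint.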
> But hopefully somebody will explain to me why my interpretation is wrong
> :-)
>
> [...]
>> The FAQ reads a bit strangely, I think because it's written from the
>> viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
>> UTF-32 are transcoding from that. Which does not apply to CPython and
>> the FSR.
>
> Hmmm... well, that might explain it. If it's written by Java programmers
> for Java programmers, they may very well decide that having spent 20
> years trying to convince people that string != ASCII, they're now
> going to convince them that string == UTF-16 instead :/
To be fair, it's not just Java programmers; IIRC ICU uses UTF-16 as the
internal encoding.
>> Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
>> must not contain surrogate codes: a surrogate pair should have been
>> decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a
>> single surrogate should have been caught by the UTF-16 decoder and
>> should have triggered the error handler at that point. A surrogate code
>> in a CPython string means the string is corrupted[1].
>
> I think that interpretation is a bit strong. I think it would be fair to
> say that CPython strings may contain surrogates, but you can't encode
> them to bytes using the UTFs. Nor are there any byte sequences that can
> be decoded to surrogates using the UTFs.
>
> This essentially means that you can only get surrogates in a string
> using (e.g.) chr() or \u escapes, and you can't then encode them to
> bytes using UTF encodings.
>
>> Surrogates *may* appear in binary data, while building a UTF-16
>> bytestream by hand.
>
> But there you're talking about bytes, not byte strings. Byte strings can
> contain any bytes you like :-)
Yes, that's basically what I mean: I think surrogates only make sense
in a bytestream, not in a unicode stream.
Although I did not remember/was not aware of PEP 383 (thank you Stephen),
which makes the Unicode spec irrelevant to what a Python string contains.
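
For reference, a minimal sketch of what PEP 383 does, assuming
Python 3.1 or later where the "surrogateescape" error handler exists.
The sample bytes are made up for illustration. An undecodable byte is
smuggled into the str as a lone low surrogate, and the same handler
turns it back into the original byte on encoding:

```python
# 0xFF can never appear in well-formed UTF-8.
raw = b"a\xffb"

# With surrogateescape, the bad byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the strict encoder rejects the escaped surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

So such a str deliberately contains a lone surrogate, and only the
codec plus error handler decides whether that is an error.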
On 2013-10-08, at 16:31 , Stephen J. Turnbull wrote:
> Furthermore, there is nothing to stop a Python unicode from containing
> any code unit (including both surrogates and other non-characters like
> 0xFFFF). Checking of the rules you cite is done by codecs, at
> encoding and decoding time.
Noncharacters are a very different case, for what it's worth: their own
FAQ clearly notes that they are valid, full-fledged codepoints and must
be encoded and preserved by UTFs:
http://www.unicode.org/faq/private_use.html#nonchar7
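
A small check of that difference, as a sketch assuming Python 3.4 or
later (where the UTF-16 and UTF-32 encoders also reject surrogates).
The helper name survives_round_trip is made up for illustration:

```python
def survives_round_trip(text, codec):
    """Return True if `text` encodes and decodes back unchanged (hypothetical helper)."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        return False

codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")

# U+FFFF and U+FDD0 are noncharacters; U+D800 is a lone surrogate.
nonchar_results = [
    survives_round_trip(ch, codec)
    for ch in ("\uffff", "\ufdd0")
    for codec in codecs_to_try
]
surrogate_results = [survives_round_trip("\ud800", codec) for codec in codecs_to_try]

# The noncharacter U+FFFF as UTF-8 bytes.
ffff_utf8 = "\uffff".encode("utf-8")
```

Every entry in nonchar_results is True, and every entry in
surrogate_results is False.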