[Python-ideas] Add "has_surrogates" flags to string object
Terry Reedy
tjreedy at udel.edu
Wed Oct 9 04:43:54 CEST 2013
On 10/8/2013 8:55 PM, Steven D'Aprano wrote:
> '\ud800\udc01'.encode('utf-8')
> => b'\xf0' b'\x90' b'\x80' b'\x81'
>
> I stress that Python 3.3 doesn't actually do this, but my reading of the
> FAQ suggests that it should.
And I already explained on python-list why that reading is wrong;
transcoding a utf-16 string (sequence of 2-byte words, subject to
validity rules) is different from encoding unicode text (character
sequence, and surrogates are not characters). A utf-16 to utf-8
transcoder should (must) do the above, but in 3.3+, the utf-8 codec is
no longer the utf-16 trancoder that it effectively was for narrow builds.
Each utf form defines a one to one mapping between unicode texts and
valid code unit sequences. (Unicode Standard, Chapter 3, definition
D79.) Having both '\U00010001' and '\ud800\udc01' map to
b'\xf0\x90\x80\x81' would violate that important property.
'\ud800\udc01' represents a character in utf-16 but not in python's
flexible string representation. The latter uses one code unit (of
variable size per string) per character, instead of a variable number of
code units (of one size for all strings) per character.
Because machines have not conceptual, visual, or aural memory, but only
byte memory, they must re-encode abstract characters to bytes to
remember them. In pre 3.3 narrow builds, where utf-16 was used
internally, decoding and encoding amounted to transcoding bytes
encodings into the utf-16 encoding, and vice versa. So utf-8
b'\xf0\x90\x80\x81' and utf-16 '\ud800\udc01' were mapped into each
other. Whether the mapping was done directly or indirectly, via the
character codepoint value, did not matter to the user.
In any case FSR no longer uses multiple-code-unit encodings internally,
and '\ud800\udc01', even though allowed for practical reasons, does not
represent and is not the same as '\U00010001'. The proposed
'has_surrogates' flag amounts to an 'not strictly valid' flag. Only the
FSR implementors can decide if it is worth the trouble.
--
Terry Jan Reedy
More information about the Python-ideas
mailing list