[Python-ideas] Add "has_surrogates" flags to string object

Terry Reedy tjreedy at udel.edu
Wed Oct 9 04:43:54 CEST 2013


On 10/8/2013 8:55 PM, Steven D'Aprano wrote:

> '\ud800\udc01'.encode('utf-8')
> => b'\xf0' b'\x90' b'\x80' b'\x81'
>
> I stress that Python 3.3 doesn't actually do this, but my reading of the
> FAQ suggests that it should.

And I already explained on python-list why that reading is wrong: 
transcoding a utf-16 string (a sequence of 2-byte words, subject to 
validity rules) is different from encoding unicode text (a character 
sequence, in which surrogates are not characters). A utf-16 to utf-8 
transcoder should (must) do the above, but in 3.3+, the utf-8 codec is 
no longer the utf-16 transcoder that it effectively was for narrow builds.
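
A minimal sketch of the difference, assuming CPython 3.3 or later. The 
variable names are made up for illustration, and struct is used only to 
lay the two code units out as utf-16-le bytes by hand:

```python
import struct

pair = '\ud800\udc01'

# Encoding text: surrogates are not characters, so the strict utf-8
# codec refuses them.
try:
    pair.encode('utf-8')
    strict_result = 'encoded'
except UnicodeEncodeError:
    strict_result = 'rejected'

# Transcoding: treat the same two values as utf-16 code units,
# decode them as utf-16, then encode the resulting text as utf-8.
utf16_bytes = struct.pack('<2H', *[ord(unit) for unit in pair])
transcoded = utf16_bytes.decode('utf-16-le').encode('utf-8')

print(strict_result)
print(transcoded)
```

The second path goes through the character U+10001, which is why it 
yields the four bytes quoted above while the first path does not.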

Each utf form defines a one-to-one mapping between unicode texts and 
valid code unit sequences. (Unicode Standard, Chapter 3, definition 
D79.) Having both '\U00010001' and '\ud800\udc01' map to 
b'\xf0\x90\x80\x81' would violate that important property. 
'\ud800\udc01' represents a character in utf-16 but not in python's 
flexible string representation. The latter uses one code unit (of 
variable size per string) per character, instead of a variable number of 
code units (of one size for all strings) per character.
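
As a rough illustration, assuming CPython 3.3 or later, the two strings 
can be compared directly (the names astral and pair are only labels for 
this sketch):

```python
astral = '\U00010001'   # one code point outside the BMP
pair = '\ud800\udc01'   # two surrogate code points

# One code unit per code point: the lengths differ and the strings
# are not equal.
lengths = (len(astral), len(pair))
same_string = (astral == pair)

# The utf-8 mapping round-trips for the real character.
encoded = astral.encode('utf-8')
round_trip = encoded.decode('utf-8')

print(lengths, same_string, encoded, round_trip == astral)
```

Only astral owns the utf-8 sequence b'\xf0\x90\x80\x81'; decoding those 
bytes gives back the one-character string, never the surrogate pair.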

Because machines have no conceptual, visual, or aural memory, but only 
byte memory, they must re-encode abstract characters to bytes to 
remember them. In pre-3.3 narrow builds, where utf-16 was used 
internally, decoding and encoding amounted to transcoding byte 
encodings into the utf-16 encoding, and vice versa. So utf-8 
b'\xf0\x90\x80\x81' and utf-16 '\ud800\udc01' were mapped into each 
other. Whether the mapping was done directly or indirectly, via the 
character codepoint value, did not matter to the user.

In any case, the FSR no longer uses multiple-code-unit encodings internally, 
and '\ud800\udc01', even though allowed for practical reasons, does not 
represent and is not the same as '\U00010001'. The proposed 
'has_surrogates' flag amounts to a 'not strictly valid' flag. Only the 
FSR implementors can decide if it is worth the trouble.
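
What such a flag would report can be sketched in pure Python. This is 
only an illustration: the function below is hypothetical, scans the 
string on every call, and is not how a flag stored on the string object 
would be implemented.

```python
def has_surrogates(s):
    """Return True if s contains any code point in U+D800..U+DFFF.

    Hypothetical helper for illustration only; a real flag would be
    stored on the string object rather than recomputed by scanning.
    """
    return any('\ud800' <= ch <= '\udfff' for ch in s)


print(has_surrogates('\ud800\udc01'))   # surrogate code points present
print(has_surrogates('\U00010001'))     # a real non-BMP character
print(has_surrogates('abc'))
```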

-- 
Terry Jan Reedy


