[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 16:31:25 CEST 2013

Masklinn writes:

 > The FAQ reads a bit strangely, I think because it's written from the
 > viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
 > UTF-32 are transcoding from that. Which does not apply to CPython and
 > the FSR.

No, it's written to say *nothing* about internal encodings.  It is
only about the encodings used in interchange of textual data, and
about certain aspects of the processes that may receive and generate
such data (e.g., when data matches a Unicode regular expression, or
how bidirectional text should appear visually).

 > Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
 > must not contain surrogate codes:

No, it says no such thing.  All the Unicode Standard (and the FAQ)
says is this: if Python generates output that purports to be text
encoded in Unicode, that output may not contain surrogate codes except
where those codes are used according to UTF-16 to encode characters in
planes 1 to 16 (the supplementary planes).  And if Python receives
data alleged to be Unicode in some transformation format, it must
raise an error if it receives surrogates other than a correctly formed
surrogate pair in text known to be encoded as UTF-16.

In fact (as I wrote before without proper citation), the internal
encoding of Python has been extended by PEP 383 to use a subset of the
surrogate space to represent undecodable bytes in an octet stream,
when the error handler is set to "surrogateescape".
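
A minimal sketch of that PEP 383 behaviour, assuming Python 3 and the
standard "utf-8" codec; the byte string and variable names are made up
for illustration:

```python
# Bytes that are not valid UTF-8 (0xE9 is "e acute" in Latin-1).
raw = b"caf\xe9"

# With surrogateescape, an undecodable byte 0xNN is mapped to the lone
# low surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[-1]

# Encoding with the same handler turns the surrogate back into the
# original byte, so the octet stream round-trips.
restored = text.encode("utf-8", errors="surrogateescape")
```

Here `escaped` is a single surrogate code point living inside an
ordinary str, which is exactly the "subset of the surrogate space"
used to carry the byte.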

Furthermore, there is nothing to stop a Python unicode string from
containing any code point (including both surrogates and noncharacters
like U+FFFF).  Checking of the rules you cite is done by codecs, at
encoding and decoding time.
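
As a rough illustration, assuming Python 3 and the strict "utf-8"
codec (the helper names below are hypothetical, not part of any
library):

```python
# A str can be built with a lone surrogate and a noncharacter in it;
# nothing is checked when the string object is created.
s = "a" + chr(0xD800) + chr(0xFFFF)


def strict_encode_fails(text):
    """Return True if the strict UTF-8 encoder rejects the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_fails(data):
    """Return True if the strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The three bytes a naive encoder would emit for U+D800.
surrogate_bytes = b"\xed\xa0\x80"

encode_rejected = strict_encode_fails(s)
decode_rejected = strict_decode_fails(surrogate_bytes)
```

Both flags come out true: the string itself is legal, and the codec is
what refuses the lone surrogate in each direction.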