[Python-ideas] Add "has_surrogates" flags to string object

Stephen J. Turnbull stephen at xemacs.org
Wed Oct 9 06:29:04 CEST 2013


Steven D'Aprano writes:

 > What I'm hoping for is a definite source that explains what the UTF-8 
 > encoder is supposed to do with a Unicode string containing
 > surrogates.

According to PEP 383, which provides a special mechanism for
roundtripping input that claims to be a particular encoding but does
not conform to that encoding, when encoding to UTF-8, if the errors=
parameter *is* surrogateescape *and* the value is in the first row of
the low surrogate range, it is masked by 0xff and emitted as a single
byte.

In all other cases of surrogates, it should raise an error.  A
conforming Unicode codec must not emit UTF-8 which would decode to a
surrogate.  These cases can occur in valid Python programs because
chr() is unconstrained (for example).

On input, Unicode conformance means that when using the
surrogateescape handler, an alleged UTF-8 stream containing a 6-byte
sequence that would algorithmically decode to a surrogate pair should
be represented internally as a sequence of 6 surrogates from the first
row of the low surrogate range.  If the surrogateescape handler is not
in use, it should raise an error.

Sorry about not testing actual behavior, gotta run to a meeting.

I forget what PEP 383 says about other Unicode codecs.


More information about the Python-ideas mailing list