[Python-ideas] Processing surrogates in

Sat May 16 15:50:49 CEST 2015

Andrew Barnert via Python-ideas writes:

 > Python doesn't have a CESU-8 codec (or "JNI UTF-8" or any of the
 > other near-equivalent abominations), right? Because IIRC, CESU-8
 > says that (in Python terms) '\U00010400' and '\uD801\uDC00' should
 > both encode to b'\xED\xA0\x81\xED\xB0\x80', instead of the former
 > encoding to b'\xF0\x90\x90\x80' and the latter not being encodable
 > because it's not a string.
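
For reference, the byte sequences in that example can be checked against
Python 3's own UTF-8 codec.  This is only a quick sketch: the variable
names are mine, and the "surrogatepass" error handler is the stdlib's
way of letting lone surrogates through, not a CESU-8 codec.

```python
# The two strings from the quoted example.
supplementary = "\U00010400"   # one supplementary code point
pair = "\ud801\udc00"          # two surrogate code points in a str

# The supplementary character encodes to a 4-byte UTF-8 sequence.
four_byte = supplementary.encode("utf-8")

# The strict UTF-8 codec refuses surrogate code points.
try:
    pair.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# With "surrogatepass", each surrogate is encoded as its own 3-byte
# sequence, which yields the bytes CESU-8 uses for U+10400.
six_byte = pair.encode("utf-8", "surrogatepass")

print(four_byte, strict_rejected, six_byte)
```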

It's ambiguous what the TR intends.  It does say it encodes code
points, which would argue that '\uD801\uDC00' is encodable.  However,
it also defines itself as a representation of UTF-16, and the
definition of the encoding itself states "Prior to transforming data
into CESU-8, supplementary characters must first be converted to their
surrogate pair UTF-16 representation."  UTF-16's normative definition
defines it as a Unicode transformation format, and therefore a UTF-16
stream cannot contain surrogates representing themselves, and there's
nothing in the document that refers to the possible interpretation of
surrogate code points as themselves.

So I agree with Steven that a str-to-bytes CESU-8 encoder should error
on any surrogates, and the decoder should error on surrogates not
encountered as a valid surrogate pair.  Possibly you'd want special
error handlers that allow handling of the UTF-8 encoding of surrogates.
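
A minimal sketch of that strict behaviour, assuming nothing beyond the
stdlib.  The names cesu8_encode and cesu8_decode are hypothetical (no
such functions exist in Python), and the decoder raises a plain
ValueError rather than tracking byte offsets for a UnicodeDecodeError.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical strict CESU-8 encoder: errors on any surrogate in text."""
    out = bytearray()
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise UnicodeEncodeError(
                "cesu-8", text, i, i + 1, "surrogates not allowed")
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Convert to a UTF-16 surrogate pair first, then encode each
            # half as a 3-byte sequence.
            cp -= 0x10000
            hi = 0xD800 | (cp >> 10)
            lo = 0xDC00 | (cp & 0x3FF)
            out += chr(hi).encode("utf-8", "surrogatepass")
            out += chr(lo).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Hypothetical strict CESU-8 decoder: errors on unpaired surrogates."""
    # Let encoded surrogates through as individual code points for now.
    units = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(units):
        cp = ord(units[i])
        if cp > 0xFFFF:
            # This came from a 4-byte sequence, which CESU-8 does not use.
            raise ValueError("4-byte UTF-8 sequence is not valid CESU-8")
        if 0xD800 <= cp <= 0xDBFF and i + 1 < len(units):
            lo = ord(units[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Valid pair: combine into one supplementary code point.
                out.append(chr(0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("unpaired surrogate at code unit %d" % i)
        out.append(units[i])
        i += 1
    return "".join(out)
```

With these definitions, cesu8_encode("\U00010400") gives
b"\xed\xa0\x81\xed\xb0\x80" and cesu8_decode() maps those bytes back to
the single character, while "\ud801\udc00" is rejected on encoding.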

 > Anyway, I don't know if that counts as a Unicode encoding, since
 > it's only described in a TR, not the standard itself.

The TR specifically excludes it from the standard.

 > And Python is probably right to ignore it (assuming I'm remembering
 > right and Python does ignore it...), even if that makes problems
 > for Jython or Oracle DB-API libs or whatever.

Why would it cause trouble for them?  They're not going to use
byte-oriented functions to manipulate Unicode after going to all that
trouble to implement UTF-16 handling internally.

We're getting kinda far afield here, aren't we?