[Python-ideas] Processing surrogates in
Stephen J. Turnbull
stephen at xemacs.org
Sat May 16 15:50:49 CEST 2015
Andrew Barnert via Python-ideas writes:
> Python doesn't have a CESU-8 codec (or "JNI UTF-8" or any of the
> other near-equivalent abominations), right? Because IIRC, CESU-8
> says that (in Python terms) '\U00010400' and '\uD801\uDC00' should
> both encode to b'\xED\xA0\x81\xED\xB0\x80', instead of the former
> encoding to b'\xF0\x90\x90\x80' and the latter not being encodable
> because it's not a string.
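For reference, the UTF-8 half of that comparison can be checked directly in Python 3. This is only a small sketch: the variable names are made up for illustration, and 'surrogatepass' is the standard error handler that lets the UTF-8 codec emit the three-byte form of each surrogate, which happens to coincide with the CESU-8 bytes quoted above.

```python
supplementary = '\U00010400'      # one code point outside the BMP
surrogate_pair = '\uD801\uDC00'   # two surrogate code points, not a valid text string

# Strict UTF-8 encodes the supplementary character as a 4-byte sequence.
utf8_bytes = supplementary.encode('utf-8')

# Strict UTF-8 refuses to encode surrogate code points at all.
try:
    surrogate_pair.encode('utf-8')
    strict_rejects_surrogates = False
except UnicodeEncodeError:
    strict_rejects_surrogates = True

# 'surrogatepass' encodes each surrogate as its own 3-byte sequence.
passed_through = surrogate_pair.encode('utf-8', 'surrogatepass')
```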
It's ambiguous what the TR intends. It does say it encodes code
points, which would argue that '\uD801\uDC00' is encodable. However,
it also defines itself as a representation of UTF-16, and the
definition of the encoding itself states "Prior to transforming data
into CESU-8, supplementary characters must first be converted to their
surrogate pair UTF-16 representation."  UTF-16's normative definition
defines it as a Unicode transformation format, and therefore a UTF-16
stream cannot contain surrogates representing themselves, and there's
nothing in the document that refers to the possible interpretation of
surrogate code points as themselves.
So I agree with Steven that a str-to-bytes CESU-8 encoder should error
on any surrogates, and the decoder should error on surrogates not
encountered as a valid surrogate pair. Possibly you'd want special
error handlers that allow handling of the UTF-8 encoding of surrogates.
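
A rough sketch of what that strict behavior might look like, under my reading of the TR. The names cesu8_encode and cesu8_decode are hypothetical (nothing like them ships with Python), and the decoder leans on the strict UTF-16 codec to reject unpaired surrogates rather than checking pairs by hand. It also rejects 4-byte UTF-8 sequences, on the assumption that a CESU-8 stream never contains them.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical strict CESU-8 encoder: errors on any surrogate in the input."""
    out = bytearray()
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise UnicodeEncodeError("cesu-8", text, i, i + 1,
                                     "surrogates not allowed")
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Convert to the UTF-16 surrogate pair first, then encode
            # each half as a 3-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Hypothetical strict CESU-8 decoder: errors on unpaired surrogates."""
    for i, byte in enumerate(data):
        if byte >= 0xF0:
            # Lead byte of a 4-byte UTF-8 sequence (or an invalid byte).
            raise UnicodeDecodeError("cesu-8", data, i, i + 1,
                                     "4-byte sequences not allowed")
    # Decode to a str that may hold surrogate code points ...
    units = data.decode("utf-8", "surrogatepass")
    # ... then round-trip through UTF-16, whose strict decoder combines
    # valid pairs and raises UnicodeDecodeError on lone surrogates.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

With these, cesu8_encode('\U00010400') gives b'\xed\xa0\x81\xed\xb0\x80' and cesu8_decode() of those bytes gives back '\U00010400', while '\uD801\uDC00' as input to the encoder raises UnicodeEncodeError. Special error handlers for lone surrogates are left out of the sketch.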
> Anyway, I don't know if that counts as a Unicode encoding, since
> it's only described in a TR, not the standard itself.
The TR specifically excludes it from the standard.
> And Python is probably right to ignore it (assuming I'm remembering
> right and Python does ignore it...), even if that makes problems
> for Jython or Oracle DB-API libs or whatever.
Why would it cause trouble for them? They're not going to use
byte-oriented functions to manipulate Unicode after going to all that
trouble to implement UTF-16 handling internally.
We're getting kinda far afield here, aren't we?