[Python-ideas] Processing surrogates in
Steven D'Aprano
steve at pearwood.info
Sat May 16 12:02:41 CEST 2015
On Sat, May 16, 2015 at 02:47:02AM -0700, Andrew Barnert via Python-ideas wrote:
> > The unique thing about the surrogate case is that *no* codec is
> > supposed to encode them, not even the universal ones:
>
> Python doesn't have a CESU-8 codec (or "JNI UTF-8" or any of the other
> near-equivalent abominations), right?
*shrug* Even if it doesn't, it's just a codec, not new syntax. Anyone
can create their own codecs. There probably are people who need CESU-8
for compatibility with other apps, and if the std lib can include
UTF-8-sig, it can probably include CESU-8. Or it can be left for those
who need it to implement it themselves.
> Because IIRC, CESU-8 says that
> (in Python terms) '\U00010400' and '\uD801\uDC00' should both encode
> to b'\xED\xA0\x81\xED\xB0\x80', instead of the former encoding to
> b'\xF0\x90\x90\x80' and the latter not being encodable because it's
> not a string.
Sounds about right as far as the first half goes:
http://unicode.org/reports/tr26/
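For anyone who wants to check the bytes, here is a rough sketch using only the built-in codecs. It assumes Python 3; the names `units`, `pair` and `cesu8` are just for illustration, and building CESU-8 by hand this way is my own shortcut, not something the TR spells out in these terms.

```python
s = '\U00010400'

# Standard UTF-8: one four-byte sequence for the supplementary character.
utf8 = s.encode('utf-8')

# CESU-8 by hand: take the UTF-16 code units (a surrogate pair) and encode
# each one as if it were an ordinary BMP code point. The 'surrogatepass'
# error handler is needed because strict UTF-8 refuses surrogates.
units = s.encode('utf-16-be')
pair = ''.join(chr(int.from_bytes(units[i:i + 2], 'big'))
               for i in range(0, len(units), 2))
cesu8 = pair.encode('utf-8', 'surrogatepass')

print(utf8)   # b'\xf0\x90\x90\x80'
print(cesu8)  # b'\xed\xa0\x81\xed\xb0\x80'
```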
As far as the second half goes, the TR doesn't say anything about
processing surrogate pairs in the source Unicode string. Since (strict)
Unicode strings cannot contain surrogates, I think that CESU-8 should
treat them as an error, just as UTF-8 does. The TR does say:
    CESU-8 defines an encoding scheme for Unicode identical to
    UTF-8 except for its representation of supplementary characters.
That seems pretty clear to me: if '\uDC00'.encode('utf-8') raises an
error, then so should '\uDC00'.encode('cesu-8').
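To make that concrete, here is a minimal sketch of a do-it-yourself codec along those lines. Everything in it is an assumption for illustration: the helper names `cesu8_encode`, `cesu8_decode` and `_search` are made up, the name 'cesu-8' is not a codec the std lib provides, and the decoder is deliberately lenient (it also accepts ordinary four-byte UTF-8 sequences and ignores the `errors` argument). It is not a complete or validated implementation of the TR.

```python
import codecs

def cesu8_encode(text, errors='strict'):
    """Hypothetical CESU-8 encoder: returns (bytes, characters consumed)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a UTF-16 surrogate pair
            # and encode each half as a three-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode('utf-8', 'surrogatepass')
        else:
            # BMP characters go through UTF-8 unchanged, so a surrogate in
            # the source string fails exactly as it does for strict UTF-8.
            out += ch.encode('utf-8', errors)
    return bytes(out), len(text)

def cesu8_decode(data, errors='strict'):
    """Hypothetical lenient decoder: returns (str, bytes consumed)."""
    raw = bytes(data)
    # Let the encoded surrogates through, then have the UTF-16 codec join
    # each pair back into one supplementary character. An unpaired
    # surrogate makes the final strict decode raise UnicodeDecodeError.
    halves = raw.decode('utf-8', 'surrogatepass')
    text = halves.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
    return text, len(raw)

def _search(name):
    # Accept both 'cesu-8' and 'cesu_8', since the normalisation of codec
    # names differs between Python versions.
    if name.replace('-', '_') == 'cesu_8':
        return codecs.CodecInfo(encode=cesu8_encode, decode=cesu8_decode,
                                name='cesu-8')
    return None

codecs.register(_search)

print('\U00010400'.encode('cesu-8'))  # b'\xed\xa0\x81\xed\xb0\x80'

for codec in ('utf-8', 'cesu-8'):
    try:
        '\uDC00'.encode(codec)
    except UnicodeEncodeError:
        print(codec, 'rejects a lone surrogate')
```

Note that the exception raised for 'cesu-8' comes from the inner UTF-8 call, so its reported encoding and position are not adjusted; a real codec would want to fix that up.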
--
Steve