[Python-ideas] Processing surrogates in

Sat May 16 12:02:41 CEST 2015

On Sat, May 16, 2015 at 02:47:02AM -0700, Andrew Barnert via Python-ideas wrote:

> > The unique thing about the surrogate case is that *no* codec is
> > supposed to encode them, not even the universal ones:
> 
> Python doesn't have a CESU-8 codec (or "JNI UTF-8" or any of the other 
> near-equivalent abominations), right? 

*shrug* Even if it doesn't, it's just a codec, not new syntax. Anyone 
can create their own codecs. There probably are people who need CESU-8 
for compatibility with other apps, and if the std lib can include 
UTF-8-sig, it can probably include CESU-8. Or it can be left for those 
who need it to implement it themselves.
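
For what it's worth, a do-it-yourself codec could be sketched roughly as
below. This is only an illustration, not a vetted implementation: the
codec name "x-cesu-8" and the helper functions are made up for the
example, the errors argument is ignored, and decoding is left out. It
goes through UTF-16 code units, so a lone surrogate in the input is
rejected in the same way that strict UTF-8 rejects it.

```python
import codecs


def _cesu8_encode(text, errors="strict"):
    # Hypothetical helper; `errors` is accepted but ignored in this sketch.
    # Strict UTF-16 encoding raises UnicodeEncodeError on lone surrogates.
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Write every 16-bit code unit, surrogates included, in UTF-8 form.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out), len(text)


def _cesu8_decode(data, errors="strict"):
    # Decoding is deliberately omitted from this sketch.
    raise NotImplementedError("decoding is not part of this sketch")


def _search(name):
    # Older Pythons pass "x-cesu-8", newer ones pass "x_cesu_8".
    if name.replace("-", "_") == "x_cesu_8":
        return codecs.CodecInfo(
            encode=_cesu8_encode, decode=_cesu8_decode, name="x-cesu-8"
        )
    return None


codecs.register(_search)
```

Once the search function is registered, '\U00010400'.encode('x-cesu-8')
goes through the helper above like any other codec lookup.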

> Because IIRC, CESU-8 says that 
> (in Python terms) '\U00010400' and '\uD801\uDC00' should both encode 
> to b'\xED\xA0\x81\xED\xB0\x80', instead of the former encoding to 
> b'\xF0\x90\x90\x80' and the latter not being encodable because it's 
> not a string.

Sounds about right as far as the first half goes:

http://unicode.org/reports/tr26/
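
To make the arithmetic concrete, here is a rough worked example for
U+10400. It assumes the TR 26 rule that a supplementary character is
written as its UTF-16 surrogate pair, with each surrogate taking a
three-byte sequence. The variable names are mine, and 'surrogatepass'
is used only as a convenient way to get those three-byte forms out of
Python's UTF-8 codec.

```python
cp = 0x10400

# Split the supplementary code point into a UTF-16 surrogate pair.
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

# CESU-8 form: each surrogate written as a three-byte sequence.
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))

# Ordinary UTF-8 form: a single four-byte sequence.
utf8 = chr(cp).encode("utf-8")

print(hex(high), hex(low))  # 0xd801 0xdc00
print(cesu8)                # b'\xed\xa0\x81\xed\xb0\x80'
print(utf8)                 # b'\xf0\x90\x90\x80'
```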

As far as the second half goes, the TR doesn't say anything about 
processing surrogates in the source Unicode string. Since (strict) 
Unicode strings cannot contain surrogates, I think that CESU-8 should 
treat them as an error, just as UTF-8 does. The TR does say:

    CESU-8 defines an encoding scheme for Unicode identical to
    UTF-8 except for its representation of supplementary characters. 

That seems pretty clear to me: if '\uDC00'.encode('utf-8') raises an 
error, then so should '\uDC00'.encode('cesu-8').
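
The UTF-8 half of that comparison is easy to check in Python 3. A quick
sketch (the helper name is hypothetical) showing that strict encoding
refuses a lone surrogate, and that writing one out takes an explicit
opt-in through the 'surrogatepass' error handler:

```python
def encodes_strictly(text, encoding="utf-8"):
    """Return True if text encodes with the default 'strict' handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(encodes_strictly("\U00010400"))  # True
print(encodes_strictly("\udc00"))      # False

# Only an explicit error handler lets the surrogate through.
print("\udc00".encode("utf-8", "surrogatepass"))  # b'\xed\xb0\x80'
```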

-- 
Steve