[Python-ideas] Processing surrogates in

Tue May 5 19:33:28 CEST 2015

On May 5, 2015, at 03:46, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> 
> Andrew Barnert writes:
> 
>> (I'm not sure if we actually have a UCS-2 codec, but if not, it's
>> trivial to write--it's just UTF-16 without surrogates.)
> 
> The PEP 393 machinery knows when astral characters are introduced
> because it has to widen the representation.  That might be a more
> convenient place to raise an exception on non-BMP characters.
> 
But the PEP 393 machinery doesn't know when a string is ultimately destined for a UCS-2 application, any more than it knows when a string has to be pure ASCII, CP1252, or any other character set.

If you want to print emoji to a CP1252 console or write them to a Shift-JIS text file, you get an error from an explicit or implicit `str.encode`, and you can debug it. Displaying emoji in a Tkinter GUI should work exactly the same way. It doesn't only because we pretend "narrow Unicode" is a real thing and implicitly convert to UTF-16, instead of making the code explicitly specify UCS-2 or UTF-16 as appropriate.
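
To make that concrete, here is a rough sketch of what "UTF-16 without surrogates" could look like as an explicit encoding step. The name `encode_ucs2` and its `byteorder` parameter are made up for illustration; nothing like this exists in the stdlib as far as I know, and a real codec would be registered through the `codecs` module rather than written as a bare function.

```python
def encode_ucs2(text, byteorder="little"):
    """Hypothetical helper: encode text as UCS-2, rejecting astral characters.

    UCS-2 is treated here as UTF-16 restricted to the BMP, so any code
    point above U+FFFF is an error instead of becoming a surrogate pair.
    """
    for index, char in enumerate(text):
        if ord(char) > 0xFFFF:
            # Same exception type that str.encode raises for CP1252 etc.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1,
                "non-BMP character cannot be represented in UCS-2",
            )
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    # With only BMP characters left, UTF-16 never emits surrogate pairs.
    return text.encode(codec)


# BMP-only text encodes to two bytes per character.
print(encode_ucs2("abc"))  # b'a\x00b\x00c\x00'

# An emoji fails the same way for CP1252 and for the UCS-2 sketch.
for encode in (lambda s: s.encode("cp1252"), encode_ucs2):
    try:
        encode("hi \U0001F600")
    except UnicodeEncodeError as exc:
        print(exc.encoding, exc.start, exc.reason)
```

The point of doing it this way is that the failure shows up at the boundary where the string is handed to the UCS-2 consumer, as an ordinary `UnicodeEncodeError`, rather than somewhere inside the string implementation.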