Making IDLE3 ignore non-BMP characters instead of throwing an exception?

eryk sun eryksun at gmail.com
Mon Oct 17 23:03:34 EDT 2016


On Tue, Oct 18, 2016 at 2:09 AM, Chris Angelico <rosuav at gmail.com> wrote:
> That's not a UTF-16 encoded byte string, though. It's a Unicode string
> that contains two surrogates. So maybe the solution is to convert from
> true Unicode strings into strings like the above - but if so, it
> absolutely must not be done in any user-facing way. It should be an
> implementation detail of Tkinter.

Yes, it's an invalid Unicode string, since it contains surrogate
codes. At the C level this gets passed as a UTF-16 string, even on
Unix: in most builds a Tcl_UniChar is defined as a C unsigned
short, since the macro TCL_UTF_MAX defaults to 3 (UTF-8 bytes).
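
To make the "string that contains two surrogates" concrete, here is a
minimal sketch in pure Python of splitting non-BMP characters into
surrogate pairs and joining them back. The helper names
to_surrogate_pairs and from_surrogate_pairs are made up for
illustration; they are not part of Tkinter or IDLE, and this is not
necessarily how Tkinter would do the conversion internally.

```python
def to_surrogate_pairs(text):
    """Return text with each non-BMP character split into two surrogate codes.

    Hypothetical helper for illustration only, not a Tkinter API.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Standard UTF-16 arithmetic: 20 bits split into two 10-bit halves.
            cp -= 0x10000
            out.append(chr(0xD800 + (cp >> 10)))    # high (lead) surrogate
            out.append(chr(0xDC00 + (cp & 0x3FF)))  # low (trail) surrogate
        else:
            out.append(ch)
    return ''.join(out)


def from_surrogate_pairs(text):
    """Join surrogate pairs back into single non-BMP characters.

    'surrogatepass' lets the encoder emit the surrogate codes as UTF-16
    code units; the decoder then combines well-formed pairs.
    """
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
```

The result of to_surrogate_pairs is exactly the kind of invalid
Unicode string discussed above, so it should stay an implementation
detail. Note that from_surrogate_pairs raises UnicodeDecodeError if
the input contains a lone (unpaired) surrogate.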

As I said, I'm not experienced enough with Tcl/Tk to know whether
UTF-16 strings with surrogate pairs cause other problems. On Linux it
prints the surrogate codes as empty box characters, which is certainly
ugly, and printing two characters in place of one is also incorrect. It
seems that Tcl's UTF-8 conversion doesn't work with UTF-16. Thus
supporting non-BMP characters would be limited to Windows until the
default TCL_UTF_MAX is greater than 3 on Unix platforms. Supposedly
this has actually worked in the core Tcl implementation for some time,
but extensions are holding it back.
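
For builds where surrogate pairs don't display properly, the fallback
named in the subject line (ignoring non-BMP characters rather than
raising) could look roughly like the following. This is only a sketch:
replace_non_bmp is a hypothetical helper, not something IDLE or
Tkinter provides, and the choice of U+FFFD as the stand-in character
is my assumption.

```python
import re

# Matches any code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_non_bmp(text, replacement='\ufffd'):
    """Replace each non-BMP character with a single BMP stand-in.

    Hypothetical helper for illustration only. Pass replacement=''
    to drop the characters entirely.
    """
    return _NON_BMP.sub(replacement, text)
```

This prints one placeholder per character instead of two boxes, at
the cost of losing the original character.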


