Making IDLE3 ignore non-BMP characters instead of throwing an exception?

Fri Oct 21 07:16:36 EDT 2016

On 2016-10-17, eryk sun wrote:

> On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>> I'm using IDLE 3 (with python 3.5.2) to work interactively with
>> Twitter data, which of course contains emojis.  Whenever the running
>> program tries to print the text of a tweet with an emoji, it barfs
>> this & stops running:
>>
>>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>>   position 102-102: Non-BMP character not supported in Tk
>>
>> Is there any way to set IDLE to ignore these characters (either drop
>> them or replace them with something else) instead of throwing the
>> exception?
>>
>> If not, what's the best way to strip them out of the string before
>> printing?
>
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
>     import builtins
>
>     def print_ucs2(*args, print=builtins.print, **kwds):
>         args2 = []
>         for a in args:
>             a = str(a)
>             # Guard against '': max() raises ValueError on an empty string.
>             if a and max(a) > '\uffff':
>                 # Split each non-BMP character into a UTF-16 surrogate pair.
>                 b = a.encode('utf-16le', 'surrogatepass')
>                 chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
>                          for i in range(0, len(b), 2)]
>                 a = ''.join(chars)
>             args2.append(a)
>         print(*args2, **kwds)
>
>     builtins._print = builtins.print
>     builtins.print = print_ucs2
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
> pair of empty boxes. If you're not using Windows you can modify this
> to print something else for non-BMP characters, such as a replacement
> character or \U literals.

Clever, thanks.  (I'm actually using Linux.)
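
For the non-Windows case, here is a minimal sketch of the variant
suggested above, which substitutes each non-BMP character instead of
splitting it into surrogates.  The names bmp_safe and print_bmp are
made up for illustration and are not part of IDLE or the standard
library; the sketch assumes that U+FFFD (or a \U literal) is an
acceptable stand-in for an emoji.

```python
import builtins

def bmp_safe(text, escape=False):
    # Hypothetical helper: return text with every character above U+FFFF
    # replaced, either by U+FFFD (default) or by a \UXXXXXXXX literal.
    def fix(ch):
        if ord(ch) <= 0xFFFF:
            return ch                      # BMP character: keep as-is
        if escape:
            return '\\U%08x' % ord(ch)     # e.g. \U0001f44c
        return '\ufffd'                    # REPLACEMENT CHARACTER
    return ''.join(fix(ch) for ch in text)

def print_bmp(*args, print=builtins.print, **kwds):
    # Hypothetical wrapper: same calling convention as print(), but every
    # argument is converted to str and filtered through bmp_safe() first.
    print(*(bmp_safe(str(a)) for a in args), **kwds)

# To patch the interactive session, as in the surrogate-pair version:
# builtins.print = print_bmp
```

With this, print_bmp('ok \U0001F44C') writes 'ok ' followed by a
single U+FFFD, and bmp_safe(s, escape=True) keeps the code point
readable as text if the original character matters.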

-- 
Consistently separating words by spaces became a general custom about
the tenth century A. D., and lasted until about 1957, when FORTRAN
abandoned the practice.              --- Sun FORTRAN Reference Manual