Making IDLE3 ignore non-BMP characters instead of throwing an exception?
Adam Funk
a24061 at ducksburg.com
Fri Oct 21 07:16:36 EDT 2016
On 2016-10-17, eryk sun wrote:
> On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>> I'm using IDLE 3 (with python 3.5.2) to work interactively with
>> Twitter data, which of course contains emojis. Whenever the running
>> program tries to print the text of a tweet with an emoji, it barfs
>> this & stops running:
>>
>> UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>> position 102-102: Non-BMP character not supported in Tk
>>
>> Is there any way to set IDLE to ignore these characters (either drop
>> them or replace them with something else) instead of throwing the
>> exception?
>>
>> If not, what's the best way to strip them out of the string before
>> printing?
>
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
> import builtins
>
> def print_ucs2(*args, print=builtins.print, **kwds):
>     args2 = []
>     for a in args:
>         a = str(a)
>         # Guard against empty strings: max('') raises ValueError.
>         if a and max(a) > '\uffff':
>             # Split each non-BMP character into its two UTF-16 surrogates.
>             b = a.encode('utf-16le', 'surrogatepass')
>             chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
>                      for i in range(0, len(b), 2)]
>             a = ''.join(chars)
>         args2.append(a)
>     print(*args2, **kwds)
>
> builtins._print = builtins.print
> builtins.print = print_ucs2
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
> pair of empty boxes. If you're not using Windows you can modify this
> to print something else for non-BMP characters, such as a replacement
> character or \U literals.
Clever, thanks. (I'm actually using Linux.)
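
For the Linux case, a minimal sketch of the variant suggested above
(a replacement character or \U literals instead of surrogate pairs)
might look like this. The names replace_non_bmp, escape_non_bmp and
print_bmp are made up for illustration; they are not part of IDLE or
the standard library, and this has not been checked against every
Tk build:

```python
import builtins

# Keep a handle on the real print() so the wrapper can delegate to it.
_original_print = builtins.print


def replace_non_bmp(text, replacement='\ufffd'):
    """Return text with every code point above U+FFFF replaced."""
    return ''.join(c if c <= '\uffff' else replacement for c in text)


def escape_non_bmp(text):
    """Return text with non-BMP code points written as \\UXXXXXXXX literals."""
    return ''.join(c if c <= '\uffff' else '\\U%08x' % ord(c) for c in text)


def print_bmp(*args, **kwds):
    """Hypothetical print() wrapper that only passes BMP characters to Tk."""
    _original_print(*(replace_non_bmp(str(a)) for a in args), **kwds)


# To install it for an interactive session (assumption: run inside IDLE):
# builtins.print = print_bmp
```

Swapping replace_non_bmp for escape_non_bmp inside print_bmp keeps the
code point visible in the output instead of dropping it.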
--
Consistently separating words by spaces became a general custom about
the tenth century A. D., and lasted until about 1957, when FORTRAN
abandoned the practice. --- Sun FORTRAN Reference Manual