[issue13153] IDLE crash with unicode bigger than 0xFFFF

Sat Oct 15 07:48:29 CEST 2011

Terry J. Reedy <tjreedy at udel.edu> added the comment:

[Yes, indexing will still be O(1), though I personally consider that less important than most make it to be. Consistency across platforms and total time and space performance of typical apps should be the concern. There is ongoing work on improving the new implementation. Some operations already take less space and run faster.]

The traceback may very well be helpful. It implies that copying a supplementary char does not produce properly encoded utf-8 bytes. Or if it does, tkinter (or tk underneath it) does not recognize them. But then the problem should be the initial byte, not the continuation bytes, which have the same form for all chars: each has 10 as its two high-order bits. See
https://secure.wikimedia.org/wikipedia/en/wiki/Utf-8
for a fuller explanation.
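
To make the lead-byte versus continuation-byte distinction concrete, here is a small sketch that encodes U+104A2 (which I take to be the character pasted in this issue) and checks the high bits of each byte. The variable names are only for illustration.

```python
# Illustration only: inspect the UTF-8 bytes of a supplementary character.
ch = '\U000104A2'   # assumed to be the character from this issue
encoded = ch.encode('utf-8')

lead, continuation = encoded[0], encoded[1:]

# A 4-byte sequence starts with a lead byte of the form 11110xxx.
lead_is_4byte = (lead & 0xF8) == 0xF0

# Every continuation byte has the form 10xxxxxx, whatever the character.
continuation_ok = all((b & 0xC0) == 0x80 for b in continuation)

# Show each byte in binary so the high-order bits are visible.
print([format(b, '08b') for b in encoded])
print(lead_is_4byte, continuation_ok)
```

If the bytes reaching tk were proper utf-8, both checks would hold, so a complaint about a continuation byte suggests the data was not encoded this way in the first place.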

Line 1009 is the definition of Misc.mainloop(). I believe self.tk represents the embedded tcl interpreter, which is a black box from Python's viewpoint. Perhaps we should wrap the call with

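# requires 'import traceback' at the top of the module
try:
    self.tk.mainloop(n)
except Exception as e:
    # print an error message with all info attached to e before exiting
    traceback.print_exception(type(e), e, e.__traceback__)
    raise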

This should catch any miscellaneous crashes which are not otherwise caught and maybe turn the crash issues into bug reports -- the same way that running from the command line did. (It will still be good to catch what we can at error sites and give better, more specific messages.) (What I am not familiar with is how the command line interpreter might turn a tcl error into a python exception and why IDLE does not.)
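
A self-contained sketch of that idea, runnable without tkinter: run_guarded and fake_mainloop are hypothetical names, and fake_mainloop merely stands in for self.tk.mainloop(n) by raising the kind of decode error discussed here. Whether the real failure surfaces as a Python exception at this point is an assumption.

```python
import io
import traceback


def run_guarded(mainloop, n=0, stream=None):
    """Call mainloop(n); on any Exception, write the full traceback
    to stream (sys.stderr when stream is None) and re-raise.

    Hypothetical helper, not part of tkinter.
    """
    try:
        mainloop(n)
    except Exception:
        traceback.print_exc(file=stream)
        raise


def fake_mainloop(n):
    # Stand-in for self.tk.mainloop(n): decode bytes that are not
    # valid utf-8, which raises UnicodeDecodeError.
    b'\xed\xa0\x81'.decode('utf-8')


log = io.StringIO()
try:
    run_guarded(fake_mainloop, stream=log)
except UnicodeDecodeError:
    caught = True
else:
    caught = False

report = log.getvalue()   # the traceback text that would go in a bug report
```

Note that this can only report errors that reach Python as exceptions. A hard crash inside tcl itself would bypass the except clause entirely.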

When I copy '𐒢' and paste it into the command line interpreter or Notepad++, I get '??'. I am guessing that the ?? represents a surrogate pair and that Windows encodes each surrogate separately. The result would be 'illegal' utf-8 with illegal continuation bytes. An application can choose to decode the 'illegal' utf-8 -- or not. Python can when errors='surrogateescape' (or a similar error handler) is specified. It might be possible to access the raw undecoded bytes of the clipboard with the third-party pythonwin module. I do not know if there is any way to do so with tk.
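
The guess above can be simulated in pure Python. This sketch assumes the character is U+104A2 and that each surrogate is encoded on its own; it says nothing about what Windows or tk actually put on the clipboard. It uses the standard 'surrogatepass' and 'surrogateescape' error handlers.

```python
import struct

ch = '\U000104A2'   # assumed to be the character from this issue

# UTF-16 represents a supplementary character as two 16-bit surrogates.
high, low = struct.unpack('>2H', ch.encode('utf-16-be'))

# Encode each surrogate separately. Strict utf-8 refuses lone
# surrogates, so 'surrogatepass' is needed to produce these bytes.
bad_utf8 = (chr(high) + chr(low)).encode('utf-8', 'surrogatepass')

# A strict decoder rejects the result.
try:
    bad_utf8.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' decodes the bytes back to the two lone surrogates.
passed = bad_utf8.decode('utf-8', 'surrogatepass')

# 'surrogateescape' smuggles the undecodable bytes through, so the
# original bytes can be recovered on re-encoding.
escaped = bad_utf8.decode('utf-8', 'surrogateescape')
round_trip = escaped.encode('utf-8', 'surrogateescape')
```

Here the strict decode fails while both lenient handlers succeed, which matches the idea that an application can choose whether or not to accept such bytes.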

I wonder if tcl is calling back to Python for decoding and whether there was a change in the default for errors or the callback specification that would explain a change from 2.7 to 3.2.

Ezio, do you know anything about these speculations?

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13153>
_______________________________________