Decoding bytes to text strings in Python 2

Fri Jun 21 11:49:08 EDT 2024

I'm curious about something I've encountered while updating a very old 
Tk app (originally written in Python 1, but I've ported it to Python 2 
as a first step towards getting it running on modern systems). The app 
downloads emails from a POP server and displays them. At the moment, the 
code is completely unaware of character encodings (which is something I 
plan to fix), and I have found that I don't understand what Python is 
doing when no character encoding is specified.

To demonstrate, I have written this short example program that displays 
a variety of UTF-8 characters to check whether they are decoded 
properly:

---- Example Code ----
import Tkinter as tk

window = tk.Tk()

mytext = """
  \xc3\xa9 LATIN SMALL LETTER E WITH ACUTE
  \xc5\x99 LATIN SMALL LETTER R WITH CARON
  \xc4\xb1 LATIN SMALL LETTER DOTLESS I
  \xef\xac\x84 LATIN SMALL LIGATURE FFL
  \xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q
  \xc2\xbd VULGAR FRACTION ONE HALF
  \xe2\x82\xac EURO SIGN
  \xc2\xa5 YEN SIGN
  \xd0\x96 CYRILLIC CAPITAL LETTER ZHE
  \xea\xb8\x80 HANGUL SYLLABLE GEUL
  \xe0\xa4\x93 DEVANAGARI LETTER O
  \xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57
  \xe2\x99\xa9 QUARTER NOTE
  \xf0\x9f\x90\x8d SNAKE
  \xf0\x9f\x92\x96 SPARKLING HEART
"""

mytext = mytext.decode(encoding="utf-8")
greeting = tk.Label(text=mytext)
greeting.pack()

window.mainloop()
---- End Example Code ----

This works exactly as expected, with all the characters displaying 
correctly.

However, if I comment out the line 'mytext = mytext.decode
(encoding="utf-8")', the program still displays *almost* everything 
correctly. All of the characters appear correctly apart from the two 
four-byte emoji characters at the end, which instead display as four 
characters. For example, the "SNAKE" character actually displays as:
U+00F0 LATIN SMALL LETTER ETH
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
U+FF90 HALFWIDTH KATAKANA LETTER MI
U+FF8D HALFWIDTH KATAKANA LETTER HE

What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii', 
but it's clearly not attempting to display the bytes as ASCII (or 
cp1252, or ISO-8859-1). How is it deciding on some sort of almost-but-
not-quite UTF-8 decoding?

I am using Python 2.7.18 on a Windows 10 system. If there's any other 
relevant information I should provide please let me know.

Many thanks,
Rayner