Decoding bytes to text strings in Python 2
Rayner Lucas
usenet202101 at magic-cookie.co.ukNOSPAMPLEASE
Fri Jun 21 11:49:08 EDT 2024
I'm curious about something I've encountered while updating a very old
Tk app (originally written in Python 1, but I've ported it to Python 2
as a first step towards getting it running on modern systems). The app
downloads emails from a POP server and displays them. At the moment, the
code is completely unaware of character encodings (which is something I
plan to fix), and I have found that I don't understand what Python is
doing when no character encoding is specified.
To demonstrate, I have written this short example program that displays
a variety of UTF-8 characters to check whether they are decoded
properly:
---- Example Code ----
import Tkinter as tk
window = tk.Tk()
mytext = """
\xc3\xa9 LATIN SMALL LETTER E WITH ACUTE
\xc5\x99 LATIN SMALL LETTER R WITH CARON
\xc4\xb1 LATIN SMALL LETTER DOTLESS I
\xef\xac\x84 LATIN SMALL LIGATURE FFL
\xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q
\xc2\xbd VULGAR FRACTION ONE HALF
\xe2\x82\xac EURO SIGN
\xc2\xa5 YEN SIGN
\xd0\x96 CYRILLIC CAPITAL LETTER ZHE
\xea\xb8\x80 HANGUL SYLLABLE GEUL
\xe0\xa4\x93 DEVANAGARI LETTER O
\xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57
\xe2\x99\xa9 QUARTER NOTE
\xf0\x9f\x90\x8d SNAKE
\xf0\x9f\x92\x96 SPARKLING HEART
"""
mytext = mytext.decode(encoding="utf-8")
greeting = tk.Label(text=mytext)
greeting.pack()
window.mainloop()
---- End Example Code ----
This works exactly as expected, with all the characters displaying
correctly.
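For reference, the test strings above cover two-, three-, and four-byte UTF-8 sequences. Here is a minimal sketch (not part of the app; the helper name utf8_sequence_length is just for illustration) that reads the length of each sequence from its lead byte, without decoding anything:

```python
def utf8_sequence_length(lead_byte):
    """Return the UTF-8 sequence length implied by a lead byte value."""
    if lead_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


# A few of the byte strings from the example program.
samples = [
    (b"\xc3\xa9", "LATIN SMALL LETTER E WITH ACUTE"),
    (b"\xe2\x82\xac", "EURO SIGN"),
    (b"\xf0\x9f\x90\x8d", "SNAKE"),
    (b"\xf0\x9f\x92\x96", "SPARKLING HEART"),
]

for raw, name in samples:
    # bytearray yields integers on both Python 2 and Python 3.
    lead = bytearray(raw)[0]
    print("%d bytes: %s" % (utf8_sequence_length(lead), name))
```

Only SNAKE and SPARKLING HEART in the list need four bytes; everything else fits in two or three.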
However, if I comment out the line
'mytext = mytext.decode(encoding="utf-8")', the program still displays
*almost* everything correctly. All of the characters appear correctly
apart from the two four-byte emoji characters at the end, each of which
instead displays as four characters. For example, the "SNAKE" character
actually displays as:
U+00F0 LATIN SMALL LETTER ETH
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
U+FF90 HALFWIDTH KATAKANA LETTER MI
U+FF8D HALFWIDTH KATAKANA LETTER HE
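Comparing those four code points with the four bytes of SNAKE shows a pattern: the first byte comes through unchanged, and each of the other three appears with 0xFF00 added. Here is a small sketch of that arithmetic (the variable names are mine, and it only describes the symptom, not the cause):

```python
# The four UTF-8 bytes of SNAKE, as integers (works on Python 2 and 3).
snake_bytes = bytearray(b"\xf0\x9f\x90\x8d")

# The four code points that were actually displayed, as listed above.
displayed = [0x00F0, 0xFF9F, 0xFF90, 0xFF8D]

# Difference between each displayed code point and the byte at that position.
offsets = [shown - raw for raw, shown in zip(snake_bytes, displayed)]

for raw, shown, offset in zip(snake_bytes, displayed, offsets):
    print("byte 0x%02X shown as U+%04X (offset 0x%04X)" % (raw, shown, offset))
```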
What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii',
but it's clearly not attempting to display the bytes as ASCII (or
cp1252, or ISO-8859-1). How is it deciding on some sort of almost-but-
not-quite UTF-8 decoding?
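To back up the claim that none of those three encodings matches, here is a rough sketch that decodes the SNAKE bytes with each of them in plain Python, with no Tk involved (the helper name try_decode is made up for this example):

```python
snake = b"\xf0\x9f\x90\x8d"


def try_decode(raw, encoding):
    """Decode raw bytes and describe the result, or report the failure."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc.reason
    return " ".join("U+%04X" % ord(ch) for ch in text)


for encoding in ("ascii", "cp1252", "iso-8859-1"):
    print("%-10s %s" % (encoding, try_decode(snake, encoding)))
```

Python's ascii and cp1252 codecs both refuse these bytes, and iso-8859-1 maps each byte to the code point with the same value, so none of them produces the halfwidth katakana shown in the window.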
I am using Python 2.7.18 on a Windows 10 system. If there's any other
relevant information I should provide please let me know.
Many thanks,
Rayner