[Tutor] unicode question

Mon Jun 6 17:56:29 EDT 2022

On 06/06/2022 22:04, Alex Kleider wrote:
> I've been playing around with unicode a bit and found that the
> following code doesn't behave as I might have expected:
>
> Python 3.9.2 (default, Feb 28 2021, 17:03:44)
> [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print("\N{middle dot}")
> ·
> >>>
> >>> middle_dot = '0140', '013F', '00B7', '2027'
> >>> ucode = ['\\u' + dot for dot in middle_dot]
> >>> for dot in ucode:
> ...     print(dot)
> ...
> \u0140
> \u013F
> \u00B7
> \u2027
> >>> print("\u0140")
> ŀ
> >>> print("\u013f")
> Ŀ
> >>> print("\u00b7")  # the one I want
> ·
> >>> print("\u2027")
> ‧
> >>>
>
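> I was expecting the for loop to output the same as the last four print
> statements but, alas, not so.

"\\u" is a string containing a backslash followed by a "u", and that
doesn't change when you concatenate another string like "0140". An
escape such as \u0140 is only interpreted when Python compiles a string
literal; joining the pieces at runtime just gives you the six
characters \, u, 0, 1, 4, 0.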
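
To make the difference visible, here is a minimal sketch (the variable
names are only for illustration, they are not from the original
session) comparing a literal with a string built at runtime:

```python
# An escape sequence is resolved when the literal is compiled,
# not when strings are joined at runtime.
literal = "\u0140"        # one character: small l with middle dot
built = "\\u" + "0140"    # six characters: backslash, u, 0, 1, 4, 0

print(len(literal))       # 1
print(len(built))         # 6
print(built == literal)   # False
```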

The easiest way to write the loop is to use integers and chr():

 >>> for i in 0x140, 0x13f: print(chr(i))

ŀ
Ŀ

If you want to start from strings, the obvious way is

 >>> for c in "0140", "013f":
     print(eval(f"'\\u{c}'"))  # dangerous, may execute arbitrary code

ŀ
Ŀ

with the safe alternative

 >>> import codecs
 >>> for c in "0140", "013f":
     print(codecs.decode(f"\\u{c}", "unicode-escape"))

ŀ
Ŀ
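
If the strings are plain hex digits, another route that avoids escape
processing altogether is to parse them with int(c, 16) and hand the
result to chr(). A small sketch follows; char_from_hex is a helper name
made up for this example, and unicodedata.lookup() is included only
because the original session started from \N{middle dot}:

```python
import unicodedata


def char_from_hex(hex_digits):
    """Return the character whose code point is given as a hex string."""
    return chr(int(hex_digits, 16))


# The four code points from the original question.
for c in "0140", "013F", "00B7", "2027":
    ch = char_from_hex(c)
    print(c, ch, unicodedata.name(ch))

# Runtime equivalent of the "\N{middle dot}" literal.
middle_dot = unicodedata.lookup("MIDDLE DOT")
print(middle_dot == char_from_hex("00B7"))
```

This never evaluates or decodes the input as Python source, so a
malformed string simply raises ValueError in int().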