Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin, that I might share some of the info I found while recently diving into this topic.

As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravenscroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I assumed that the inclusion of Unicode for identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and will overlap with their multi-character ASCII forms. So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).

To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC normalization. Here are hello.py, and a snippet from unittest/utils.py:

def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))

if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
    𝒉eℓˡ𝗈()

# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12

def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
    ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
    if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
        𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
    return ₛ

You should be able to paste these into your local UTF-8-aware editor or IDE and execute them as-is. (If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com) <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have a second gist containing the transformer, but it is still a private gist atm.)

Some other discoveries:

“·” (U+00B7 MIDDLE DOT; code point 183 is Latin-1, not ASCII) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join('x')” could be easily misread as “s.join('x')”, but would actually be a call to potentially malicious method “s·join”.

“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.

Potential beneficial uses: I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors: module names using characters from one group, builtins from another, program variables from another, maybe distinguishing local from global variables. Colorizing has always been an obvious syntax highlighting feature, but it is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.

-- Paul McGuire
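(To see the collapse concretely: a minimal stdlib-only sketch, with illustrative sample strings; output shown in comments.)

import unicodedata

# Compatibility characters fold to their ASCII forms under NFKC, so
# visually distinct identifiers can collapse to the same name...
for s in ["ªº", "ﬁn", "𝚑𝓮𝖑𝒍𝑜"]:
    print(f"{s!r} -> {unicodedata.normalize('NFKC', s)!r}")
# 'ªº' -> 'ao', 'ﬁn' -> 'fin', '𝚑𝓮𝖑𝒍𝑜' -> 'hello'

# ...but true homoglyphs survive normalization untouched:
print(unicodedata.normalize("NFKC", "\N{GREEK CAPITAL LETTER ALPHA}") == "A")  # False

# The middle dot and the underscore-like presentation forms are legal
# in identifier-body position only:
print("_···".isidentifier())   # True
print("a︳".isidentifier())    # True
print("︳a".isidentifier())    # False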

This is my favourite version of the issue:

е = lambda е, e: е if е > e else e
print(е(2, 1), е(1, 2))
# python 3 outputs: 2 2

https://twitter.com/stestagg/status/685239650064162820?s=21

Steve

On Sat, 13 Nov 2021 at 22:05, <ptmcg@austin.rr.com> wrote:
[snip]

On 11/13/2021 4:35 PM, ptmcg@austin.rr.com wrote:
[snip]
Wow. After pasting the util.py snippet into current IDLE, which on my Windows machine* displays the complete text:
>>> dir()
['_PLACEHOLDER_LEN', '__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_shorten']
>>> _shorten('abc', 1, 1)
'abc'
>>> _shorten('abcdefghijklmnopqrw', 2, 2)
'ab[15 chars]rw'
* Does not at all work in CommandPrompt, even after supposedly changing to a utf-8 codepage with 'chcp 65000'. -- Terry Jan Reedy

On 11/13/21, Terry Reedy <tjreedy@udel.edu> wrote:
[snip]
* Does not at all work in CommandPrompt
It works for me when pasted into the REPL using the console in Windows 10. I pasted the code into a raw multiline string assignment and then executed the string with exec(). The only issue is that most of the pasted characters are displayed using the font's default glyph since the console host doesn't have font fallback support. Even Windows Terminal doesn't have font fallback support yet in the command-line editing mode that Python's REPL uses. But Windows Terminal does implement font fallback for normal output rendering, so if you assign the pasted text to string `s`, then print(s) should display properly.
even after supposedly changing to a utf-8 codepage with 'chcp 65000'.
Changing the console code page is unnecessary with Python 3.6+, which uses the console's wide-character API. Also, even though it's irrelevant for the REPL, UTF-8 is code page 65001. Code page 65000 is UTF-7.

On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com> wrote:
def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
Wow. Just Wow.

So why does Python apply NFKC normalization to variable names?? I can't for the life of me figure out why that would be helpful at all. The string methods, sure, but names?

And, in fact, the normalization is not used for string comparisons or hashes as far as I can tell:

In [36]: weird
Out[36]: 'ᵖ𝖗𝐢𝘯𝓽'

In [37]: normal
Out[37]: 'print'

In [38]: eval(weird + "('yup, that worked')")
yup, that worked

In [39]: weird == normal
Out[39]: False

In [40]: weird[0] in normal
Out[40]: False

This seems very odd (and dangerous) to me. Is there a good reason? and is it too late to change it?

-CHB
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 2021-11-14 17:17, Christopher Barker wrote:
On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com <mailto:ptmcg@austin.rr.com>> wrote:
def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
Wow. Just Wow.
So why does Python apply NFKC normalization to variable names?? I can't for the life of me figure out why that would be helpful at all.
The string methods, sure, but names?
And, in fact, the normalization is not used for string comparisons or hashes as far as I can tell.
[snip]
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", which are different ways of writing the same thing. Unfortunately, it goes too far, because it's unlikely that we want "ᵖ" ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN SMALL LETTER P}").
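(The difference between the two forms is easy to check with the stdlib unicodedata module; a small sketch, output in comments:)

import unicodedata

e_decomposed = "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}"
e_composed = "\N{LATIN SMALL LETTER E WITH ACUTE}"
print(e_decomposed == e_composed)                                # False
print(unicodedata.normalize("NFC", e_decomposed) == e_composed)  # True: NFC composes

p_modifier = "\N{MODIFIER LETTER SMALL P}"
print(unicodedata.normalize("NFC", p_modifier))    # 'ᵖ': NFC leaves it alone
print(unicodedata.normalize("NFKC", p_modifier))   # 'p': NFKC also folds compatibility characters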

On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:
So why does Python apply NFKC normalization to variable names??
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", which are different ways of writing the same thing.
sure, but this is code, written by humans (or meta-programming). Maybe I'm showing my English bias, but would it be that limiting to have identifiers be based on code points, period? Why does someone who wants to use, e.g., "é" in an identifier have to be able to represent it two different ways in a code file? But if so ...
Unfortunately, it goes too far, because it's unlikely that we want "ᵖ" ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN SMALL LETTER P}").
Is it possible to only capture things like the combining characters and not the "equivalent" ones like the above?

-CHB

On Sun, 14 Nov 2021, 19:07 Christopher Barker, <pythonchb@gmail.com> wrote:
On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:
Unfortunately, it goes too far, because it's unlikely that we want "ᵖ" ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN SMALL LETTER P}").
Is it possible to only capture things like the combining characters and not the "equivalent" ones like the above?
Yes, that is NFC. NFKC converts to equivalent characters and also composes; NFC just composes.

On 11/14/21 2:07 PM, Christopher Barker wrote:
Why does someone that wants to use, .e.g. "é" in an identifier have to be able to represent it two different ways in a code file?
The issue here is that, fundamentally, some editors will produce composed characters and some decomposed characters to represent the same actual 'character'. These two methods are defined by Unicode to really represent the same 'character'; it is just that some defined sequences of combining code points happen to have a composed 'abbreviation' defined as well. Requiring an exact byte-sequence match means that some people will have a VERY hard time entering usable code if their tools support Unicode but use the other convention. -- Richard Damon

On Sun, Nov 14, 2021, 2:14 PM Christopher Barker wrote:
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}", which are different ways of writing the same thing.
Why does someone that wants to use, .e.g. "é" in an identifier have to be able to represent it two different ways in a code file?
Imagine that two different programmers work with the same code base, and their text editors or keystrokes enter "é" in different ways. Or imagine just one programmer doing so on two different machines/environments. As an example, I wrote this reply on my Android tablet (with such-and-such OS version). I have no idea what actual codepoint(s) are entered when I press and hold the "e" key for a couple seconds to pop up character variations. If I wrote it on OSX, I'd probably press "alt-e e" on my US International key layout. Again, no idea what codepoints actually are entered. If I did it on Linux, I'd use "ctrl-shift u 00e9". In that case, I actually know the codepoint.

On 11/14/21 2:36 PM, David Mertz, Ph.D. wrote:
[snip]
If I wrote it on OSX, I'd probably press "alt-e e" on my US International key layout. Again, no idea what codepoints actually are entered. If I did it on Linux, I'd use "ctrl-shift u 00e9". In that case, I actually know the codepoint.
But you would have to look up the actual number to enter them. Imagine if ALL your source code had to be entered via code-point numbers. BTW, you should be able to enable 'composing' under Linux too, just like under OSX, with the right input driver loaded. -- Richard Damon

Indeed, normative annex https://www.unicode.org/reports/tr31/tr31-35.html section 5 says: "if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate" (vs NFKC for a language with case-insensitive identifiers), so to follow the standard we should have used NFC rather than NFKC. Not sure if it's too late to fix this "oops" in future Python versions.

Alex

On Sun, Nov 14, 2021 at 9:17 AM Christopher Barker <pythonchb@gmail.com> wrote:
[snip]

Compatibility variants can look different, but they can also look identical. Allowing any non-ASCII characters was worrisome because of the security implications of confusables. Squashing compatibility characters seemed the more conservative choice at the time. Stestagg's example:

е = lambda е, e: е if е > e else e

shows it wasn't perfect, but adding more invisible differences does have risks, even beyond the backwards incompatibility and the problem with (hopefully rare, but are we sure?) editors that don't distinguish between them in the way a programming language would prefer.

I think (but won't swear) that there were also several problematic characters that really should have been treated as (at most) glyph variants, but ... weren't. If I Recall Correctly, the largest number were Arabic presentation forms, but there were also a few characters that were in Unicode only to support round-trip conversion with a legacy charset, even if that charset had been declared buggy. In at least a few of these cases, it seemed likely that a beginning user would expect them to be equivalent.

-jJ

ptmcg@austin.rr.com wrote:
... add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair.
There is a Unicode tech report about confusables, but it is never clear where to stop. Are I (upper-case I), l (lower-case L) and 1 (numeric 1) from ASCII already a problem? And if we do it at all, is there any way to avoid making Cyrillic languages second-class? I'm not quickly finding the contemporary report, but these should be helpful if you want to go deeper:

http://www.unicode.org/reports/tr36/
http://unicode.org/reports/tr36/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp
I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”.
Here I don't see the problem. Things that look slightly different are really the same, and you can write it either way. So you can use what looks like a funny font, but the closest it comes to a security risk is that maybe you could access something without a casual reader realizing that you are doing so. They would know that you *could* access it, just not that you *did*.
Some other discoveries: “·” (U+00B7 MIDDLE DOT) is a valid identifier body character, making “_···” a valid Python identifier.
That and the apostrophe are Unicode consortium regrets, because they are normally punctuation, but there are also languages that use them as letters. The apostrophe is (supposedly only) used by Afrikaans; I asked a native speaker about where/how often it was used, and the similarity to Dutch was enough that Guido felt comfortable excluding it. (It *may* have been similar to using the apostrophe for a contraction in English, and saying it therefore represents a letter, but the scope was clearly smaller.) But the dot is used in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed for sensible identifiers. It is worth listing as a warning, and linters should probably complain.
“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
So go ahead and warn, but it isn't clear how that could be abused to look like something other than a syntax error, except maybe through soft keywords. (Ha! I snuck in a call to async︳def that had been imported with *, and you didn't worry about the import *, or the apparently wild cursor position marker, or the strange async definition that was never used! No way I could have just issued a call to _flush and done the same thing!)
Potential beneficial uses: I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors.
I kind of like the idea, but ... if you're doing it on-the-fly in the editor, you could just use different fonts. If you're actually saving those changes, it seems likely to lead to a lot of spurious diffs if anyone uses a different editor. -jJ

Out of all the approximately thousand bazillion ways to write obfuscated Python code, which may or may not be malicious, why are Unicode confusables worth this level of angst and concern?

I looked up "Unicode homoglyph" on CVE, and found a grand total of seven hits:

https://www.cvedetails.com/google-search-results.php?q=unicode+homoglyph

all of which appear to be related to impersonation of account names. I daresay if I expanded my search terms, I would probably find some more, but it is clear that Unicode homoglyphs are not exactly a major threat.

In my opinion, the other Steve's (Stestagg) example of obfuscated code with homoglyphs for e (as well as a few similar cases, such as homoglyphs for A) mostly makes for an amusing curiosity, perhaps worth a plugin for Pylint and other static checkers, but not much more. I'm not entirely sure what Paul's more lurid examples are supposed to indicate. If your threat relies on a malicious coder smuggling in identifiers like "𝚑𝓮𝖑𝒍𝑜" or "ªº" and having the reader not notice, then I'm not going to lose much sleep over it.

Confusable account names and URL spoofing are proven, genuine threats. Beyond that, IMO the actual threat window from confusables is pretty small. Yes, you can write obfuscated code, and smuggle in calls to unexpected functions:

result = lеn(sequence)  # Cyrillic letter small Ie

but you still have to smuggle in a function to make it work:

def lеn(obj):
    # something malicious

And if you can do that, the Unicode letter is redundant. I'm not sure why any attacker would bother.

-- Steve
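(Spelled out as a runnable sketch: the "е" in "lеn" below is CYRILLIC SMALL LETTER IE, U+0435. NFKC leaves Cyrillic letters alone, so the two names stay distinct and nothing in the interpreter flags the shadowing.)

import unicodedata

def lеn(obj):   # Cyrillic е: a stand-in for "something malicious"
    return -1

print(lеn("sequence"), len("sequence"))               # -1 8
print(unicodedata.normalize("NFKC", "lеn") == "len")  # False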

On Sun, Nov 14, 2021 at 4:53 PM Steven D'Aprano <steve@pearwood.info> wrote:
Out of all the approximately thousand bazillion ways to write obfuscated Python code, which may or may not be malicious, why are Unicode confusables worth this level of angst and concern?
I for one am not full of angst nor particularly concerned. Though it's a fine idea to inform folks about these issues.

I am, however, surprised and disappointed by the NFKC normalization. For example, in writing math we often use different scripts to mean different things (e.g. TeX's Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.

Then there's the question of when this normalization happens (and when it doesn't). If one is doing any kind of metaprogramming, even just using getattr() and setattr(), things could get very confusing:

In [55]: class Junk:
    ...:     𝗵e𝓵𝔩º = "hello"
    ...:

In [56]: setattr(Junk, "ᵖ𝖗𝐢𝘯𝓽", "print")

In [57]: dir(Junk)
Out[57]: [<snip> '__weakref__', 'hello', 'ᵖ𝖗𝐢𝘯𝓽']

In [58]: Junk.hello
Out[58]: 'hello'

In [59]: Junk.𝗵e𝓵𝔩º
Out[59]: 'hello'

In [60]: Junk.print
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-f2a7d3de5d06> in <module>
----> 1 Junk.print
AttributeError: type object 'Junk' has no attribute 'print'

In [61]: Junk.ᵖ𝖗𝐢𝘯𝓽
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-004f4c8b2f07> in <module>
----> 1 Junk.ᵖ𝖗𝐢𝘯𝓽
AttributeError: type object 'Junk' has no attribute 'print'

In [62]: getattr(Junk, "ᵖ𝖗𝐢𝘯𝓽")
Out[62]: 'print'

Would a proposal to switch the normalization to NFC only have any hope of being accepted? and/or adding normalization to setattr() and maybe other places where names are set in code?

-CHB

Christopher Barker writes:
Would a proposal to switch the normalization to NFC only have any hope of being accepted?
Hope, yes. Counting you, it's been proposed twice. :-) I don't know whether it would get through. We know this won't affect the stdlib, since that's restricted to ASCII. I suppose we could trawl PyPI and GitHub for "compatibles" (the Unicode term for "K" normalizations).
For example, in writing math we often use different scripts to mean different things (e.g. TeX's Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.
Independent of the question of the normalization of Python identifiers, I think using those characters this way is a bad idea. In fact, I think adding these symbols to Unicode was a bad idea; they should be handled at a higher level in the linguistic stack (by semantic markup).

You're confusing two things here. In Unicode, a script is a collection of characters used for a specific language, typically a set of Unicode blocks of characters (more or less; there are a lot of Han ideographs that are recognizable as such to Japanese but are not part of the repertoire of the Japanese script). That is, these characters are *different* from others that look like them.

Blackboard Bold is more what we would usually call a "font": the (math) italic "x" and the (math) bold italic "x" are the same "x", but one denotes a scalar and the other a vector in many math books. A roman "R" probably denotes the statistical application, an italic "R" the reaction function in a game theory model, and a Blackboard Bold "R" the set of real numbers. But these are all the same character.

It's a bad idea to rely on different (Unicode) scripts that use the same glyphs for different characters to look different from each other, unless you "own" the fonts to be used. As far as I know there's no way for a Python program to specify the font to be used to display itself though. :-)

It's also a UX problem. At a slightly higher layer in the stack, I'm used to using Japanese input methods to input sigma and pi which produce characters in the Greek block, and at least the upper case forms that denote sum and product have separate characters in the math operators block. I understand why people who literally write mathematics in Greek might want those not normalized, but I sure am going to keep using "Greek sigma", not "math sigma"! The probability that I'm going to have a Greek uppercase sigma in my papers is nil, the probability of a summation symbol near unity. But the summation symbol is not easily available, I have to scroll through all the preceding Unicode blocks to find Mathematical Operators. So I am perfectly happy with uppercase Greek sigma for that role (as is XeTeX!!)

And the thing is, of course those Greek letters really are Greek letters: they were chosen because pi is the homophone of p which is the first letter of "product", and sigma is the homophone of s which is the first letter of "sum". Å for Ångström is similar, it's the initial letter of a Swedish name.

Sure, we could fix the input methods (and search methods!! -- people are going to input the character they know that corresponds to the glyph *they* see, not the bit pattern the *CPU* sees). But that's as bad as trying to fix mail clients. Not worth the effort because I'm pretty sure you're gonna fail -- it's one of those "you'll have to pry this crappy software that annoys admins around the world from my cold dead fingers" issues, which is why their devs refuse to fix them.

Steve

On 15. 11. 21 9:25, Stephen J. Turnbull wrote:
Christopher Barker writes:
Would a proposal to switch the normalization to NFC only have any hope of being accepted?
Hope, yes. Counting you, it's been proposed twice. :-) I don't know whether it would get through. We know this won't affect the stdlib, since that's restricted to ASCII. I suppose we could trawl PyPI and GitHub for "compatibles" (the Unicode term for "K" normalizations).
I don't think PyPI/GitHub are good resources to trawl. Non-ASCII identifiers were added for the benefit of people who use non-English languages. But the projects on both PyPI and GitHub are overwhelmingly written in English -- especially if you look at the more popular projects. It would be interesting to reach out to the target audience here... but they're not on this list, either.

Do we actually know anyone using this? I do teach beginners in a non-English language, but tell them that they need to learn English if they want to do any serious programming. Any code that's to be shared more widely than a country effectively has to be in English.

It seems to me that at the level where you worry about supply chain attacks and you're doing code audits, something like CPython's policy (ASCII only except proper names and Unicode-related tests) is a good idea. Or not? I don't know anyone who actually uses non-ASCII identifiers for a serious project.
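(Such a policy is easy to check mechanically; a rough sketch with the stdlib tokenize module, where the helper name and sample source are illustrative:)

import io
import tokenize

def non_ascii_names(source):
    """Yield (position, name) for each non-ASCII identifier token."""
    # The tokenizer reports identifiers exactly as written; NFKC
    # normalization only happens later, in the parser.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            yield tok.start, tok.string

src = "𝚑𝓮𝖑𝒍𝑜 = 1\nworld = 2\n"
for (row, col), name in non_ascii_names(src):
    print(f"line {row}, col {col}: {name!r}")   # flags only the first name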

Stephen J. Turnbull wrote:
Christopher Barker writes:
For example, in writing math we often use different scripts to mean different things (e.g. TeX's Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.
Agreed, for careful writers. But Stephen's answer about people using the wrong one and expecting it to work means that normalization is probably the lesser of evils for most people, and the ones who don't want it normalized are more likely to be able to specify custom processing when it is important enough. (The compatibility characters aren't normalized in strings, largely because that should still be possible.)
In fact, I think adding these symbols to Unicode was a bad idea; they should be handled at a higher level in the linguistic stack (by semantic markup).
When I was a math student, these were clearly different symbols, with much less relation to each other than a mere case difference. So by the Unicode consortium's goals, they are independent characters that should each be defined. I admit that isn't ideal for most use cases outside of math, but ... supporting those other cases is what compatibility normalization is for.
It's also a UX problem. At a slightly higher layer in the stack, I'm used to using Japanese input methods to input sigma and pi which produce characters in the Greek block, and at least the upper case forms that denote sum and product have separate characters in the math operators block. I understand why people who literally write mathematics in Greek might want those not normalized, but I sure am going to keep using "Greek sigma", not "math sigma"! The probability that I'm going to have a Greek uppercase sigma in my papers is nil, the probability of a summation symbol near unity. But the summation symbol is not easily available, I have to scroll through all the preceding Unicode blocks to find Mathematical Operators. So I am perfectly happy with uppercase Greek sigma for that role (as is XeTeX!!)
I think that is mostly a backwards compatibility problem; XeTeX itself had to worry about compatibility with TeX (which preceded Unicode) and with the fonts actually available and then with earlier versions of XeTeX. -jJ

Executive summary: I guess the bottom line is that I'm sympathetic to both the NFC and NFKC positions.

I think that wetware is such that people will rarely go to the trouble of picking out a letter-like symbol from a palette, and in my environment that's not going to happen at all, because I use Japanese phonetic input to get most symbols ("sekibun" = integral, "siguma" = sigma), and I don't use calligraphic R for the real line, I use \newcommand{\R}{{\cal R}}, except on a physical whiteboard, where I use blackboard bold (go figure that one out!) So to my mind the letter-like block in Unicode is a failed experiment.

Jim J. Jewett writes:
When I was a math student, these were clearly different symbols, with much less relation to each other than a mere case difference.
Arguable. The letter-like symbols block has script (cursive), blackboard bold, and Fraktur versions of R. I've seen all of them, as well as plain Roman, bold, italic and bold italic faces, used to denote the real line, and I've personally used most of them for that purpose depending on availability of fonts and input methods and medium (i.e., computer text vs. hand-written). I've also seen several of them used for reaction functions or spaces thereof in game theory (although blackboard bold and Fraktur seem to be used uniquely for the real line). Clearly the common denominator is the uppercase Latin letter "R", and the glyph being recognizably "R" is necessary and sufficient to each of those purposes.

The story for uppercase sigma as sum is somewhat similar: sum is by far not the only use of that letter, although I don't know of any other operator symbol for sum over a set or series (outside of programming languages, which I think we can discount).

I agree that we should consider math to be a separate language, but it doesn't have a consistent script independent of the origins of the symbols. Even today none of my engineering and economics students can type any symbols except those in the JIS repertoire, which they type by original name ("siguma", "ramuda", "arefu", "yajirushi" == arrow, etc.; "sekibun" == integration does bring up the integral sign in at least some modern input methods, but it doesn't have a script name, while "kasann" == addition does not bring up sigma, although "siguma" does, and "essu" brings up sigma -- but only in "ASCII emoji" strings, go figure).

I have seen students use fullwidth R for the real line, though, but distinguishing that is a deprecated compatibility feature of Unicode (and of Japanese practice -- even in very formal university documents such as grade reports for a final doctoral examination I've seen numbers and names containing mixed half-width and full-width ASCII).

So I think "letter-like" was a reasonable idea (I'm pretty sure this block goes back to the '90s but I'm too lazy to check), but it hasn't turned out well, and I doubt it ever will.
So by the Unicode consortium's goals, they are independent characters that should each be defined. I admit that isn't ideal for most use cases outside of math,
I don't think it even makes sense *inside* of math for the letter-like symbols. The nature of math means that any "R" will be grabbed for something whose name starts with "r" as soon as that's convenient. Something like the integral sign (which is a stretched "S" for "sum"), OK -- although category theory uses that for "ends" which still don't look anything like integrals even if you turn them inside out, rotate 90 degrees, and paint them blue.
It's also a UX problem. At a slightly higher layer in the stack, I'm used to using Japanese input methods to input sigma and pi which produce characters in the Greek block, and at least the upper case forms that denote sum and product have separate characters in the math operators block.
I think that is mostly a backwards compatibility problem; XeTeX itself had to worry about compatibility with TeX (which preceded Unicode) and with the fonts actually available and then with earlier versions of XeTeX.
IMO, the analogy fails because the backward compatibility issue for Unicode is in the wetware, not in the software. Steve

On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
I am, however, surprised and disappointed by the NFKC normalization.
For example, in writing math we often use different scripts to mean different things (e.g. TeX's Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.
Hmmm... would you really want these to all be different identifiers?

𝕭 𝓑 𝑩 𝐁 B

You're assuming the reader of the code has the right typeface to view them (rather than as mere boxes), and that their eyesight is good enough to distinguish the variations even if their editor applies bold or italic as part of syntax highlighting. That's very bold of you :-)
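(For reference, a one-line check with the stdlib unicodedata shows NFKC currently folds every one of them to the same name:)

import unicodedata

print({unicodedata.normalize("NFKC", ch) for ch in "𝕭𝓑𝑩𝐁B"})   # {'B'}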
Then there's the question of when this normalization happens (and when it doesn't). If one is doing any kind of metaprogramming, even just using getattr() and setattr(), things could get very confusing:
For ordinary identifiers, they are normalised at some point during compilation or interpretation. It probably doesn't matter exactly when. Strings should *not* be normalised when using subscripting on a dict, not even on globals(): https://bugs.python.org/issue42680 I'm not sure about setattr and getattr. I think that they should be normalised. But apparently they aren't:
>>> from types import SimpleNamespace
>>> obj = SimpleNamespace(B=1)
>>> setattr(obj, '𝕭', 2)
>>> obj
namespace(B=1, 𝕭=2)
>>> obj.B
1
>>> obj.𝕭
1
See also here: https://bugs.python.org/issue35105 -- Steve

On 15.11.2021 12:36, Steven D'Aprano wrote:
[snip]
In any case, the question of NFKC versus NFC was certainly considered, but unfortunately PEP 3131 doesn't document why NFKC was chosen.
https://www.python.org/dev/peps/pep-3131/
Before we change the normalisation rules, it would probably be a good idea to trawl through the archives of the mailing list and work out why NFKC was chosen in the first place, or contact Martin von Löwis and see if he remembers.
This was raised in the discussion, but never conclusively answered:

https://mail.python.org/pipermail/python-3000/2007-May/007995.html

NFKC is the standard normalization form when you want to remove any typography-related variants/hints from the text before comparing strings. See http://www.unicode.org/reports/tr15/

I guess that's why Martin chose this form, since the point was to maintain readability, even if different variants of a character are used in the source code. A "B" in the source code should be interpreted as an ASCII B, even when written as 𝕭 𝓑 𝑩 or 𝐁. This simplifies writing code and does away with many of the security issues you could otherwise run into (where e.g. the absence of an identifier causes the application flow to be different).
Then there's the question of when this normalization happens (and when it doesn't).
It happens in the parser when reading a non-ASCII identifier (see Parser/pegen.c), so it only applies to source code, not to attributes you dynamically add to e.g. class or module namespaces. -- Marc-Andre Lemburg
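(A small sketch of that boundary, with arbitrary class and attribute names: identifiers written in source are NFKC-normalized by the parser, while dynamic lookups use the raw string.)

class Junk:
    pass

setattr(Junk, "ᵖ", "set dynamically")   # stored under the raw key "ᵖ"
Junk.p = "set in source"                # plain ASCII attribute

print(Junk.ᵖ)                # the parser normalizes ᵖ -> p: prints 'set in source'
print(getattr(Junk, "ᵖ"))    # no normalization: prints 'set dynamically'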

Well,

Yet another issue is adding vulnerabilities in plain sight. Human code reviewers will see this:

if user.admin == "something":

Static analysers will see

if user.admin == "something<hidden chars>":

but will not flag it, as it's up to the user to verify the logic of things, and as such software authors can plant backdoors in plain sight.

Kind Regards,
Abdur-Rahmaan Janhangeer
Mauritius

On Mon, Nov 15, 2021 at 12:33:54PM +0400, Abdur-Rahmaan Janhangeer wrote:
Yet another issue is adding vulnerabilities in plain sight. Human code reviewers will see this:
if user.admin == "something":
Static analysers will see
if user.admin == "something<hidden chars>":
Okay, you have a string literal with hidden characters. Assuming that your editor actually renders them as invisible characters, rather than "something???" or "something□□□" or "something���" or equivalent. Now what happens? Where do you go from there to a vulnerability or backdoor?

I think it might be a bit obvious that there is something funny going on if I see:

if (user.admin == "root" and check_password_securely()
        or user.admin == "root"  # Second string has hidden characters, do not remove it.
        ):
    elevate_privileges()

even without the comment :-)

In another thread, Serhiy already suggested we ban invisible control characters (other than whitespace) in comments and strings:

https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A...

I think that is a good idea. But beyond the C0 and C1 control characters, we should be conservative about banning "hidden characters" without a *concrete* threat. For example, variation selectors are "hidden", but they change the visual look of emoji and other characters. Even if you think that being able to set the skin tone of your emoji or choose different national flags using variation selectors is pure frippery, they are also necessary for Mongolian and some CJK ideographs.

http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors

I'm not sure about bidirectional controls; I have to leave that to people with more experience in bidirectional text than I have. I think that many editors in common use don't support bidirectional text, or at least the ones I use don't seem to support it fully or correctly. But for what little it is worth, my feeling is that people who use RTL or bidirectional strings, and have editors that support them, will be annoyed if we ban them from strings for the comfort of people who may never in their life come across a string containing such bidirectional text.

But, if there is a concrete threat beyond "it looks weird", that is another issue.
but will not flag it as it's up to the user to verify the logic of things
There is no reason why linters and code checkers shouldn't check for invisible characters, Unicode confusables or mixed script identifiers and flag them. The interpreter shouldn't concern itself with such purely stylistic issues unless there is a concrete threat that can only be handled by the interpreter itself. -- Steve
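(For what such a linter check might look like: a rough sketch, not a vetted security tool; the character list and helper name are illustrative.)

import unicodedata

# The explicit BIDI embedding/override/isolate controls and PDF/PDI.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def flag_suspicious(text):
    """Yield (index, code point, description) for characters worth flagging."""
    for i, ch in enumerate(text):
        if ch in BIDI_CONTROLS:
            yield i, f"U+{ord(ch):04X}", unicodedata.name(ch)
        elif unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            yield i, f"U+{ord(ch):04X}", "C0/C1 control character"

line = 'x = "abc\u202e"  # comment'
for pos, cp, desc in flag_suspicious(line):
    print(pos, cp, desc)   # 8 U+202E RIGHT-TO-LEFT OVERRIDE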

Greetings,
Now what happens? Where do you go from there to a vulnerability or backdoor? I think it might be a bit obvious that there is something funny going on if I see:
if (user.admin == "root" and check_password_securely()
        or user.admin == "root"  # Second string has hidden characters, do not remove it.
        ):
    elevate_privileges()

Well, it's not so obvious. From Ross Anderson and Nicholas Boucher:

https://trojansource.codes/trojan-source.pdf

See appendix H for Python, with implementations:

https://github.com/nickboucher/trojan-source/tree/main/Python

These rely precisely on bidirectional control chars and/or replacing look-alikes.
There is no reason why linters and code checkers shouldn't check for invisible characters, Unicode confusables or mixed script identifiers and flag them. The interpreter shouldn't concern itself with such purely stylistic issues unless there is a concrete threat that can only be handled by the interpreter itself.
I mean current linters. But it will be good to check those for sure. As a programmer, i don't want a language which bans unicode stuffs. If there's something that should be fixed, it's the unicode standard, maybe defining a sane mode where weird unicode stuffs are not allowed. Can also be from language side in the event where it's not being considered in the standard itself.

I don't see it as a language fault nor as a client fault as they are considering the unicode docs but the response was mixed with some languages decided to patch it from their side, some linters implementing detection for it as well as some editors flagging it and rendering it as the exploit intended.

Kind Regards,
Abdur-Rahmaan Janhangeer
Mauritius

On Mon, Nov 15, 2021 at 10:22 PM Abdur-Rahmaan Janhangeer <arj.python@gmail.com> wrote:
[snip]
Well, it's not so obvious. From Ross Anderson and Nicholas Boucher src: https://trojansource.codes/trojan-source.pdf
See appendix H. for Python.
with implementations:
https://github.com/nickboucher/trojan-source/tree/main/Python
Rely precisely on bidirectional control chars and/or replacing look alikes
The point of those kinds of attacks is that syntax highlighters and related code review tools would misinterpret them. So I pulled them all up in both GitHub's view and the editor I personally use (SciTE, albeit a fairly old version now). GitHub specifically flags it as a possible exploit in a couple of cases, but also syntax highlights the return keyword appropriately. SciTE doesn't give any sort of warnings, but again, correctly highlights the code: early-return shows "return" as a keyword, invisible-function shows the name "is_" as the function name and the rest not, homoglyph-function shows a quite distinct-looking letter that definitely isn't an H.

The problems here are not Python's, they are code reviewers', and that means they're really attacks against the code review tools. It's no different from using the variable m in one place and rn in another, and hoping that code review uses a proportionally-spaced font that makes those look similar.

So to count as a viable attack, there needs to be at least one tool that misparses these; so far, I haven't found one, but if I do, wouldn't it be more appropriate to raise the bug report against the tool?
There is no reason why linters and code checkers shouldn't check for invisible characters, Unicode confusables or mixed script identifiers and flag them. The interpreter shouldn't concern itself with such purely stylistic issues unless there is a concrete threat that can only be handled by the interpreter itself.
I mean current linters. But it will be good to check those for sure. As a programmer, i don't want a language which bans unicode stuffs. If there's something that should be fixed, it's the unicode standard, maybe defining a sane mode where weird unicode stuffs are not allowed. Can also be from language side in the event where it's not being considered in the standard itself.
Uhhm..... "weird unicode stuffs"? Please clarify.
I don't see it as a language fault nor as a client fault as they are considering the unicode docs but the response was mixed with some languages decided to patch it from their side, some linters implementing detection for it as well as some editors flagging it and rendering it as the exploit intended.
I see it as an editor issue (or code review tool, as the case may be). You'd be hard-pressed to get something past code review if it looks to everyone else like you slipped a "return" statement at the end of a docstring.

So far, I've seen fewer problems from "weird unicode stuffs" than from the quoted-printable encoding, and that's an attack that involves nothing but ASCII text. It's also an attack that far more code review tools seem to be vulnerable to.

ChrisA

GitHub specifically flags it as a possible exploit in a couple of cases, but also syntax highlights the return keyword appropriately.
My guess is that GitHub did patch it afterwards, as the paper does list GitHub as vulnerable.
Uhhm..... "weird unicode stuffs"? Please clarify.
Wriggly texts, just because they appear different.

Well, it's tool-based, but maybe compiler checks, aka checks from the language side, are something that should be insisted upon too, to patch inconsistent checks across editors. The reason i was saying it's related to encodings is that when languages are impacted en masse, maybe it hints at a revision in the unicode standards, at the very least warnings. As Steven above was hinting towards the vulnerability even before i posted the paper, maybe those in charge of the unicode standards should study and predict angles of attack.

Kind Regards,
Abdur-Rahmaan Janhangeer
Mauritius

On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:
The problems here are not Python's, they are code reviewers', and that means they're really attacks against the code review tools.
I think that's a bit strong. Boucher and Anderson's paper describes multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI attack does seem to be novel, and probably exploitable.

But unfortunately it seems to be the Unicode confusables or homoglyph attack that is getting most of the attention, and that's not new: it is as old as ASCII, and not so easily exploitable. Being able to have А (Cyrillic), Α (Greek alpha) and A (Latin) in the same code base makes for a nice way to write obfuscated code, but it's *obviously* obfuscated and not so easy to smuggle in hostile code.

Whereas the BIDI attacks do (apparently) make it easy to smuggle in code: using invisible BIDI control codes, you can introduce source code where the way the editor renders the code, and the way the coder reads it, is different from the way the interpreter or compiler runs it. That is, I think, new and exploitable: something that looks like a comment is actually code that the interpreter runs, and something that looks like code is actually a string or comment which is not executed, but editors may syntax-colour it as if it were code.

Obviously we can mitigate against this by improving the editors (at the very least, all editors should have a Show Invisible Characters option). Linters and code checkers should also flag problematic code containing BIDI codes, or attacks against docstrings. Beyond that, it is not clear to me what, if anything, we should do in response to this new class of Trojan Source attacks, beyond documenting it.

-- Steve

On Tue, Nov 16, 2021 at 12:13 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:
The problems here are not Python's, they are code reviewers', and that means they're really attacks against the code review tools.
I think that's a bit strong. Boucher and Anderson's paper describes multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI attacks do seem to be a novel attack, and probably exploitable.
The BIDI attacks basically amount to making this:

def func():
    """This is a docstring"""; return

look like this:

def func():
    """This is a docstring; return"""

If you see something that looks like the second, but the word "return" is syntax-highlighted as a keyword instead of part of the string, the attack has failed. (Or if you ignore that, then your code review is flawed, and you're letting malicious code in.) The attack depends for its success on some human approving some piece of code that doesn't do what they think it does, and that means it has to look like what it doesn't do - which is an attack against what the code looks like, since what it does is very well defined.
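To make that concrete without hiding anything, the attack can be reproduced with the BIDI code spelled as a visible escape (U+2067, RIGHT-TO-LEFT ISOLATE, is one of the codes the paper uses; how a given editor renders the raw character is an assumption about its BIDI support):

# Build the source with an explicit escape so the BIDI code stays visible.
# Pasted raw into a BIDI-rendering editor, the "; return" after the
# docstring can appear to be *inside* the string.
src = (
    "def func():\n"
    "    '''This is a docstring\u2067'''; return 'executed'\n"
)
ns = {}
exec(src, ns)
print(ns["func"]())   # prints 'executed': the return statement really runs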
Whereas the BIDI attacks do (apparently) make it easy to smuggle in code: using invisible BIDI control codes, you can introduce source code where the way the editor renders the code, and the way the coder reads it, is different from the way the interpreter or compiler runs it.
Right: the way the editor renders the code, that's the essential part. That's why I consider this an attack against some editor (or set of editors). When you find an editor that is vulnerable to this, file a bug report against that editor. The way the coder reads it will be heavily based upon the way the editor colours it.
That is, I think, new and exploitable: something that looks like a comment is actually code that the interpreter runs, and something that looks like code is actually a string or comment which is not executed, but editors may syntax-colour it as if it were code.
Right. Exactly my point: editors may syntax-colour it incorrectly. That's why I consider this not an attack on the language, but on the editor. As long as the editor parses it the exact same way that the interpreter does, there isn't a problem. ChrisA
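One way a review tool can parse the code exactly as the interpreter does is to lean on the stdlib tokenizer; a sketch using the same example:

import io
import tokenize

# Tokenize the source exactly as CPython does; the keyword 'return'
# shows up as its own NAME token outside the STRING token, no matter
# how an editor happens to render the line.
src = "def func():\n    '''This is a docstring\u2067'''; return 'executed'\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))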

Abdur-Rahmaan Janhangeer writes:
As a programmer, i don't want a language which bans unicode stuffs.
But that's what Unicode says should be done (see below).
If there's something that should be fixed, it's the unicode standard,
Unicode is not going to get "fixed". Most features are important for some natural language or other. One could argue that (for example) math symbols that are adopted directly from some character repertoire should not have been -- I did so elsewhere, although not terribly seriously.
maybe defining a sane mode where weird unicode stuffs are not allowed.
Unicode denies responsibility for that by permitting arbitrary subsetting. It does have a couple of (very broad) subsets predefined, i.e., the normalization forms.

On Mon, Nov 15, 2021 at 03:20:26PM +0400, Abdur-Rahmaan Janhangeer wrote:
Well, it's not so obvious. From Ross Anderson and Nicholas Boucher src: https://trojansource.codes/trojan-source.pdf
Thanks for the link. But it discusses a whole range of Unicode attacks, and the specific attack you mentioned (Invisible Character Attacks) is described in section D, page 7, as "unlikely to work in practice". As they say, compilers and interpreters in general already display errors, or at least warnings, for invisible characters in code. In addition, there is the difficulty that it's not enough to use invisible characters to call a different function; you also have to smuggle in the hostile function that you actually want to call.

The Trojan Source attack listed in the paper does seem to be new, but others (such as the homoglyph attacks that get most people's attention) are neither new nor especially easy to exploit. Unicode has been warning about them for many years. We discussed them in PEP 3131. This is not new, and not easy to exploit. Perhaps that's why there are no, or very few, actual exploits of this in the wild. Homoglyph attacks against user-names and URLs, absolutely; but homoglyph attacks against source code are a different story.

Yes, you can cunningly have two classes named Α and A and the Python interpreter will treat them as distinct, but you still have to smuggle in your hostile code in Α (Greek Alpha) without anyone noticing, and you have to avoid anyone asking why you have two classes with the same name. And that's the hard part. We don't need Unicode for homoglyph attacks: func0 and funcO may look identical, or nearly identical, but you still have to smuggle your hostile code into funcO without anyone noticing, and that's why there are so few real-world homoglyph attacks. Whereas the Trojan Source attacks using BIDI controls do seem to be genuinely exploitable. -- Steve
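A sketch of the distinction being drawn: NFKC normalization leaves homoglyphs such as Greek Alpha distinct (which is what allows two same-looking class names to coexist), while it does fold the mathematical letter-like symbols:

import unicodedata

a_latin = "A"        # U+0041 LATIN CAPITAL LETTER A
a_greek = "\u0391"   # U+0391 GREEK CAPITAL LETTER ALPHA
print(a_latin == a_greek)                          # False: distinct code points
print(unicodedata.normalize("NFKC", a_greek))      # still 'Α': homoglyphs are NOT folded

math_bold_a = "\U0001d400"                         # MATHEMATICAL BOLD CAPITAL A
print(unicodedata.normalize("NFKC", math_bold_a))  # 'A': letter-like symbols ARE folded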

On 11/15/2021 5:45 AM, Steven D'Aprano wrote:
In another thread, Serhiy already suggested we ban invisible control characters (other than whitespace) in comments and strings.
He said in string *literals*. One would put them in strings by using visible escape sequences.
>>> '\033' is '\x1b' is '\u001b'
True
https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A...
I think that is a good idea.
If one is outputting terminal control sequences, making the escape char visible is a good idea anyway. It would be easier if '\e' worked. (But see below.)
But beyond the C0 and C1 control characters, we should be conservative about banning "hidden characters" without a *concrete* threat. For example, variation selectors are "hidden", but they change the visual look of emoji and other characters. I can imagine a complete point-and-click emoji input method that has one select the emoji and the variation, and then outputs the pair together. An option to output the selection character as the appropriate Python-specific '\unnnn' escape is unlikely, and even if there were one, who would know what it meant? Users would want the selected variation visible if the editor supported it.
If terminal escape sequences were also selected by point and click, my comment above would change. -- Terry Jan Reedy
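For concreteness, a sketch of the variation-selector behaviour Terry describes (VS15, U+FE0E, requests text presentation; VS16, U+FE0F, emoji presentation):

# Same base character, two requested renderings; the selector itself is invisible.
heart_text  = "\u2764\ufe0e"   # HEAVY BLACK HEART + VARIATION SELECTOR-15 (text style)
heart_emoji = "\u2764\ufe0f"   # HEAVY BLACK HEART + VARIATION SELECTOR-16 (emoji style)
print(heart_text, heart_emoji)
print(len(heart_emoji))        # 2: the selector is a real code point
print(ascii(heart_emoji))      # '\u2764\ufe0f' makes the hidden selector visible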

On Mon, Nov 15, 2021 at 12:28:01PM -0500, Terry Reedy wrote:
On 11/15/2021 5:45 AM, Steven D'Aprano wrote:
In another thread, Serhiy already suggested we ban invisible control characters (other than whitespace) in comments and strings.
He said in string *literals*. One would put them in strings by using visible escape sequences.
Thanks Terry for the clarification; of course I didn't mean to imply that we should ban control characters in strings completely. Only actual control characters embedded in string literals in the source should be banned, just as we already ban them outside of comments and strings. -- Steve
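A sketch of the kind of check being proposed here: flag raw control characters embedded in string literals while leaving visible escape sequences alone (the helper name and report format are made up for illustration):

import io
import tokenize
import unicodedata

def raw_controls_in_literals(src):
    """Yield (line number, literal) for raw control chars inside string literals."""
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.STRING:
            for ch in tok.string:
                if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t":
                    yield tok.start[0], ascii(tok.string)

# The first literal embeds a raw ESC; the second spells it as a visible escape.
src = 's = "abc\x1bdef"\nt = "abc\\x1bdef"\n'
print(list(raw_controls_in_literals(src)))   # flags only the first literal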

Steven D'Aprano wrote:
I think that many editors in common use don't support bidirectional text, or at least the ones I use don't seem to support it fully or correctly. ... But, if there is a concrete threat beyond "it looks weird", that is another issue.
Based on the original post (and how it looked in my web browser, after various automated reformattings), it seems that one of the failure modes of buggy editors is that stuff can be part of the code even though it looks like part of a comment, or vice versa. This problem might be limited to only some of the BIDI controls, and there might even be a workaround specific to # ... but it is an issue. I do not currently have an opinion on how important an issue it is, or how adequate the workarounds are. -jJ

On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:
def 𝚑𝓮𝖑𝒍𝑜():
try:
𝔥e𝗅𝕝𝚘︴ = "Hello"
𝕨𝔬r𝓵ᵈ﹎ = "World"
ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
𝒉eℓˡ𝗈()
# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
return ₛ
0_o color me impressed, I did not think that would be legal syntax. It would be interesting to include in a textbook, if for nothing else than to demonstrate academically that it is possible, as I suspect many are not aware. -- Kyle R. Stanley, Python Core Developer (what is a core dev? <https://devguide.python.org/coredev/>) Pronouns: they/them (why is my pronoun here? <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>)

On Mon, Nov 15, 2021 at 8:42 AM Kyle Stanley <aeros167@gmail.com> wrote:
On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:
def 𝚑𝓮𝖑𝒍𝑜():
[... Python code it's easy to believe isn't grammatical ...]
return ₛ
0_o color me impressed, I did not think that would be legal syntax. Would be interesting to include in a textbook, if for nothing else other than to academically demonstrate that it is possible, as I suspect many are not aware.
I'm afraid the best Paul, Alex, Anna and I can hope to do is bring it to the attention of readers of Python in a Nutshell's fourth edition (on current plans, hitting the shelves about the same time as 3.11, please tell your friends ;-) ). Sadly, I'm not aware of any academic classes that use the Nutshell as a course text, so it seems unlikely to gain the attention of academic communities. Given the wider reach of this list, however, one might hope that by the time the next edition comes out this will be old news, due to the publication of blogs and the like. With luck, a small fraction of the programming community will become better informed about Unicode and the design of programming languages. It's interesting that the egalitarian wish to allow use of native "alphabetics" has turned out to be such a viper's nest. Particular thanks to Stephen J. Turnbull for his thoughtful and well-informed contribution above. Kind regards, Steve

On Mon, Nov 29, 2021 at 1:21 AM Steve Holden <steve@holdenweb.com> wrote:
It's interesting that the egalitarian wish to allow use of native "alphabetics" has turned out to be such a viper's nest.
Indeed. However, is there no way to restrict identifiers at least to the alphabets of natural languages? Maybe it wouldn't help much, but does anyone need to use letter-like symbols designed for math expressions? I would say maybe, but certainly not have them auto-converted to the "normal" letter. For that matter, why have any auto-conversion at all? The answer may be that it's too late to change now, but I don't think I've seen a compelling (or any?) use case for that conversion. -CHB
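For reference, the auto-conversion in question is PEP 3131's NFKC normalization of identifiers; a minimal sketch of its effect (the variable names are purely illustrative):

import unicodedata

# Identifiers are NFKC-normalized at parse time, so a MATHEMATICAL
# ITALIC SMALL N in the source is the same name as plain 'n'.
print(unicodedata.normalize("NFKC", "\U0001d45b"))   # -> 'n'

𝑛 = 42     # spelled with U+1D45B in the source...
print(n)   # ...but readable back as plain 'n': prints 42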
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
participants (19)
- Abdur-Rahmaan Janhangeer
- Alex Martelli
- Chris Angelico
- Christopher Barker
- Daniel Pope
- David Mertz, Ph.D.
- Eryk Sun
- Jim J. Jewett
- Kyle Stanley
- Marc-Andre Lemburg
- MRAB
- Petr Viktorin
- ptmcg@austin.rr.com
- Richard Damon
- Stephen J. Turnbull
- Stestagg
- Steve Holden
- Steven D'Aprano
- Terry Reedy