I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin so that I might share some of the info I found while recently diving into this topic.

 

As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravenscroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of identifier characters beyond the standard 7-bit ASCII range, and in doing so ran into some of these same issues with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I had assumed that allowing Unicode identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and they overlap with their multi-character ASCII forms (“ﬁ” normalizes to plain “fi”, for instance). So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).
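
For a quick interactive check (this is just me poking at unicodedata, nothing from the book), you can watch NFKC fold these spellings together, and then see that two very different-looking identifiers name the same variable:

import unicodedata

# NFKC is the normalization the CPython tokenizer applies to identifiers (PEP 3131)
print(unicodedata.normalize("NFKC", "ªº"))    # -> ao
print(unicodedata.normalize("NFKC", "ﬁle"))   # -> file (the fi ligature expands)

# identifiers are compared only after normalization, so these are the same name
ªº = 42
print(ao)                                     # -> 42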

 

To explore the extreme case, I wrote a pyparsing transformer to convert the identifiers in a body of Python source to a mixed-font form that is equivalent to the original source after NFKC normalization. Here are hello.py and a snippet from unittest/util.py:

 

def 𝚑𝓮𝖑𝒍𝑜():

    try:

        𝔥e𝗅𝕝𝚘 = "Hello"

        𝕨𝔬r𝓵d = "World"

        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º}, {𝖜o𝒓ld}!")

    except 𝓣𝕪pe𝖤𝗿r𝖔𝚛 as ⅇ𝗑c:

        𝒑riₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))

 

if __n𝓪𝑚𝕖__ == "__main__":

    𝒉eℓˡ𝗈()

 

 

# snippet from unittest/util.py

_𝓟L𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽_𝕷𝔼𝗡 = 12

def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, su𝑓𝗳𝗂𝑥𝗹e𝚗):

    ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅e𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ

    if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙_L𝔈𝒩:

        𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xl𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡e𝘯:])

    return 𝘴

 

 

You should be able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.

 

(If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com). I have a second gist containing the transformer, but it is still a private gist atm.)
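
Since that transformer gist isn’t public yet, here is a stripped-down sketch of the general idea (not the actual code; for simplicity it restricts itself to FULLWIDTH letters rather than the full ransom-note mix, and it leaves keywords and string literals alone):

import keyword
from pyparsing import Word, alphas, alphanums, quotedString

def to_fullwidth(t):
    # FULLWIDTH letters (U+FF21..U+FF5A) NFKC-normalize back to plain ASCII
    name = t[0]
    if keyword.iskeyword(name):              # keywords must stay plain ASCII
        return name
    return "".join(chr(ord(c) + 0xFEE0) if c.isascii() and c.isalpha() else c
                   for c in name)

identifier = Word(alphas + "_", alphanums + "_")
identifier.setParseAction(to_fullwidth)

scanner = quotedString | identifier          # leave quoted strings untouched

source = '''def greet(name):
    print("Hello, " + name)

greet("World")
'''
print(scanner.transformString(source))

The transformed output still runs and prints exactly what the original does, since the tokenizer folds the FULLWIDTH letters back to ASCII before comparing identifiers.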

 

 

Some other discoveries:

“·” (U+00B7, MIDDLE DOT) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point: “s·join('x')” is easily misread as “s.join('x')”, but it is actually a call to a single (and potentially malicious) name “s·join”, not an attribute lookup on s (see the snippet after these notes).

“_” seems to be a special case in identifier handling. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
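
Both of these are easy to demonstrate; the names below are made up purely for illustration:

s = "-"

# MIDDLE DOT is a legal identifier *body* character, so the next line binds a
# single name "s·join"; it is not an attribute lookup on s
s·join = lambda seq: "gotcha"

print(s.join("abc"))    # -> a-b-c   (the real str.join)
print(s·join("abc"))    # -> gotcha  (easily misread as the line above)

# FULLWIDTH LOW LINE normalizes to "_", but only as a body character
x＿ = 41
print(x_ + 1)           # -> 42; x＿ and x_ are the same identifier after NFKC
# ＿x = 1               # SyntaxError: it cannot *start* an identifier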

 

 

Potential beneficial uses:

I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode character groups instead of colors: module names drawn from one group, builtins from another, program variables from another, and perhaps local variables distinguished from globals. Colorizing has always been the obvious syntax-highlighting device, but it is an accessibility problem for those who have difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.
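
As a rough sketch of what that might look like (using MATHEMATICAL BOLD as the hypothetical style for builtins, and leaving everything else in its normal face):

import builtins

BOLD_UPPER = 0x1D400 - ord("A")   # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER = 0x1D41A - ord("a")   # MATHEMATICAL BOLD SMALL A

def restyle(name):
    # only builtin names get the bold face in this sketch
    if not hasattr(builtins, name):
        return name
    return "".join(
        chr(ord(c) + (BOLD_UPPER if c.isupper() else BOLD_LOWER))
        if c.isascii() and c.isalpha() else c
        for c in name
    )

print(restyle("print"), restyle("my_total"), restyle("ValueError"))
# -> 𝐩𝐫𝐢𝐧𝐭 my_total 𝐕𝐚𝐥𝐮𝐞𝐄𝐫𝐫𝐨𝐫

Because the bold forms normalize back to ASCII, source restyled this way is still exactly the same program; the grouping is purely visual, and works even for readers who cannot distinguish colors.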

 

 

-- Paul McGuire