[Python-ideas] Visually confusable unicode characters in identifiers

Sun Sep 30 16:00:23 CEST 2012

I recently discovered that PEP 3131 [1] lets me use Greek letters to
represent variables in equations. It was then pointed out to me that it also
allows visually confusable characters in identifiers [2].

When I previously read the PEP I thought that the normalisation process
resolved these issues, but now I see that the PEP leaves it as an open
problem. I also previously thought that the PEP would be irrelevant if I was
using ASCII-only code, but now I can see that if a GREEK CAPITAL LETTER ALPHA
can sneak into my code (just like those pesky tab characters) I could still
have a visually undetectable bug.
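
As a rough sketch of how such a stray character might be caught in code that
is meant to be ASCII-only, the standard tokenize module can list every
identifier containing a non-ASCII character. The function name
non_ascii_identifiers is made up for this illustration; it is not an existing
tool, and it only reports non-ASCII identifiers rather than deciding whether
they are actually confusable:

```python
import io
import tokenize
import unicodedata


def non_ascii_identifiers(source):
    """Yield (line, column, identifier, character names) for each
    identifier in `source` that contains a non-ASCII character.

    Hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.NAME:
            continue
        # Collect only the characters outside the ASCII range.
        odd = [ch for ch in tok.string if ord(ch) > 127]
        if odd:
            names = [unicodedata.name(ch, "UNKNOWN") for ch in odd]
            yield tok.start[0], tok.start[1], tok.string, names


if __name__ == "__main__":
    sample = "A = 1\n\u0391 = 2\n"  # second line uses GREEK CAPITAL LETTER ALPHA
    for line, col, ident, names in non_ascii_identifiers(sample):
        print(line, col, ascii(ident), names)
```

Run over a file's contents, this would at least make the odd character
visible by name, in the same way that a tab checker makes tabs visible.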

An example to show how an issue could arise:

"""
#!/usr/bin/env python3

code = '''
{0} = 123
{1} = 456
print('"{0}" == "{1}":', "{0}" == "{1}")
print('{0} == {1}:', {0} == {1})
'''

def test_identifier(identifier1, identifier2):
    exec(code.format(identifier1, identifier2))

test_identifier('\u212b', '\u00c5') # Different Angstrom code points
test_identifier('A', '\u0391') # LATIN/GREEK CAPITAL A/ALPHA
"""

When I run this I get:

$ ./test.py
"Å" == "Å": False
Å == Å: True
"A" == "Α": False
A == Α: False
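
The difference between the two cases appears to come from the NFKC
normalisation that PEP 3131 applies to identifiers while parsing. A minimal
sketch using the standard unicodedata module (the helper name nfkc is just
for illustration) shows that the two Angstrom code points normalise to the
same string, while Latin A and Greek Alpha remain distinct:

```python
import unicodedata


def nfkc(identifier):
    """Return the NFKC-normalised form of an identifier string."""
    return unicodedata.normalize("NFKC", identifier)


angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
latin_a = "A"             # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"    # GREEK CAPITAL LETTER ALPHA

# The raw strings differ, but their normalised forms are identical,
# so as identifiers they refer to the same name.
same_after_nfkc = nfkc(angstrom_sign) == nfkc(a_with_ring)

# Normalisation leaves these two untouched, so they stay different
# identifiers even though they look the same on screen.
still_distinct = nfkc(latin_a) != nfkc(greek_alpha)

if __name__ == "__main__":
    print("Angstrom variants equal after NFKC:", same_after_nfkc)
    print("Latin A and Greek Alpha still distinct:", still_distinct)
```

So normalisation only merges code points that Unicode treats as equivalent;
it does nothing for characters that merely look alike.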

Is the proposal mentioned in the PEP (to use something based on Unicode
Technical Standard #39 [3]) something that might be implemented at any point?

Oscar

References:
[1] http://www.python.org/dev/peps/pep-3131/#open-issues
[2] http://article.gmane.org/gmane.comp.python.tutor/78116
[3] http://unicode.org/reports/tr39/#Confusable_Detection