Mailman 3 Visually confusable unicode characters in identifiers - Python-ideas

Visually confusable unicode characters in identifiers

Oscar Benjamin

30 Sep 2012

9 a.m.

Having just discovered that PEP 3131 [1] enables me to use greek letters to represent variables in equations, it was pointed out to me that it also allows visually confusable characters in identifiers [2].

When I previously read the PEP I thought that the normalisation process resolved these issues but now I see that the PEP leaves it as an open problem. I also previously thought that the PEP would be irrelevant if I was using ascii-only code but now I can see that if a GREEK CAPITAL LETTER ALPHA can sneak into my code (just like those pesky tab characters) I could still have a visually undetectable bug.

An example to show how an issue could arise:

"""
#!/usr/bin/env python3

code = '''
{0} = 123
{1} = 456
print('"{0}" == "{1}":', "{0}" == "{1}")
print('{0} == {1}:', {0} == {1})
'''

def test_identifier(identifier1, identifier2):
    exec(code.format(identifier1, identifier2))

test_identifier('\u212b', '\u00c5') # Different Angstrom code points
test_identifier('A', '\u0391') # LATIN/GREEK CAPITAL A/ALPHA
"""

When I run this I get:

$ ./test.py
"Å" == "Å": False
Å == Å: True
"A" == "Α": False
A == Α: False

Is the proposal mentioned in the PEP (to use something based on Unicode Technical Standard #39 [3]) something that might be implemented at any point?

Oscar

References:
[1] http://www.python.org/dev/peps/pep-3131/#open-issues
[2] http://article.gmane.org/gmane.comp.python.tutor/78116
[3] http://unicode.org/reports/tr39/#Confusable_Detection
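The output above follows from the NFKC normalisation that PEP 3131 applies to identifiers. A minimal sketch of that comparison, using only the standard `unicodedata` module; the helper name `identifier_key` is made up for illustration and is not part of any library:

```python
import unicodedata


def identifier_key(name):
    """Return the NFKC form, which is what Python 3 compares identifiers by."""
    return unicodedata.normalize("NFKC", name)


pairs = [
    ("\u212b", "\u00c5"),  # ANGSTROM SIGN vs LATIN CAPITAL LETTER A WITH RING ABOVE
    ("A", "\u0391"),       # LATIN CAPITAL LETTER A vs GREEK CAPITAL LETTER ALPHA
]

for left, right in pairs:
    print(
        unicodedata.name(left), "/", unicodedata.name(right),
        "| equal strings:", left == right,
        "| same identifier:", identifier_key(left) == identifier_key(right),
    )
```

The two Angstrom code points collapse to one identifier because one is a canonical equivalent of the other. Latin A and Greek Alpha are unrelated as far as normalisation is concerned, so they stay distinct however similar they look.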

Steven D'Aprano

30 Sep

10:10 a.m.

On 01/10/12 00:00, Oscar Benjamin wrote:

...

Having just discovered that PEP 3131 [1] enables me to use greek letters to represent variables in equations, it was pointed out to me that it also allows visually confusable characters in identifiers [2].

You don't need PEP 3131 to have visually confusable identifiers.

MyObject = My0bject = "many fonts use the same glyph for O and 0"
rn = m = 23  # try reading this in Arial with a small font size
x += l

I don't think it's up to Python to protect you from arbitrarily poor choices in identifiers and typefaces, or against obfuscated code (whether deliberately so or by accident). Use of confusable identifiers is a code-quality issue, little different from any other code-quality issue:

class myfunction:
    def __init__(a, b, c, d, e, f, g, h, i, j, k, l):
        a.b = b-e+k*h
        a.a = i + 1j*j
        a.l = ll + l1 + l
        a.somebodytoldmeishouldusemoredesccriptivevaraiblenames = g+d
        a.somebodytoldmeishouldusemoredesccribtivevaraiblenames = c+f

You surely wouldn't expect Python to protect you from ignorant or obnoxious programmers who wrote code like that. I likewise don't think Python should protect you from programmers who do things like this:

py> A = 42
py> Α = 23
py> A == Α
False

Besides, just because you and I can't distinguish A from Α in my editor, using one particular choice of font, doesn't mean that the author or his intended audience (Greek programmers perhaps?) can't distinguish them, using their editor and a more suitable typeface. The two characters are distinct using Courier or Lucida Typewriter, to mention only two.

...

Is the proposal mentioned in the PEP (to use something based on Unicode Technical Standard #39 [3]) something that might be implemented at any point?

...

[3] http://unicode.org/reports/tr39/#Confusable_Detection

I would welcome "confusable detection" in the standard library, possibly a string method "skeleton" or some other interface to the Confusables file, perhaps in unicodedata. And I would encourage code checkers like PyFlakes, PyLint, PyChecker to check for confusable identifiers. But I do not believe that this should be built into the Python language itself. -- Steven
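As a rough illustration of what a "skeleton" interface could look like, here is a minimal sketch in the spirit of UTS #39: decompose, map each character to a prototype, decompose again. The `CONFUSABLE_PROTOTYPES` table is a tiny hand-picked excerpt chosen for this example, and `skeleton` and `confusable` are hypothetical names; neither exists in `unicodedata` or on `str`. A real implementation would load the Unicode Consortium's full confusables data instead.

```python
import unicodedata

# Hypothetical, hand-picked excerpt for illustration only. The real mapping
# lives in the Unicode confusables data file and is far larger.
CONFUSABLE_PROTOTYPES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "0": "O",       # DIGIT ZERO
    "1": "l",       # DIGIT ONE
    "m": "rn",      # one character can map to a sequence
}


def skeleton(text):
    """Map text to a form that is equal for visually confusable strings."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """True if a and b are different strings with the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


print(confusable("A", "\u0391"))
print(confusable("MyObject", "My0bject"))
print(confusable("rn", "m"))
print(confusable("A", "B"))
```

A code checker could run such a function over every pair of names in a scope and report collisions, which keeps the policy out of the language itself, as suggested above.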

Jim Jewett

1 Oct

10:43 a.m.

On 9/30/12, Steven D'Aprano <steve@pearwood.info> wrote:

...

On 01/10/12 00:00, Oscar Benjamin wrote:

...

py> A = 42
py> Α = 23
py> A == Α
False

It will never be possible to catch all confusables, which is one reason that the unicode property stalled. It seems like it would be reasonable to at least warn when identifiers are not all in the same script -- but real-world examples from Emacs Lisp made it clear that this is often intentional. There were still clear word-boundaries, but it wasn't clear how that word-boundary detection could be properly automated in the general case.

...

Besides, just because you and I can't distinguish A from Α in my editor, using one particular choice of font, doesn't mean that the author or his intended audience (Greek programmers perhaps?) can't distinguish them,

In many cases, it does -- for the letters to look different requires an unnatural font choice, though perhaps not so extreme as the print-the-hex-code font.

...

I would welcome "confusable detection" in the standard library, possibly a string method "skeleton" or some other interface to the Confusables file, perhaps in unicodedata.

I would too, and agree that it shouldn't be limited to identifiers. -jJ
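The same-script warning mentioned above can be approximated with the standard library alone. This is a heuristic sketch, not a real script lookup: `unicodedata` does not expose the Unicode Script property, so the code assumes the first word of a character's name (for example "GREEK" or "LATIN") is a good enough stand-in. The function names are invented for this example.

```python
import unicodedata


def scripts_used(identifier):
    """Approximate the set of scripts in an identifier from character names."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():  # skip digits and underscores
            # First word of the name, e.g. "GREEK" in "GREEK CAPITAL LETTER ALPHA".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier):
    """True if the letters of the identifier come from more than one script."""
    return len(scripts_used(identifier)) > 1


for name in ["alpha", "\u0391lpha", "\u03b1\u03b2", "\u0394t"]:
    print(ascii(name), sorted(scripts_used(name)), is_mixed_script(name))
```

Note that `Δt` is flagged as mixed even though it is a perfectly deliberate name, which is the kind of intentional mixing that makes a blanket warning hard to get right.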

Mathias Panzenböck

11:07 a.m.

I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

Chris Angelico

11:19 a.m.

On Tue, Oct 2, 2012 at 2:07 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:

...

I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

Same reason you're allowed more than two letters in your identifiers: to allow programmers to make variable names meaningful. The problem isn't with Unicode, anyway; there are plenty of fonts in which l and 1 are practically identical, and unless your font is monospaced, you probably will have trouble distinguishing __________rn___ from __________m___ (just how many underscores IS that?). It's up to the programmer to be smart about his names. ChrisA

Robert Kern

11:43 a.m.

On 10/1/12 5:07 PM, Mathias Panzenböck wrote:

...

I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

http://www.python.org/dev/peps/pep-3131/#rationale

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco

Mathias Panzenböck

12:02 p.m.

On 10/01/2012 06:43 PM, Robert Kern wrote:

...

On 10/1/12 5:07 PM, Mathias Panzenböck wrote:

...
I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

http://www.python.org/dev/peps/pep-3131/#rationale

But the Python keywords and, more importantly, the documentation are in English. Don't you need to be able to speak/write English in order to code Python anyway? And if you keep your code+comments English you can access a much larger developer pool (all developers who speak English should by my hypothesis be a superset of all developers who speak a certain language).

Guido van Rossum

12:44 p.m.

On Mon, Oct 1, 2012 at 10:02 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:

But the Python keywords and more importantly the documentation is English. Don't you need to be able to speak/write English in order to code Python anyway? And if you keep you code+comments English you can access a much larger developer pool (all developers who speak English should by my hypothesis be a superset of all developers who speak a certain language).

Hi Mathias,

Your objections go pretty much exactly along the lines of my original resistance to this proposal (which was proposed many times before it got to be a PEP). What finally made me change my mind was talking to educators who were teaching Python in countries where not only is English not the primary language, the primary language is not even related to English. (E.g. Chinese or Japanese.)

Teaching the students the necessary language keywords and standard library names is not that difficult; even if English *is* your primary language you have to learn what they mean in the context of programming. (Example: "print" comes from a very ancient mode of using computers where the only form of output was through a physical printer.)

But these students often have a very limited English vocabulary, and their science and math classes (which are often useful starting points for programming exercises) are usually taught in the native language. So when teachers show students example programs it helps if they can name e.g. their variables and functions in the native language. Comments are also often written in the native language. Here, it really helps if the students can type their native language directly rather than having to use the Latin transcription (even if they often also have to learn the latter, for unrelated pragmatic reasons).

From your name and email it sounds like your native language might be German. Like me, you probably take pride in your English skills and like me, you write all your code using English for identifiers and comments. However, for students just learning to program and not yet well-versed in English, that would be like trying to teach them multiple things at once. It may work for the smartest students, but it probably would be unnecessarily off-putting for many others.

As an example in German, I found a Python book aimed at middle- and high-schoolers written in German, Python für Kids. You can look inside it on the Amazon website: http://www.amazon.com/Python-f%C3%BCr-Kids/dp/3826609514#reader_3826609514 -- the examples use German words for most module and variable names. Luckily German limited to ASCII is still fairly readable ("fuer" instead of "für" etc.), so Unicode is not strictly needed for this case -- but you can understand that in languages whose native script is not the Latin alphabet, Unicode is essential for the same style of introduction.

I'm sure there are also examples beyond education -- e.g. in a program for calculating Dutch taxes I would use the Dutch names for the various technical terms naming concepts in Dutch tax law, and again, in the case of the Dutch language that doesn't require Unicode, but for many other languages it would.

I hope this helps. (Also note, as the PEP states explicitly, that the Python standard library should use only ASCII and English for identifiers and comments, except in those unittests that are specifically testing the Unicode identifiers feature.)

-- 
--Guido van Rossum (python.org/~guido)

Antoine Pitrou

1:04 p.m.

On Mon, 1 Oct 2012 10:44:42 -0700 Guido van Rossum <guido@python.org> wrote:

...

As an example in German, I found a Python book aimed at middle- and high-schoolers written in German, Python für Kids. You can look inside it on the Amazon website: http://www.amazon.com/Python-f%C3%BCr-Kids/dp/3826609514#reader_3826609514

Oh but why isn't it named Python für Kinder? :-) Regards Antoine. -- Software development and contracting: http://pro.pitrou.net

Guido van Rossum

1:10 p.m.

On Mon, Oct 1, 2012 at 11:04 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

Oh but why isn't it named Python für Kinder? :-)

Probably to be "cool" for the "kids". Why is a mobile phone in Germany called a "Handy" ? -- --Guido van Rossum (python.org/~guido)

Jakob Bowyer

1:12 p.m.

Because it fits in your hand? And it's handy? :)

Georg Brandl

3:04 p.m.

On 10/01/2012 08:10 PM, Guido van Rossum wrote:

Probably to be "cool" for the "kids". Why is a mobile phone in Germany called a "Handy" ?

And why, oh why, do we have to buy our bread rolls at a "Backshop" nowadays... Georg

Greg Ewing

6:24 p.m.

Antoine Pitrou wrote:

...

Oh but why isn't it named Python für Kinder? :-)

It looks like Germans have adopted "kid" as an abbreviation for "kinder", just like we use it as an abbreviation for "child". Or maybe we got it from them -- it's closer to their original word than ours! They seem to be using our plural, though -- "kids", not "kidden"... -- Greg

Mathias Panzenböck

7:06 p.m.

On 10/02/2012 01:24 AM, Greg Ewing wrote:

...

Antoine Pitrou wrote:

...
Oh but why isn't it named Python für Kinder? :-)

It looks like Germans have adopted "kid" as an abbreviation for "kinder", just like we use it as an abbreviation for "child". Or maybe we got it from them -- it's closer to their original word than ours!

They seem to be using our plural, though -- "kids", not "kidden"...

Sometimes we use the ...s for plural as well, especially for acronyms, words of English or French origin and last names. But it would not be ...en, maybe ...er. Is there any German word that uses ...en for plural? I don't think so.

Anyway, "kids" is definitely an anglicism, because we pronounce it "English" and not like it would be pronounced if it were derived from "Kind" (it would be more like "keed"). German today is full of anglicisms.

But then, there are some German words used by English people as well: gesundheit, kindergarten, über, blitz(krieg), angst (used as something different from the German word), abseiling ("abseilen" in German), doppelgänger, gestalt, poltergeist, Zeitgeist...

Greg Ewing

2 Oct

12:09 a.m.

Mathias Panzenböck wrote:

...

But it would not be ...en, maybe ...er. Is there any German word that uses ...en for plural? I don't think so.

This page seems to think that some do: http://german.about.com/od/grammar/a/PluralNounsWithnENEndings.htm -- Greg

Georg Brandl

1 Oct

12:48 p.m.

On 10/01/2012 07:02 PM, Mathias Panzenböck wrote:

But the Python keywords and more importantly the documentation is English. Don't you need to be able to speak/write English in order to code Python anyway? And if you keep you code+comments English you can access a much larger developer pool (all developers who speak English should by my hypothesis be a superset of all developers who speak a certain language).

Please; the PEP has been discussed quite a lot when it was proposed, and believe me, yours is not an unfamiliar argument :) You're about 5 years late. Georg

Mathias Panzenböck

2:33 p.m.

On 10/01/2012 07:48 PM, Georg Brandl wrote:

Please; the PEP has been discussed quite a lot when it was proposed, and believe me, yours is not an unfamiliar argument :) You're about 5 years late.

Georg

I didn't want to start a discussion. I just wanted to know why one would implement such a language feature. Guido's answer cleared it up for me, thanks. I can see the purpose in an educational setting (not in production code of anything a little bit bigger). -panzi

Georg Brandl

3:03 p.m.

On 10/01/2012 09:33 PM, Mathias Panzenböck wrote:

I didn't want to start a discussion. I just wanted to know why one would implement such a language feature.

Well, in that case I would have said "read the PEP": I think it's well explained there. Georg

Oscar Benjamin

3:26 p.m.

On 1 October 2012 20:33, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:

I didn't want to start a discussion. I just wanted to know why one would implement such a language feature. Guido's answer cleared it up for me, thanks. I can see the purpose in an educational setting (not in production code of anything a little bit bigger).

Non-ascii identifiers have other possible uses. I'll repost the case that started this discussion on python-tutor (attached in case it doesn't display):

'''
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-

# Parameters
α = 1
β = 0.1
γ = 1.5
δ = 0.075

# Initial conditions
xₒ = 10
yₒ = 5
Zₒ = xₒ, yₒ

# Solution parameters
tₒ = 0
Δt = 0.001
T = 10

# Lotka-Volterra derivative
def f(Z, t):
    x, y = Z
    ẋ = x * (α - β*y)
    ẏ = -y * (γ - δ*x)
    return ẋ, ẏ

# Accumulate results from Euler stepper
tᵢ = tₒ
Zᵢ = Zₒ
Zₜ, t = [], []
while tᵢ <= tₒ + T:
    Zₜ.append(Zᵢ)
    t.append(tᵢ)
    Zᵢ = [Zᵢⱼ+ Δt*Żᵢⱼ for Zᵢⱼ, Żᵢⱼ in zip(Zᵢ, f(Zᵢ, tᵢ))]
    tᵢ += Δt

# Output since I don't have plotting libraries in Python 3
print('t', 'x', 'y')
for tᵢ, (xᵢ, yᵢ) in zip(t, Zₜ):
    print(tᵢ, xᵢ, yᵢ)
'''

Oscar
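One detail worth checking with subscripted names like these: PEP 3131 compares identifiers after NFKC normalisation, and NFKC folds subscript letters into their plain counterparts. The short sketch below checks this for a single name, on the assumption that the running interpreter follows the PEP; the variable names are chosen only for the demonstration.

```python
import unicodedata

# "x" followed by U+2092 LATIN SUBSCRIPT SMALL LETTER O, as in the script above.
subscripted = "x\u2092"
plain = unicodedata.normalize("NFKC", subscripted)

# Execute an assignment to the subscripted name and inspect the resulting names.
namespace = {}
exec(subscripted + " = 10", namespace)
user_names = sorted(k for k in namespace if not k.startswith("__"))

print(ascii(subscripted), "normalises to", ascii(plain))
print("names defined by the assignment:", user_names)
```

If the subscripted and plain spellings end up as the same name, then a script that used both `xₒ` and `xo` would silently share one variable. That is a normalisation effect, separate from the confusable-glyph problem.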

Guido van Rossum

3:51 p.m.

On Mon, Oct 1, 2012 at 1:26 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

Non-ascii identifiers have other possible uses. I'll repost the case that started this discussion on python-tutor (attached in case it doesn't display):

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers? -- --Guido van Rossum (python.org/~guido)

Andre Roberge

3:55 p.m.

On Mon, Oct 1, 2012 at 5:51 PM, Guido van Rossum <guido@python.org> wrote:

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

+1000 André Roberge

Oscar Benjamin

4:46 p.m.

On 1 October 2012 21:51, Guido van Rossum <guido@python.org> wrote:

...

On Mon, Oct 1, 2012 at 1:26 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

...
# Parameters
α = 1
β = 0.1
γ = 1.5
δ = 0.075

# Initial conditions
xₒ = 10
yₒ = 5
Zₒ = xₒ, yₒ

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

Sympy already has a few of the basic TeX concepts. I imagine that something like Sympy notebooks (a browser-based interface) might one day gain support for this. A readline-ish method to do it would be a great extension to isympy (since it already works for output):

$ isympy
IPython console for SymPy 0.7.1.rc1 (Python 2.7.3-64-bit) (ground types: python)

In [1]: Symbol('beta')
Out[1]: β

In [2]: Symbol('c_1')
Out[2]: c₁

Oscar

Georg Brandl

4:54 p.m.

On 10/01/2012 10:51 PM, Guido van Rossum wrote:

...

On Mon, Oct 1, 2012 at 1:26 PM, Oscar Benjamin

...
Non-ascii identifiers have other possible uses. I'll repost the case that started this discussion on python-tutor (attached in case it doesn't display):

Very nice!

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

With the right editor, of course, it's not a problem :) (Emacs has a TeX input method with which I could type this example without problems.) Georg

Matthew Woodcraft

5:28 p.m.

On 2012-10-01 21:51, Guido van Rossum wrote:

...

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

I think that's up to the OS or the text editor. In Emacs, this works:

M-x set-input-method tex

-M-

Ben Finney

11:25 p.m.

Matthew Woodcraft <matthew@woodcraft.me.uk> writes:

...

On 2012-10-01 21:51, Guido van Rossum wrote:

...
Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

I think that's up to the OS or the text editor.

Agreed. Many of these identifiers will need to be typed at an OS command line, after all (e.g. for naming a test case to run, as one which springs easily to mind).

Solve the keyboard input problem in the OS layer – as someone who anticipates working with non-ASCII characters must already do – and you solve it for Python code as well.

I don't think it's Python's business to get involved at the input method level.

-- 
“The apparent lesson of the Inquisition is that insistence on uniformity of belief is fatal to intellectual, moral, and spiritual health.” -- _The Uses Of The Past_, Herbert J. Muller
Ben Finney

Stephen J. Turnbull

2 Oct

3:04 a.m.

Ben Finney writes:

...

Solve the keyboard input problem in the OS layer – as someone who anticipates working with non-ASCII characters must already do – and you solve it for Python code as well.

That simply isn't true for symbol characters and Greek letters. I still let either TeX or XEmacs translate TeX macros for me. I don't even know how to type an integral sign in Mac OS X Terminal (conveniently, that is -- of course there's always the character palette), and if I wanted directed quotation marks (I don't), I'd just use ASCII quotes and let XEmacs translate those, too. There ought to be a standard way to get those symbols and punctuation, preferably ASCII-based, on any terminal, using the standard Python interpreter.
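To make the idea of an ASCII-based route to these symbols concrete, here is a small sketch that expands TeX-style macros for Greek letters by looking up the corresponding Unicode character names. It is only an illustration: `expand_tex_greek` is a hypothetical helper, it covers only macros whose name matches the Unicode letter name (so `\alpha` works, while a macro such as `\int` is left untouched), and nothing like it ships with the Python interpreter.

```python
import re
import unicodedata


def expand_tex_greek(source):
    """Replace macros such as \\alpha or \\Delta with the Greek letter itself."""

    def replace(match):
        word = match.group(1)
        case = "CAPITAL" if word[0].isupper() else "SMALL"
        try:
            return unicodedata.lookup(f"GREEK {case} LETTER {word.upper()}")
        except KeyError:
            # Not a Greek letter name known to Unicode: leave the macro as typed.
            return match.group(0)

    return re.sub(r"\\([A-Za-z]+)", replace, source)


print(expand_tex_greek(r"\alpha = 1"))
print(expand_tex_greek(r"\beta*y"))
print(expand_tex_greek(r"\foo stays as it is"))
```

Python source already offers one portable ASCII spelling inside string literals, the `\N{GREEK SMALL LETTER ALPHA}` escape, but that does not help with typing identifiers, which is the gap being discussed here.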

Ben Finney

6:39 a.m.

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

...

I still let either TeX or XEmacs translate TeX macros for me. I don't even know how to type an integral sign in Mac OS X Terminal (conveniently, that is -- of course there's always the character palette), and if I wanted directed quotation marks (I don't), I'd just use ASCII quotes and let XEmacs translate those, too.

Right. So you've solved it for one program only, not the OS which is (or should be) responsible for turning what you type into characters, uniformly across all applications you have keyboard input for.

...

There ought to be a standard way to get those symbols and punctuation, preferably ASCII-based, on any terminal

Definitely agreed with this. Indeed, it's my point: the problem should be solved in one place for the user of the computer, not separately per application or framework.

...

using the standard Python interpreter.

If you mean that the Python interpreter should be aware of the solution, why? That's solving it at the wrong level, because any non-Python program (such as a shell or an editor) gets no benefit from that. If you mean that the single, one-point solution should work across all programs, including the standard Python interpreter, then yes I agree. I'm saying the OS is the right place to solve it, by installing an appropriate input method (or whatever each OS calls them). -- \ “In economics, hope and faith coexist with great scientific | `\ pretension and also a deep desire for respectability.” —John | _o__) Kenneth Galbraith, 1970-06-07 | Ben Finney

Stephen J. Turnbull

3 Oct

12:31 a.m.

Ben Finney writes:

...

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

...
I still let either TeX or XEmacs translate TeX macros for me. I don't even know how to type an integral sign in Mac OS X Terminal (conveniently, that is -- of course there's always the character palette), and if I wanted directed quotation marks (I don't), I'd just use ASCII quotes and let XEmacs translate those, too.

Right. So you've solved it for one program only, not the OS

You seem to be under a misconception. Emacs *is* an OS, it just runs on top of the more primitive OSes normally associated with the term. ;-)

...

I'm saying the OS is the right place to solve it, by installing an appropriate input method (or whatever each OS calls them).

I doubt very many people used to and fond of LaTeX would agree with you, since AFAIK there aren't any OSes providing TeX macros as an input method. AFAICS it's not available on my Mac. While I don't particularly favor it, it may be the best compromise, as many people are familiar with it, and many many symbols are available with familiar, intuitive names so that non-TeXnical typists can often guess them.
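To make the "TeX macros as an input method" idea concrete, here is a minimal Python sketch of the translation step such a method would perform. It is only an illustration under stated assumptions: the table `TEX_TO_UNICODE_NAME` and the helper `expand_tex_macros` are hypothetical names invented for this example, the table is deliberately tiny, and no existing input method (IBus, XEmacs or otherwise) is claimed to work this way.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny table: TeX macro name -> Unicode character name.
# A real input method would need a far larger mapping.
TEX_TO_UNICODE_NAME = {
    "alpha": "GREEK SMALL LETTER ALPHA",
    "beta": "GREEK SMALL LETTER BETA",
    "Delta": "GREEK CAPITAL LETTER DELTA",
    "int": "INTEGRAL",
    "sum": "N-ARY SUMMATION",
}

# A backslash followed by one or more ASCII letters, as in \alpha.
_MACRO = re.compile(r"\\([A-Za-z]+)")


def expand_tex_macros(text):
    """Replace known TeX-style macros with the corresponding Unicode character."""

    def replace(match):
        name = TEX_TO_UNICODE_NAME.get(match.group(1))
        if name is None:
            # Unknown macro: leave it untouched rather than guess.
            return match.group(0)
        return unicodedata.lookup(name)

    return _MACRO.sub(replace, text)


print(expand_tex_macros(r"\Delta\alpha = \beta - \gamma"))
```

Macros missing from the table (here `\gamma`) pass through unchanged, which mirrors the point that not every TeX construct has a Unicode counterpart.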

Ben Finney

5:20 p.m.

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

...

Ben Finney writes:

...
Right. So you've solved it for one program only, not the OS

You seem to be under a misconception. Emacs *is* an OS […]

… all it needs is a good editor? :-) (I'm claiming permission for that snark because Emacs is my primary editor.)

...

...
I'm saying the OS is the right place to solve it, by installing an appropriate input method (or whatever each OS calls them).

I doubt very many people used to and fond of LaTeX would agree with you, since AFAIK there aren't any OSes providing TeX macros as an input method.

I've shown several LaTeX-comfortable people IBus on GNOME and/or KDE (for GNU+Linux), and they were very glad that it has a LaTeX input method. So anyone who is fond of LaTeX and has IBus or an equivalent input method engine on their OS can agree.

...

AFAICS it's not available on my Mac.

That's a shame. Maybe some OS vendors don't want to support users extending the OS functionality? Or maybe your OS does have such a thing available. I haven't been motivated to look for it.

...

While I don't particularly favor it, it may be the best compromise, as many people are familiar with it, and many many symbols are available with familiar, intuitive names so that non-TeXnical typists can often guess them.

Agreed. Which is why I advocate installing such an input method in one's OS input method engine, so that input method is available for all applications. -- \ “I thought I'd begin by reading a poem by Shakespeare, but then | `\ I thought ‘Why should I? He never reads any of mine.’” —Spike | _o__) Milligan | Ben Finney

Stephen J. Turnbull

4 Oct

10:11 p.m.

Ben Finney writes:

...

I've shown several LaTeX-comfortable people IBus on GNOME and/or KDE (for GNU+Linux), and they were very glad that it has a LaTeX input method.

I'm happy to be proved wrong!

...

...
AFAICS it's not available on my Mac.

That's a shame. Maybe some OS vendors don't want to support users extending the OS functionality? Or maybe your OS does have such a thing available. I haven't been motivated to look for it.

I have looked for it; if it's available on Mac OS X, it's not easy to find. I suspect the same is true for Windows.

...

Agreed. Which is why I advocate installing such an input method in one's OS input method engine, so that input method is available for all applications.

Whatever makes you think I don't? That's *exactly* why I live in XEmacs, because it provides me with a portable environment for mixing English and math with a language whose orthography puts Brainf*ck syntax to shame. But pragmatically speaking, Unicode support is a sore point for Python. "Screw you if you don't know how to conveniently input integral signs on your OS" is not a message we want to be sending.

Stephen J. Turnbull

1 Oct

11:11 p.m.

Guido van Rossum writes:

...

Those examples would be a lot more compelling if there was an acceptable way to input those characters.

Hey!! What's "unacceptable" about Emacs??<duck/>

...

Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

If you're talking about interactive use, Emacs has a method based on searching the Unicode character database. LaTeX math notation has a number of potential pitfalls. In particular, the sub-/superscript notation can be applied to anything, not just characters that happen to have *script versions in Unicode. Also, not everything that seems to be a character in LaTeX necessarily has a corresponding Unicode character.
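For readers who want to try the "search the Unicode character database" approach from Python rather than Emacs, the standard `unicodedata` module offers a rough equivalent. This is a sketch only, not a description of how Emacs does it; the helper `find_characters` is a made-up name, and the scan is limited to the Basic Multilingual Plane to keep it quick.

```python
import unicodedata


def find_characters(*words, limit=0x10000):
    """Return (char, name) pairs whose Unicode name contains every given word."""
    wanted = [w.upper() for w in words]
    matches = []
    for codepoint in range(limit):
        ch = chr(codepoint)
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(ch, "")
        if name and all(w in name.split() for w in wanted):
            matches.append((ch, name))
    return matches


# Exact lookup when the full character name is known.
integral = unicodedata.lookup("INTEGRAL")
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(integral, alpha)

# Fuzzy search by words in the name, e.g. to see which superscript forms exist.
for ch, name in find_characters("SUPERSCRIPT", "TWO"):
    print(ch, name)
```

Searching for other superscript or subscript names this way shows quickly which characters have dedicated forms and which do not, which is the limitation of LaTeX-style notation raised above.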

Serhiy Storchaka

2 Oct

5:43 a.m.

On 01.10.12 23:51, Guido van Rossum wrote:

...

Those examples would be a lot more compelling if there was an acceptable way to input those characters. Maybe we could support some kind of input method that enabled LaTeX style math notation as used by scientists for writing equations in papers?

\u03B1 Java already allows this outside of the string literals. And it sometimes causes unpleasant effects.
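For comparison, Python 3 as I understand it only interprets `\u` escapes inside string literals, so the Java behaviour does not carry over. The sketch below checks this by compiling a small piece of source text; the variable names are chosen for this example only.

```python
# Source text containing a literal backslash-u escape where an identifier
# would go (raw string, so the six characters \u03B1 reach the compiler as-is).
source_with_escape = r"\u03B1 = 1"

try:
    compile(source_with_escape, "<example>", "exec")
except SyntaxError as exc:
    escape_accepted = False
    print("rejected:", exc.msg)
else:
    escape_accepted = True

# The same escape inside an ordinary string literal is decoded to the
# character itself before exec() sees it, so this defines a variable
# named with GREEK SMALL LETTER ALPHA.
namespace = {}
exec("\u03B1 = 1", namespace)
print(namespace["\u03b1"])
```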

Terry Reedy

1 Oct

1:21 p.m.

On 10/1/2012 1:02 PM, Mathias Panzenböck wrote:

...

On 10/01/2012 06:43 PM, Robert Kern wrote:

...
On 10/1/12 5:07 PM, Mathias Panzenböck wrote:

...
I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

http://www.python.org/dev/peps/pep-3131/#rationale

I have the impression that latin-1 chars were/are (unofficially) accepted in Python2.

...

But the Python keywords and more importantly the documentation is English.

I know of at least one translation http://docs.python.org.ar/tutorial/contenido.html though keeping up with changes is obviously a problem. There are multiple books in multiple languages. When I went to a bookstore in Japan, the programming languages section had about 8 for Python. I suspect that is more than most equivalent US bookstores. -- Terry Jan Reedy

Massimo DiPierro

12:18 p.m.

The great thing about open source is that it has brought the world together. I am not an english speaker and I learned the meaning of IF, THEN, FOR, WHILE, not in the context of the English language, but as keywords of the Basic programming language. The fact that they are english words is accidental. The great thing about code is (used to be) that anybody can read and understand what others write.

When I used to program in Italy, I had to deal with latin-1 characters. This was never a problem. Not even in Cobol, Basic, Clipper, or Paradox because data should be separated from code. Data may contain latin-1 or unicode or whatever. Code always contains ASCII and if one does not mix the two there is never a problem.

Allowing unicode in variable names blurs this separation. It makes code written by people speaking one language unreadable by people speaking a different language.

I should point out that most of my students are Chinese. They do not have any problem with reading and writing code using the english alphabet.

Any one of us could design better power plugs for our homes. That does not mean it would be a good idea to do so.

Massimo

On Oct 1, 2012, at 11:43 AM, Robert Kern wrote:

...

On 10/1/12 5:07 PM, Mathias Panzenböck wrote:

...
I still don't understand why unicode characters are allowed at all in identifier names. Is the reason for this written down somewhere?

http://www.python.org/dev/peps/pep-3131/#rationale

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Guido van Rossum

12:51 p.m.

On Mon, Oct 1, 2012 at 10:18 AM, Massimo DiPierro <massimo.dipierro@gmail.com> wrote:

...

The great thing about open source is that it has brought the world together. I am not an english speaker and I learned the meaning of IF, THEN, FOR, WHILE, not in the context of the English language, but as keywords of the Basic programming language. The fact that they are english words is accidental. The great thing about code is (used to be) that anybody can read and understand what others write.

When I used to program in Italy, I had to deal with latin-1 characters. This was never a problem. Not even in Cobol, Basic, Clipper, or Paradox because data should be separated from code. Data may contain latin-1 or unicode or whatever. Code always contains ASCII and if one does not mix the two there is never a problem.

Allowing unicode in variable names blurs this separation. It makes code written by people speaking one language unreadable by people speaking a different language.

I should point out that most of my students are Chinese. They do not have any problem with reading and writing code using the english alphabet.

Any one of us could design better power plugs for our homes. That does not mean it would be a good idea to do so.

Our posts crossed. I hope my explanation makes sense to you. The age / grade level of students probably matters; all classes in middle or high school are typically taught in the native language, but in University more and more courses are taught in English (some European countries are even making English the mandatory teaching language at the University level).

Not everything you design is meant to be a better power plug for the world. Sometimes you just need to find a way to fit *your* oven in *your* cabinet, and cutting up some planks in a way that wouldn't work for anyone else is fine.

-- --Guido van Rossum (python.org/~guido)

Massimo DiPierro

2:29 p.m.

Hello Guido,

it does make sense. The only point I tried to make is that, because something is allowed, it does not mean it should be encouraged. I am sure there are instructors who want to teach to code using Japanese or Chinese variable names. Python gives them a way to do so. Yet, if they do so, they would be isolating their students and their code from the rest of the world.

Massimo

On Oct 1, 2012, at 12:51 PM, Guido van Rossum wrote:

...

On Mon, Oct 1, 2012 at 10:18 AM, Massimo DiPierro <massimo.dipierro@gmail.com> wrote:

...
The great thing about open source is that it has brought the world together. I am not an english speaker and I learned the meaning of IF, THEN, FOR, WHILE, not in the context of the English language, but as keywords of the Basic programming language. The fact that they are english words is accidental. The great thing about code is (used to be) that anybody can read and understand what others write.

When I used to program in Italy, I had to deal with latin-1 characters. This was never a problem. Not even in Cobol, Basic, Clipper, or Paradox because data should be separated from code. Data may contain latin-1 or unicode or whatever. Code always contains ASCII and if one does not mix the two there is never a problem.

Allowing unicode in variable names blurs this separation. It makes code written by people speaking one language unreadable by people speaking a different language.

I should point out that most of my students are Chinese. They do not have any problem with reading and writing code using the english alphabet.

Any one of us could design better power plugs for our homes. That does not mean it would be a good idea to do so.

Our posts crossed. I hope my explanation makes sense to you. The age / grade level of students probably matters; all classes in middle or high school are typically taught in the native language, but in University more and more courses are taught in English (some European countries are even making English the mandatory teaching language at the University level).

Not everything you design is meant to be a better power plug for the world. Sometimes you just need to find a way to fit *your* oven in *your* cabinet, and cutting up some planks in a way that wouldn't work for anyone else is fine.

-- --Guido van Rossum (python.org/~guido)

Nick Coghlan

2:37 p.m.

On Tue, Oct 2, 2012 at 12:59 AM, Massimo DiPierro <massimo.dipierro@gmail.com> wrote:

...

Hello Guido,

it does make sense. The only point I tried to make is that, because something is allowed, it does not mean it should be encouraged. I am sure there are instructors who want to teach to code using Japanese or Chinese variable names. Python gives them a way to do so. Yet, if they do so, they would be isolating their students and their code from the rest of the world.

Only if they *stop* there. The idea is just to allow the learning curve to be made gentler - as people learn the standard library and the tools on PyPI, then yes, it will still be necessary to continue learning English in order to make use of those tools (especially as many of them won't have translated documentation). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Steven D'Aprano

8:32 p.m.

On 02/10/12 05:29, Massimo DiPierro wrote:

...

it does make sense. The only point I tried to make is that, because something is allowed, it does not mean it should be encouraged. I am sure there are instructors who want to teach to code using Japanese or Chinese variable names. Python gives them a way to do so. Yet, if they do so, they would be isolating their students and their code from the rest of the world.

People very often over-estimate the cost of that isolation, and over-value access to the rest of the world. The average open source piece of software has one, maybe two, contributors. What do they care if millions of English-speaking programmers can't contribute when they weren't going to contribute regardless of the language? Perhaps the convenience of being able to read your own code in your own native language outweighs the loss of being able to attract contributors that you can't even talk to.

And for proprietary software, again it is irrelevant. If a Chinese company writes Chinese software for Chinese users with Chinese developers, why would they want to write it in English? Perhaps they have little choice due to the overwhelming trend towards English in programming languages, but there's no positive benefit to using a non-native language.

Quite frankly, and I'm saying this as somebody who only speaks English, I think that the use of English as the single lingua franca of computer programming is as unnecessary (and ultimately as harmful) as the use of Latin and then French as the sole lingua franca of science and mathematics. I expect that it too will be a passing phase.

By the way, are you familiar with ChinesePython and IronPerunis?

http://www.chinesepython.org/english/english.html
http://ironperunis.codeplex.com/

-- Steven

Stephen J. Turnbull

10:48 p.m.

Mathias Panzenböck writes:

...

I still don't understand why unicode characters are allowed at all in identifier names.

"Consenting adults." 'nuff said? An anecdote. Back when I was first learning Japanese, I maintained an Emacs interface to EDICT, a free Japanese-English dictionary. The code was smart enough to parse morphosyntax (inflection of verbs and adjectives) into dictionary forms, but I wasn't (and according to my daughter, still am not<wink/>). So I asked my tutor for help. Although a total non-programmer, he was able to read the grammar easily because the state names (identifiers for callable objects) were written in Japanese, using the standard grammatical name for the inflection. The "easy" part comes in because although his English was good, it wasn't good enough to disentangle Lisp gobbledygook from the morphosyntax data had it been written in ASCII. But he was able to read and comment on the whole grammar in about half an hour because he could just skip *all* the ASCII!


38 comments

19 participants

participants (19)

Andre Roberge
Antoine Pitrou
Ben Finney
Chris Angelico
Georg Brandl
Greg Ewing
Guido van Rossum
Jakob Bowyer
Jim Jewett
Massimo DiPierro
Mathias Panzenböck
Matthew Woodcraft
Nick Coghlan
Oscar Benjamin
Robert Kern
Serhiy Storchaka
Stephen J. Turnbull
Steven D'Aprano
Terry Reedy
