[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2 Nov 2021

      On 01. 11. 21 13:17, Petr Viktorin wrote:
...
Hello,
Today, an attack called "Trojan source" was revealed, where a malicious 
contributor can use Unicode features (bidirectional text and homoglyphs) 
to write code that, when shown in an editor, will look different from how 
a computer language parser will process it.
See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report 
and decided that it should be handled in code editors, diff viewers, 
repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other 
"gotcha" where Python doesn't parse text as a non-expert human would. 
For example: `if a or b == 'yes'`, mutable default arguments, or a 
misleading typo.
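
To make the two named gotchas concrete, here is a minimal sketch. The variable and function names (`a`, `b`, `append_item`, `bucket`) are illustrative only and do not come from the thread.

```python
# Gotcha 1: `a or b == 'yes'` parses as `a or (b == 'yes')`.
a, b = 'no', 'no'
naive = bool(a or b == 'yes')        # the non-empty string 'no' is truthy
intended = a == 'yes' or b == 'yes'  # what a casual reader probably meant

# Gotcha 2: a mutable default is created once, when the function is defined.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)  # reuses the same list object as `first`

print(naive, intended)   # True False
print(first is second)   # True
print(second)            # [1, 2]
```

In both cases the parser follows documented rules; the surprise comes from the reader's expectations, which is the comparison being drawn with the Unicode attack.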
Nevertheless, I did do a bit of research about similar gotchas in 
Python, and I'd like to publish a summary as an informational PEP, 
pasted below.
Thanks for the comments, everyone! I've updated the document and sent it 
to https://github.com/python/peps/pull/2129
A rendered version is at 
https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst

Toshio Kuratomi wrote:
...
`Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.
Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and 
encodings -- that's much too technical for this document. Using the 
specific technical terms unfortunately doesn't help understanding, so I 
use the vague ones like "character" and "letter".)

Jim J. Jewett wrote:
...
...
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."
Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions.  (XID_CONTINUE instead of XID_START)  The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.
I'm not a native speaker, but as I understand it, "十" is closer to a 
single-letter word than a single-digit number. It translates better as 
"ten" than "10". (And it appears in "十四", "fourteen", just like "four" 
appears in "fourteen".)
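
The classification discussed here can be checked with the standard `unicodedata` module. This is a small sketch, assuming the character in question is U+5341; the variable names are illustrative.

```python
import unicodedata

ten = "\u5341"  # 十 (assumed to be U+5341, CJK ideograph)

category = unicodedata.category(ten)      # general category
is_identifier = ten.isidentifier()        # usable as a Python name?
numeric_value = unicodedata.numeric(ten)  # it still carries a numeric value
is_decimal_digit = ten.isdecimal()        # but it is not a decimal digit

# The statement from the draft really is an assignment to a name.
namespace = {}
exec(f"{ten}= 10", namespace)

print(category, is_identifier, numeric_value, is_decimal_digit)
print(namespace[ten])
```

The character is reported as a letter (category `Lo`) that also has a numeric value, which is why it can start an identifier even though it "means" a number.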

Patrick Schultz wrote:
...
- The Unicode consortium has a list of confusables, in case useful
Yup, and it's linked from the documents that describe how to use it. I 
link to those rather than just the list.
But thank you!
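
As far as I know, the standard library does not ship the confusables data itself, so the sketch below does something much cruder: it lists the Unicode names of any non-ASCII characters in an identifier. The helper `describe_non_ascii` and the sample strings are hypothetical, and this is not a substitute for the Unicode consortium's confusable-detection procedure.

```python
import unicodedata

def describe_non_ascii(identifier):
    """Return (character, Unicode name) pairs for each non-ASCII character."""
    return [
        (ch, unicodedata.name(ch, "<unnamed>"))
        for ch in identifier
        if ord(ch) > 127
    ]

latin = "scope"
# Cyrillic letters that resemble the Latin word "scope" in many fonts.
lookalike = "\u0455\u0441\u043e\u0440\u0435"

report = describe_non_ascii(lookalike)
for ch, name in report:
    print(f"U+{ord(ch):04X} {name}")
```

Both strings are valid identifiers, and they are different names to Python even where they render identically.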

Terry Reedy wrote:
...
...
...
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling right-to-left characters: none, local (contiguous sequences are properly reversed), and extended (see below).  The handling depends on the display software and may depend on the quoting.  Tk and hence tkinter (and IDLE) text widgets do local handling.  Windows Notepad++ does local handling of unquoted code but extended handling of quoted text.  Windows Notepad currently does extended handling even without quotes.
I'd like to leave these details out of the document. The examples should 
render convincingly in browsers. The text should now describe the 
behavior even if you open it in an editor that does things differently, 
and acknowledge that such editors exist. (The behavior of specific 
editors/toolkits might well change in the future.)
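
Since rendering differs between tools, one display-independent check is to look for the explicit bidirectional control characters in the source text. The following is only a sketch: the table lists the embedding, override and isolate controls (U+202A to U+202E, U+2066 to U+2069), and `find_bidi_controls` is a hypothetical helper, not something from the PEP.

```python
# Explicit bidirectional formatting characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source):
    """Return (line number, column, abbreviation) for each control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits

sample = "x = 1\ns = 'a\u202eb\u202c'\n"
print(find_bidi_controls(sample))
```

A check like this reports where the controls are, regardless of how an editor, terminal or mail client chooses to display them.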
...
...
...
For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::
I don't see the connection between the text above and the example that follows.
...
...
# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
Let me know if it's clear in the newest version, with this note:
...
Here, ``encoding: unicode_escape`` in the initial comment is an encoding
declaration. The ``unicode_escape`` encoding instructs Python to treat
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
a comma (punctuator), etc.
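
The effect of the codec can be seen in isolation by decoding a few bytes with it. This sketch only demonstrates the `unicode_escape` codec on byte strings; it does not run a file with an encoding declaration, and the variable names are illustrative.

```python
# Raw bytes as they would appear in the source file: backslash, 'u', hex digits.
punctuation = rb"\u0027\u002c\u0028"
japanese = rb"\u3053\u3093"

decoded_punctuation = punctuation.decode("unicode_escape")
decoded_japanese = japanese.decode("unicode_escape")

print(repr(decoded_punctuation))  # a quote, a comma and an opening parenthesis
print(decoded_japanese)
```

With the encoding declaration in place, this decoding happens before tokenizing, so the resulting quote can end a string literal that tools reading the raw file still see as continuing.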

Steven D'Aprano wrote:
...
...
...
Before the age of computers, most mechanical typewriters lacked the keys 
for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular 
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked 
the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n...
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford 
Typewriter. As did possibly the best selling typewriter in history, the 
IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit...
Perhaps you should say "many older mechanical typewriters"?
Ah, interesting! I only ever saw and read about ones that have a bunch 
of accented letters, leaving no space for dedicated 0/1 keys :)
My typewriter looks like this: https://imgur.com/a/J34gqVZ
...
...
...
Bidirectional Text
------------------
The section on bidirectional text is interesting, because reading it in 
my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional 
text.
It might be handled by your terminal rather than mutt.
I made the text work even if the examples don't render the way I'd like.