On 01. 11. 21 13:17, Petr Viktorin wrote:
Hello,
Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.
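To make the first two gotchas concrete, here is a minimal sketch using only the standard library. The names `append_item` and `bucket` are made up for illustration and do not come from the PEP.

```python
import ast

# `a or b == 'yes'` parses as `a or (b == 'yes')`, not `(a or b) == 'yes'`.
top = ast.parse("a or b == 'yes'", mode="eval").body
is_or = isinstance(top, ast.BoolOp) and isinstance(top.op, ast.Or)
right_is_comparison = isinstance(top.values[1], ast.Compare)

a, b = "no", "no"
result = a or b == 'yes'  # `a` is a non-empty string, so the result is `a`


# Mutable default argument: the list is created once, at definition time,
# and shared by every call that does not pass `bucket` explicitly.
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)  # same list object as `first`
```

Neither case involves Unicode at all; both are plain ASCII code that a non-expert reader is likely to misread.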
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129
A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst
Toshio Kuratomi wrote:
`Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters.
Thanks! That's a nice summary; I condensed it a bit more and used it. (I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".)
Jim J. Jewett wrote:
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."
Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.
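The distinction drawn above can be checked with the standard `unicodedata` module. This is only a sketch; the variable names are illustrative.

```python
import unicodedata

ten = "\u5341"  # the character 十

# It is a letter (category Lo), so it may start an identifier ...
ten_is_identifier = ten.isidentifier()       # True
ten_category = unicodedata.category(ten)     # 'Lo'
# ... even though Unicode also assigns it a numeric value.
ten_value = unicodedata.numeric(ten)         # 10.0

# An ASCII digit is category Nd and cannot start an identifier.
one_is_identifier = "1".isidentifier()       # False
one_category = unicodedata.category("1")     # 'Nd'

# So the statement is a plain assignment to a variable named 十.
namespace = {}
exec(f"{ten}= 10", namespace)
assigned = namespace[ten]                    # 10
```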
I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)
Patrick Schultz wrote:
- The Unicode Consortium has a list of confusables, in case it's useful
Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list. But thank you!
Terry Reedy wrote:
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes.
I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
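As a rough illustration of that note, the same bytes can be decoded two ways. This sketch decodes a single line by hand rather than going through a real source file with an encoding declaration, and the sample line is invented for the purpose.

```python
# One line of source, as raw bytes on disk.
raw_line = rb"s = 'a\u0027 + \u0027b'"

# Decoded as UTF-8, the backslash sequences stay inside one string literal.
as_utf8 = raw_line.decode("utf-8")
# Decoded as unicode_escape, \u0027 becomes a real quote before parsing,
# so the line turns into two string literals joined by `+`.
as_escape = raw_line.decode("unicode_escape")

ns_utf8, ns_escape = {}, {}
exec(as_utf8, ns_utf8)
exec(as_escape, ns_escape)

print(repr(ns_utf8["s"]))    # "a' + 'b"
print(repr(ns_escape["s"]))  # 'ab'
```

A tool that assumes UTF-8 sees the first reading, while Python with the declaration sees the second.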
Steven D'Aprano wrote:
Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n...
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit...
Perhaps you should say "many older mechanical typewriters"?
Ah, interesting! I only ever saw and read about ones that have a bunch of accented letters, leaving no space for dedicated 0/1 keys :) My typewriter looks like this: https://imgur.com/a/J34gqVZ
Bidirectional Text
------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional text.
It might be handled by your terminal rather than mutt. I made the text work even if the examples don't render the way I'd like.
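Since rendering varies between tools, one way to inspect text independently of how it is displayed is to look at each character's bidirectional class. The following is a minimal sketch, not something taken from the PEP: the helper name `find_bidi_controls` and the sample string are hypothetical, and it only flags explicit embedding, override and isolate controls (not, for example, the left-to-right and right-to-left marks).

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",  # embeddings and overrides
    "LRI", "RLI", "FSI", "PDI",         # isolates
}


def find_bidi_controls(text):
    """Return (index, character name) for each explicit bidi control in text."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


# Invented sample line containing an override and isolates.
sample = "access = 'user\u202e \u2066# admin\u2069 \u2066'"
hits = find_bidi_controls(sample)
for index, name in hits:
    print(index, name)
```

Printing the names this way works the same in mutt, a terminal, or an editor, whatever their bidirectional support.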