pre-PEP: Unicode Security Considerations for Python
Hello,

Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.

I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.

Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
PEP: 9999
Title: Unicode Security Considerations for Python
Author: Petr Viktorin <encukou@gmail.com>
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 01-Nov-2021
Post-History:
Abstract
========

This document explains possible ways to misuse Unicode to write Python programs that appear to do something other than what they actually do.

This document does not give any recommendations or solutions.
Introduction
============
Python code is written in `Unicode`_ – a system for encoding and handling all kinds of written language. While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.
It is possible to misuse Python's Unicode-related features to write code that *appears* to do something other than what it actually does. Evildoers could take advantage of this to trick code reviewers into accepting malicious code.

The possible issues generally can't be solved in Python itself without excessive restrictions of the language. They should be solved in code editors and review tools (such as *diff* displays), by enforcing project-specific policies, and by raising the awareness of individual programmers.
This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind.
This document is specific to Python. For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
Acknowledgement
===============

Investigation for this document was prompted by [CVE-2021-42574]_, *Trojan Source Attacks*, reported by Nicholas Boucher and Ross Anderson, which focuses on bidirectional override characters in a variety of languages.
Confusing Features
==================
This section lists some Unicode-related features that can be surprising or misusable.
ASCII-only Considerations
-------------------------

ASCII is a subset of Unicode.
While issues with the ASCII character set are generally well understood, they're presented here to help in understanding the non-ASCII cases.
Confusables and Typos
'''''''''''''''''''''

Some characters look alike. Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l`` (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, the distinction between digits and letters is critical -- and most fonts designed for programmers make it easy to tell them apart.
Similarly, the uppercase “I” and lowercase “l” can look similar in fonts designed for human languages, but programmers' fonts make them noticeably different.
However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name ``accessibi1ity_options`` can still look indistinguishable from ``accessibility_options``, while they are distinct for the compiler.
The same can be said for plain typos: most humans will not notice the typo in ``responsbility_chain_delegate``.
Control Characters
''''''''''''''''''

Python generally considers ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF`` pairs (``\r\n``) as end-of-line characters. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or nothing at all), rather than ending the line, displaying this example::

   # Don't call this function:
   fire_the_missiles()
as a harmless comment like::
   # Don't call this function:⬛fire_the_missiles()
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
Some characters can be used to hide/overwrite other characters when source is listed in common terminals:
* BS (``\b``, Backspace) moves the cursor back, so the character after it will overwrite the character before.
* CR (``\r``, carriage return) moves the cursor to the start of line, so subsequent characters overwrite the start of the line (see the example below).
* DEL (``\x7F``) commonly initiates escape codes which allow arbitrary control of the terminal.
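For example, the effect of ``CR`` can be reproduced by printing it directly; in a common terminal, only the text after the ``\r`` stays visible, so the following line appears to output nothing but a harmless comment (the exact behaviour varies by terminal)::

   print("fire_the_missiles()\r# this is a harmless comment ")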
Confusable Characters in Identifiers
------------------------------------

Python allows characters from all scripts – from Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as variable names). See :pep:`3131` for details and rationale. Only “letters and numbers” are allowed (see `Identifiers and keywords`_ for details), so while ``γάτα`` is a valid Python identifier, ``🐱`` is not. Non-printing control characters are also not allowed.
However, within the allowed set there is a large number of “confusables”. For example, the uppercase versions of the Latin `b`, Greek `β` (Beta), and Cyrillic `в` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
This allows identifiers that look the same to humans, but not to Python. For example, all of the following are distinct identifiers:
* ``scope`` (Latin, ASCII-only)
* ``scоpe`` (with a Cyrillic `о`)
* ``scοpe`` (with a Greek `ο`)
* ``ѕсоре`` (all Cyrillic letters)
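A short demonstration, using the identifiers listed above (the assigned numbers are just arbitrary markers)::

   scope = 1   # Latin, ASCII-only
   scоpe = 2   # with a Cyrillic `о`
   scοpe = 3   # with a Greek `ο`
   ѕсоре = 4   # all Cyrillic letters

   print(scope, scоpe, scοpe, ѕсоре)   # prints: 1 2 3 4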
Additionally, some letters can look like non-letters:
* The letter for the Hawaiian *ʻokina* looks like an apostrophe; ``ʻHelloʻ`` is a Python identifier, not a string.
* The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement.
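Both of these can be tried directly; the following is a complete, runnable snippet (the assigned values are arbitrary)::

   ʻHelloʻ = "this is a variable, not a string"
   十 = 10
   print(ʻHelloʻ, 十)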
.. note::
   The converse also applies – some symbols look like letters – but since Python does not allow arbitrary symbols in identifiers, this is not an issue.
Confusable Digits
-----------------
Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such as ``.`` or ``e``).
However, when numbers are converted from strings, such as in the ``int`` and ``float`` constructors or by the ``str.format`` method, any decimal digit can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫`` (``TAMIL DIGIT FIVE``) work as the digit ``5``.
Some scripts include digits that look similar to ASCII ones, but have a different value. For example::
   >>> int('৪୨')
   42
   >>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
   'five'
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left. Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
* In the statement ``قيمة = ערך``, the variable ``قيمة`` is set to the value of ``ערך``.
* In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and then subtracted from ``قيمة``. The *opening* parenthesis is displayed as ``)``.
* In the following, the second line is the same as the first, except ``A`` is replaced by the Hebrew ``א``. Both assign a 100-character string to the variable ``s``. Note how the symbols and numbers between the Hebrew characters are shown in reverse order::
s = "A" * 100 # "A" is assigned
s = "א" * 100 # "א" is assigned
Bidirectional Marks, Embeddings, Overrides and Isolates
-------------------------------------------------------
The rules for determining the direction of text do not always yield the intended results, so Unicode provides several ways to alter it.
The most basic are **directional marks**, which are invisible but affect text as a left-to-right (or right-to-left) character would. Continuing the example above, in the next example the ``A``/``א`` is replaced by the Latin ``x`` followed or preceded by a right-to-left mark (``U+200F``). This assigns a 200-character string to ``s`` (100 copies of `x` interspersed with 100 invisible marks)::
s = "x" * 100 # "x" is assigned
The directional **embedding**, **override** and **isolate** characters are also invisible, but affect the ordering of all text after them until either ended by a dedicated character, or until the end of line. (Unicode specifies the effect to last until the end of a “paragraph” (see [tr9]_), but allows tools to interpret newline characters as paragraph ends (see [u5.8]_). Most code editors and terminals do so.)
These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python's comments always extend to the end of a line), but it doesn't render them harmless.
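Note that even when the displayed order changes, Python itself still stores and processes the characters in logical order; for example::

   >>> s = "a" + "\N{RIGHT-TO-LEFT OVERRIDE}" + "bc"
   >>> len(s)
   4
   >>> print(ascii(s))
   'a\u202ebc'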
Normalizing identifiers
-----------------------
Python strings are collections of *Unicode codepoints*, not “characters”.
For reasons like compatibility with earlier encodings, Unicode often has several ways to encode what is essentially a single “character”. For example, these are all different ways of writing ``Å`` as a Python string, each of which is unequal to the others:
* ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint) * ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints) * ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
For another example, the ligature ``ﬁ`` has a dedicated Unicode codepoint, even though it has the same meaning as the two letters ``fi``.
Also, common letters frequently have several distinct variations. Unicode provides them for contexts where the difference has some semantic meaning, like mathematics. For example, some variations of ``n`` are:
* ``n`` (LATIN SMALL LETTER N)
* ``𝐧`` (MATHEMATICAL BOLD SMALL N)
* ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
* ``ｎ`` (FULLWIDTH LATIN SMALL LETTER N)
* ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
Unicode includes algorithms to *normalize* variants like these to a single form, and Python identifiers are normalized. (There are several normal forms; Python uses ``NFKC``.)
For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
   >>> xⁿ = 8
   >>> xn
   8
… as are ``ﬁ`` and ``fi``, and as are the different ways to encode ``Å``.
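The same normalization can be applied to strings explicitly using the ``unicodedata`` module, which makes the relationship between the variants visible::

   >>> import unicodedata
   >>> a1 = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"
   >>> a2 = "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"
   >>> a3 = "\N{ANGSTROM SIGN}"
   >>> a1 == a2, a1 == a3
   (False, False)
   >>> {unicodedata.normalize("NFKC", s) for s in (a1, a2, a3)}
   {'Å'}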
This normalization applies *only* to identifiers, however. Functions that treat strings as identifiers, such as ``getattr``, do not perform normalization::
   >>> class Test:
   ...     def ﬁnalize(self):
   ...         print('OK')
   ...
   >>> Test().ﬁnalize()
   OK
   >>> Test().finalize()
   OK
   >>> getattr(Test(), 'ﬁnalize')
   Traceback (most recent call last):
     ...
   AttributeError: 'Test' object has no attribute 'ﬁnalize'
This also applies when importing:
* ``import ﬁnalization`` performs normalization, and looks for a file named ``finalization.py`` (and other ``finalization.*`` files).
* ``importlib.import_module("ﬁnalization")`` does not normalize, so it looks for a file named ``ﬁnalization.py``.
Some filesystems independently apply normalization and/or case folding. On some systems, ``ﬁnalization.py``, ``finalization.py`` and ``FINALIZATION.py`` are three distinct filenames; on others, some or all of these can name the same file.
Source Encoding
---------------
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
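For instance, the standard library's ``tokenize.detect_encoding`` reads the declaration in essentially the same way, which makes it easy to check what encoding Python will use for a given file (a quick illustration, not part of the mechanism itself)::

   >>> import io, tokenize
   >>> source = b"# coding: unicode_escape\nprint('hi')\n"
   >>> tokenize.detect_encoding(io.BytesIO(source).readline)[0]
   'unicode_escape'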
This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_). For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
   # For writing Japanese, you don't need an editor that supports
   # UTF-8 source encoding: unicode_escape sequences work just as well.

   import os

   message = '''
   This is "Hello World" in Japanese:
   \u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c

   This runs `echo WHOA` in your shell:
   \u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
   \u0073\u0079\u0073\u0074\u0065\u006d\u0028
   \u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
   \u0029\u0029\u002c\u0027\u0027\u0027
   '''
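Here, the ``unicode_escape`` codec turns each ``\uXXXX`` sequence into the corresponding character before the code is parsed; decoding a couple of the sequences directly shows what they become::

   >>> b"\\u0027".decode("unicode_escape")
   "'"
   >>> b"\\u002c\\u0028\\u0029".decode("unicode_escape")
   ',()'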
Open Issues
===========
We should probably write and publish:
* Recommendations for Text Editors and Code Tools
* Recommendations for Programmers and Teams
* Possible Improvements in Python
References
==========

.. _Unicode: https://home.unicode.org/
.. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarati...
.. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
.. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
.. [u5.8] http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
.. [tr9] http://www.unicode.org/reports/tr9/
.. [tr36] Unicode Technical Report #36: Unicode Security Considerations
   http://www.unicode.org/reports/tr36/
.. [tr39] Unicode® Technical Standard #39: Unicode Security Mechanisms
   http://www.unicode.org/reports/tr39/
.. [CVE-2021-42574] CVE-2021-42574
   https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
Thanks for writing this Petr! A few comments below. On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote:
ASCII-only Considerations -------------------------
ASCII is a subset of Unicode
While issues with the ASCII character set are generally well understood, they're presented here to help in understanding the non-ASCII cases.
You should mention that some very common typefaces (fonts) are more confusable than others. For instance, Arial (a common font on Windows systems) makes the two letter combination 'rn' virtually indistinguishable from the single letter 'm'.
Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O. https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n... The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961). http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit... Perhaps you should say "many older mechanical typewriters"?
Bidirectional Text ------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right. You might like to note that not all applications support bidirectional text.
Unicode includes alorithms to *normalize* variants like these to a single form, and Python identifiers are normalized.
Typo: "algorithms". This is a good and useful document, thank you again. -- Steve
This is excellent!

01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it. It only happens when you read the Python script from a file. If you import it as a module or run with runpy, the NUL character is an error.
Some characters can be used to hide/overwrite other characters when source is listed in common terminals:
* BS (``\b``, Backspace) moves the cursor back, so the character after it will overwrite the character before.
* CR (``\r``, carriage return) moves the cursor to the start of line, subsequent characters overwrite the start of the line.
* DEL (``\x7F``) commonly initiates escape codes which allow arbitrary control of the terminal.
ESC (``\x1B``) starts many control sequences. ``\x1A`` means the end of the text file on Windows. Some programs (for example "type") ignore the rest of the file.
On 01. 11. 21 18:32, Serhiy Storchaka wrote:
This is excellent!
01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it. It only happens when you read the Python script from a file. If you import it as a module or run with runpy, the NUL character is an error.
That brings us to possible changes in Python in this area, which is an interesting topic.

As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.

For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in Hebrew/Arabic? AFAIK, we should allow a few non-printing "joiner"/"non-joiner" characters to make it possible to use all Arabic words. But it would be great to consult with users/teachers of the languages. Should Python run the bidi algorithm when parsing and disallow reordered tokens? Maybe optionally?
02.11.21 16:16, Petr Viktorin wrote:
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them. There is a precedent of emitting warnings for some superfluous escapes in strings.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin. It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
Serhiy Storchaka wrote:
02.11.21 16:16, Petr Viktorin wrote:
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them.
If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin.
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _). Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.
It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell. -jJ
Jim J. Jewett writes:
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
This would ban "παν語", aka "pango". That's arguably a good idea (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.
Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.
Interesting. I maintained a couple of Emacs libraries (dictionaries and input methods) for Japanese in XEmacs, and while hyphen-separated mixtures of ASCII and Japanese are common, I don't recall ever seeing an identifier with ASCII and Japanese glommed together without a separator. It was almost always of the form "English verb - Japanese lexical component". Or do you consider that "relatively complicated"?
It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.
It would have to be easy to turn off, perhaps even provide instructions in the messages. I would guess that for code that uses it at all, it would be common. So the warnings would likely make those tools somewhere between really annoying and unusable.
Stephen J. Turnbull wrote:
Jim J. Jewett writes:
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
This would ban "παν語", aka "pango". That's arguably a good idea (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.
I am not quite motivated enough to search the archives, but I'm pretty sure the examples actually found were less prominent than that. There seemed to be at least one or two fora where it was something of a local idiom.
... I don't recall ever seeing an identifier with ASCII and Japanese glommed together without a separator. It was almost always of the form "English verb - Japanese lexical component".
The problem was that some were written without a "-" or "_" to separate the halves. It looked fine -- the script change was obvious to even someone who didn't speak the non-English language. But having to support that meant any remaining restriction on mixed scripts would be either too weak to be worthwhile, or too complicated to write into the python language specification. -jJ
02.11.21 18:49, Jim J. Jewett wrote:
If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.
If you mean backslash-escaped sequences like \uXXXX, there is no reason to ban them in comments. Unlike in Java, they do not have special meaning outside of string literals. But if you mean terminal control sequences (which change color or move the cursor), they should not be allowed in comments.
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
I implemented these restrictions in one of my projects. The character set was limited, and even this did not solve all issues with homoglyphs. I think that we should not introduce such arbitrary limitations at the parser level and should leave this to linters.
Serhiy Storchaka writes:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too.
+1
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic.
+1 (for the comment and for the implied -1 on SyntaxWarning, let's keep the Cyrillic repertoire in Python!)
It is a work for linters,
+1

Aside from the reasons Serhiy presents, I'd rather not tie this kind of rather ambiguous improvement in Unicode handling to the release cycle.

It might be worth having a pep9999 module/script in Python (perhaps more likely, PyPI but maintained by whoever does the work to make these improvements + Petr or somebody Petr trusts to do it), that lints scripts specifically for confusables and other issues.

Steve
We seem to agree that this is work for linters. That's reasonable; I'd generalize it to "tools and policies". But even so, discussing what we'd expect linters to do is on topic here. Perhaps we can even find ways for the language to support linters -- type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it finds a non-ASCII identifier? (See below for why ASCII is special.) Or for encoding declarations?

On 03. 11. 21 6:26, Stephen J. Turnbull wrote:
Serhiy Storchaka writes:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too.
+1
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic.
+1 (for the comment and for the implied -1 on SyntaxWarning, let's keep the Cyrillic repertoire in Python!)
I don't think this would actually ban Cyrillic/Greek. (My suggestion is not vanilla confusables detection; it might require careful reading: "should there be a [linter] warning when an identifier looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual ASCII-like word in a language that uses Cyrillic. I didn't succeed; I think they might be very rare. Even if there were such a word -- or a one-letter abbreviation used as a variable name -- it would be confusing to use. Removing the possibility of confusion could *help* Cyrillic users. (I can't speak for them; this is just a brainstorming idea.)

Steven adds:
Let's not enshrine as a language "feature" that non Western European languages are dangerous second-class citizens.
That would be going too far, yes, but the fact is that non-English languages *are* second-class citizens. Code that uses Python keywords and stdlib must use English, and possibly another language. It is the mixing of languages that can be dangerous/confusing, not the languages themselves.
It is a work for linters,
+1
Aside from the reasons Serhiy presents, I'd rather not tie this kind of rather ambiguous improvement in Unicode handling to the release cycle.
It might be worth having a pep9999 module/script in Python (perhaps more likely, PyPI but maintained by whoever does the work to make these improvements + Petr or somebody Petr trusts to do it), that lints scripts specifically for confusables and other issues.
If I have any say in it, the name definitely won't include a PEP number ;)
03.11.21 14:31, Petr Viktorin wrote:
For example: should the parser emit a lightweight audit event if it finds a non-ASCII identifier? (See below for why ASCII is special.) Or for encoding declarations?
There are audit events for import and compile. You can also register import hooks if you want fancier preprocessing than just Unicode decoding. I do not think we need to add more specific audit events; they were not designed for this.

And I think it is too late to detect suspicious code at the time of its execution. It should be detected before adding that code to the code base (review tools, pre-commit hooks).
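For illustration, a hook on the existing "compile" audit event can already be used to flag non-ASCII source when it is compiled (a rough sketch; the policy of just printing a note is made up for this example):

    import sys

    def flag_non_ascii(event, args):
        # The "compile" audit event carries (source, filename).
        if event == "compile":
            source, filename = args[0], args[1]
            if isinstance(source, str) and not source.isascii():
                print("note: compiling non-ASCII source:", filename, file=sys.stderr)

    sys.addaudithook(flag_non_ascii)
    compile("π = 3.14159", "<example>", "exec")   # triggers the note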
I don't think this would actually ban Cyrillic/Greek. (My suggestion is not vanilla confusables detection; it might require careful reading: "should there be a [linter] warning when an identifier looks like ASCII but isn't?")
Yes, but it should be optional and configurable and not be the part of the Python compiler. This is not our business as Python core developers.
I am not a native speaker, but I did try a bit to find an actual ASCII-like word in a language that uses Cyrillic. I didn't succeed; I think they might be very rare.
With a simple script I have found 62 words common between English and Ukrainian: гасу/racy, горе/rope, рима/puma, міх/mix, etc. But there are many more English and Ukrainian words which contain only letters that can be confused with letters from the other script. And identifiers can contain abbreviations and shortenings; not all of them can be found in dictionaries.
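For the record, a rough sketch of that kind of script (the homoglyph table and word lists below are tiny hand-picked samples, not real dictionaries):

    # Map Cyrillic letters to Latin letters they can be confused with.
    CYRILLIC_TO_LATIN = str.maketrans({
        "а": "a", "с": "c", "е": "e", "і": "i", "о": "o",
        "р": "p", "у": "y", "г": "r", "м": "m", "х": "x",
    })

    english_words = {"racy", "rope", "mix"}      # stand-in for a real word list
    ukrainian_words = {"гасу", "горе", "міх"}    # stand-in for a real word list

    for word in ukrainian_words:
        lookalike = word.translate(CYRILLIC_TO_LATIN)
        if lookalike in english_words:
            print(f"{word} can be confused with {lookalike}")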
Even if there was such a word -- or a one-letter abbreviation used as a variable name -- it would be confusing to use. Removing the possibility of confusion could *help* Cyrillic users. (I can't speak for them; this is just a brainstorming idea.)
I never used non-Latin identifiers in Python, but I guess that where they are used (in schools?) there is a mix of English and non-English identifiers, and identifiers consisting of parts of English and non-English words without even an underscore between them. I know, because in other languages people just use inconsistent transliteration. Emitting any warning by default is a discrimination against non-English users. It would be better not to add support for non-ASCII identifiers in the first place.
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them. There is a precedent of emitting warnings for some superfluous escapes in strings.
Agreed. I don't think there is any good reason for including control characters (apart from whitespace) in comments. In strings, I would consider allowing VT (vertical tab) as well, since that is whitespace:

    >>> '\v'.isspace()
    True
But I don't have a strong opinion on that. [Petr]
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
Let's not enshrine as a language "feature" that non Western European languages are dangerous second-class citizens.
It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin.
Agreed.
It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
Linters and editors. I have no objection to people using editors that highlight non-ASCII characters in blinking red letters, so long as I can turn that option off :-) -- Steve
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin <encukou@gmail.com> wrote:
That brings us to possible changes in Python in this area, which is an interesting topic.
Is there a use case or need for allowing the comment-starting character “#” to occur when text is still in the right-to-left direction? Disallowing that would prevent Petr’s examples in which active code is displayed after the comment mark, which to me seems to be one of the more egregious examples. Or maybe this case is no worse than others and isn’t worth singling out. —Chris
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
For right-to-left text: does anyone actually name identifiers in Hebrew/Arabic? AFAIK, we should allow a few non-printing "joiner"/"non-joiner" characters to make it possible to use all Arabic words. But it would be great to consult with users/teachers of the languages. Should Python run the bidi algorithm when parsing and disallow reordered tokens? Maybe optionally?
Serhiy Storchaka writes:
This is excellent!
01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it.
You can't, probably not for a decade, because people will be running versions of Python released before you change it. I hope this PEP will address Python as it is, as well as how it will be.
It only happens when you read the Python script from a file.
Which is one of the likely vectors for malware. It might be worth teaching virus checkers about this, for example.
This is an excellent enumeration of some of the concerns! One minor comment about the introductory material: On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin <encukou@gmail.com> wrote:
Introduction ============
Python code is written in `Unicode`_ – a system for encoding and handling all kinds of written language.
Unicode specifies the mapping of glyphs to code points. Then a second mapping from code points to sequences of bytes is what is actually recorded by the computer. The second mapping is what programmers using Python will commonly think of as the encoding, while the majority of what you're writing about has more to do with the first mapping.

I'd try to word this in a way that doesn't lead a reader to conflate those two mappings. Maybe something like this?

    `Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters.
While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.
-Toshio
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote:
Unicode specifies the mapping of glyphs to code points. Then a second mapping from code points to sequences of bytes is what is actually recorded by the computer. The second mapping is what programmers using Python will commonly think of as the encoding while the majority of what you're writing about has more to do with the first mapping.
I don't think that is correct. According to the Unicode consortium -- and I hope that they would know *wink* -- Unicode is the universal character encoding. In other words: "Unicode provides a unique number for every character" https://www.unicode.org/standard/WhatIsUnicode.html

Not glyphs. ("Character" in natural language is a bit of a fuzzy concept, so I think that Unicode here is referring to what their glossary calls an abstract character.) The usual meaning of glyph is for the graphical images used by fonts (typefaces) for display. Sense 2 in the Unicode glossary here: https://www.unicode.org/glossary/#glyph

I'm not really sure what they mean by sense 1, unless they mean a representative glyph, which is intended to stand in as an example of the entire range of glyphs. Unicode does not specify what the glyphs for code points are, although it does provide representative samples. See, for example, their comment on emoji: "The Unicode Consortium provides character code charts that show a representative glyph" http://www.unicode.org/faq/emoji_dingbats.html

Their code point charts likewise show representative glyphs for other letters and symbols, not authoritative. And of course, many abstract characters do not have glyphs at all, e.g. invisible joiners, control characters, variation selectors, noncharacters, etc.

The mapping from bytes to code points and abstract characters is also part of Unicode. The UTF encodings are part of Unicode: https://www.unicode.org/faq/utf_bom.html#gen2 The "U" in UTF literally stands for Unicode :-)

-- Steve
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement." Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own. -jJ
On 11/1/2021 8:17 AM, Petr Viktorin wrote:
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Very helpful.
Bidirectional Text ------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.] There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes. In extended handling, phrases ...
Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
In local handling, one sees <hebrew-rtl> = 23. In extended handling, one sees 23 = <hebrew-rtl>. (Notepad++ sees backticks as quotes.)
Source Encoding ---------------
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_).
Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something?
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports # UTF-8 source encoding: unicode_escape sequences work just as well. [etc]
-- Terry Jan Reedy
On 01. 11. 21 13:17, Petr Viktorin wrote:
Hello, Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129. A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst

Toshio Kuratomi wrote:
`Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters.
Thanks! That's a nice summary; I condensed it a bit more and used it. (I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".) Jim J. Jewett wrote:
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."
Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.
I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)

Patrick Schultz wrote:
- The Unicode consortium has a list of confusables, in case useful
Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list. But thank you! Terry Reedy wrote:
Bidirectional Text ------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handing. Windows Notepad++ does local handling of unquoted code but extending handling of quoted text. Windows Notepad currently does extended handling even without quotes.
I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Steven D'Aprano wrote:
Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justifed here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n...
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit...
Perhaps you should say "many older mechanical typewriters"?
Ah, interesting! I only ever saw and read about ones that have a bunch of accented letters, leaving no space for dedicated 0/1 keys :) My typewriter looks like this: https://imgur.com/a/J34gqVZ
Bidirectional Text ------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional text.
It might be handled by your terminal rather than mutt. I made the text work even if the examples don't render the way I'd like.
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings? ChrisA
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding? Personally, I think that using obscure encodings as the source encoding is one of those "linters and code reviews should check it" issues. Besides, now that I've learned about this unicode_escape encoding, I think that's going to be *awesome* for winning obfuscated Python competitions! *wink* -- Steve
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding?
Only source code encodings. Obviously we still need to be able to cope with all manner of *data*, but Python source code shouldn't need to be in bizarre, weird encodings. (Honestly, I'd love to just require that Python source code be UTF-8, but that would probably cause problems, so mandating that it be one of a small set of encodings would be a safer option.)
Personally, I think that using obscure encodings as the source encoding is one of those "linters and code reviews should check it" issues.
Besides, now that I've learned about this unicode_escape encoding, I think that's going to be *awesome* for winning obfuscated Python competitions! *wink*
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood. But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance. ChrisA
Chris Angelico wrote:
I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
If it becomes important, the cheeseshop backend can run various validations (including a linter) on submissions, and include those results in the display template.
On 03.11.2021 01:21, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding?
Only source code encodings. Obviously we still need to be able to cope with all manner of *data*, but Python source code shouldn't need to be in bizarre, weird encodings.
(Honestly, I'd love to just require that Python source code be UTF-8, but that would probably cause problems, so mandating that it be one of a small set of encodings would be a safer option.)
Most Python code will be written in UTF-8 going forward, but there's still a lot of code out there in other encodings. Limiting this to some reduced set doesn't really make sense, since it's not clear where to draw the line.

Coming back to the thread topic, many of the Unicode security considerations don't apply to non-Unicode encodings, since those usually don't support e.g. changing the bidi direction within a stream of text or other interesting features you have in Unicode such as combining code points, invisible (space) code points, font rendering hint code points, etc. So in a sense, those non-Unicode encodings are safer than using UTF-8 :-)

Please also note that most character lookalikes are not encoding issues, but instead font issues, which then result in the characters looking similar. There are fonts which are designed to avoid this and it's no surprise that source code fonts typically do make e.g. 0 and O, as well as 1 and l look sufficiently different to be able to notice the difference.

Things get a lot harder when dealing with combining characters, since it's not always easy to spot the added diacritics, e.g. try this:
    >>> print ('a\u0348bc')   # strong articulation
    a͈bc
    >>> print ('a\u034Fbc')   # combining grapheme joiner
    a͏bc
The latter is only "visible" in the unicode_escape encoding:
    >>> print ('a\u034Fbc'.encode('unicode_escape'))
    b'a\\u034fbc'
Projects wanting to limit code encoding settings, disallow using bidi markers and other special code points in source code, can easily do this via e.g. pre-commit hooks, special editor settings, code linters or security scanners.

I don't think limiting the source code encoding is the right approach to making code more secure. Instead, tooling has to be used to detect potentially malicious code points in code.
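As a very bare-bones sketch of such a check (e.g. the core of a pre-commit hook), one could scan files for the bidirectional control code points; the set below is a small illustrative sample rather than a complete or recommended policy:

    import sys
    import unicodedata

    BIDI_CONTROLS = {
        "\N{LEFT-TO-RIGHT EMBEDDING}", "\N{RIGHT-TO-LEFT EMBEDDING}",
        "\N{LEFT-TO-RIGHT OVERRIDE}", "\N{RIGHT-TO-LEFT OVERRIDE}",
        "\N{LEFT-TO-RIGHT ISOLATE}", "\N{RIGHT-TO-LEFT ISOLATE}",
        "\N{FIRST STRONG ISOLATE}",
        "\N{POP DIRECTIONAL FORMATTING}", "\N{POP DIRECTIONAL ISOLATE}",
    }

    def check_file(path):
        problems = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                for ch in line:
                    if ch in BIDI_CONTROLS:
                        problems.append(f"{path}:{lineno}: {unicodedata.name(ch)}")
        return problems

    if __name__ == "__main__":
        issues = [msg for path in sys.argv[1:] for msg in check_file(path)]
        for msg in issues:
            print(msg)
        sys.exit(1 if issues else 0)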
-- Marc-Andre Lemburg
On Wed, 3 Nov 2021 at 10:11, Marc-Andre Lemburg <mal@egenix.com> wrote:
I don't think limiting the source code encoding is the right approach to making code more secure. Instead, tooling has to be used to detect potentially malicious code points in code.
+1

Discussing "making code more secure" without being clear on what the threat model is, is always going to be inconclusive. In this case, I believe the threat model is "an untrusted 3rd party submitting a PR which potentially contains malicious code to a Python project". For that threat, I think the correct approach is for core Python to promote awareness (via this PEP and maybe something in the docs themselves) and for projects to implement appropriate code checks that are run against all PRs to flag this sort of issue.

What threat can't be addressed at a per-project level, but *can* be addressed in core Python (without triggering so many false positives that people are trained to ignore the warnings or work around the prohibitions, defeating the purpose of the change)?

Paul
On Wed, Nov 03, 2021 at 11:11:00AM +0100, Marc-Andre Lemburg wrote:
Coming back to the thread topic, many of the Unicode security considerations don't apply to non-Unicode encodings, since those usually don't support e.g. changing the bidi direction within a stream of text or other interesting features you have in Unicode such as combining code points, invisible (space) code points, font rendering hint code points, etc.
So in a sense, those non-Unicode encodings are safer than using UTF-8 :-)
Thank you MAL for that timely reminder that most encodings are not Unicode. I have to admit that I often forget that there is a whole universe of non-Unicode, non-ASCII encodings.
Please also note that most character lookalikes are not encoding issues, but instead font issues, which then result in the characters looking similar.
+1 -- Steve
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?

The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues. If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.

I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code. https://en.wikipedia.org/wiki/Underhanded_C_Contest Is it unfamiliarity? Worse? "Real programmers write identifiers in English."

And the ironic thing is, while it is very difficult indeed for automated checkers to detect underhanded code in ASCII, it is trivially easy for editors, linters and other tools to spot the sort of Unicode confusables we're talking about here. But we spend all our energy worrying about the minor issue, and almost none on the broader problem of malicious code in general.

I'm pretty sure I could upload a library to PyPI that included os.system('rm -rf .') and nobody would blink an eye, but if I write:

    A = 1
    А = 2
    Α = 3
    print(A, А, Α)

everyone goes insane. Let's keep the threat in perspective. Writing an informational PEP for the education of people is a great idea. Rushing into making wholesale changes to the interpreter, not so much.

-- Steve
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?
The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues.
If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.
I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code.
Right, which is why I was NOT talking about confusables. I don't consider them to be a particularly Unicode-related threat, although the larger range of available characters does make it more plausible than in ASCII.

But I do see a problem with code where most editors misrepresent the code, where abuse of a purely ASCII character encoding for purely ASCII code can cause all kinds of tooling issues. THAT is a more viable attack vector, since code reviewers will be likely to assume that their syntax highlighting is correct.

And yes, I'm aware that Python can't be expected to cope with poor tools, but when *many* well-known tools have the same problem, one must wonder who should be solving the issue.

ChrisA
On 03. 11. 21 12:37, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?
The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues.
If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.
I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code.
Right, which is why I was NOT talking about confusables. I don't consider them to be a particularly Unicode-related threat, although the larger range of available characters does make it more plausible than in ASCII.
But I do see a problem with code where most editors misrepresent the code, where abuse of a purely ASCII character encoding for purely ASCII code can cause all kinds of tooling issues. THAT is a more viable attack vector, since code reviewers will be likely to assume that their syntax highlighting is correct.
And yes, I'm aware that Python can't be expected to cope with poor tools, but when *many* well-known tools have the same problem, one must wonder who should be solving the issue.
This is a very good point. Let's not point fingers, but figure out how to make users' lives easier together :)

This was the first time I was "in" on an embargoed "issue", and let me tell you, I was surprised by the amount of time spent on polishing the messaging. Now, you can't reasonably twist all this into a "Python is insecure" or "Company X products are insecure" headline, which is good, but with that out of the way we can focus on *what* could be improved, rather than on *where* the improvement could be made and who should do it.
Chris Angelico writes:
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
I think that's pointless. With few exceptions (GB18030; Big5, which has a couple of code point pairs that encode the same very rare characters; and the ISO 2022 extensions) you're not going to run into the confusables problem, and AFAIK the only generic BIDI solution is Unicode (the ISO 8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so that the "common names" would denote the maximal repertoires in each family (gb denotes gb18030, sjis denotes shift_jisx0213, etc.) as in the WHATWG recommendations for web browsers. But that's probably too backward incompatible to fly.
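For reference, a quick way to see where those common aliases point today (the exact names come from CPython's encodings.aliases table, so treat the output in the comments as an assumption that may vary between versions):

    import codecs

    # Where do the "common names" currently resolve?
    for alias in ("sjis", "gb2312", "big5"):
        print(alias, "->", codecs.lookup(alias).name)

    # Typically prints something like:
    #   sjis -> shift_jis   (not shift_jisx0213)
    #   gb2312 -> gb2312    (not gb18030)
    #   big5 -> big5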
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
I think that's pointless. With few exceptions (GB18030; Big5, which has a couple of code point pairs that encode the same very rare characters; and the ISO 2022 extensions) you're not going to run into the confusables problem, and AFAIK the only generic BIDI solution is Unicode (the ISO 8859 encodings of Hebrew and Arabic do not have direction markers).
What exactly are you thinking?
You'll never eliminate confusables (even ASCII has some, depending on font). But I was surprised to find that Python would let you use unicode_escape for source code.

# coding: unicode_escape
x = '''
Code example:
\u0027\u0027\u0027 # format in monospaced on the web site
print("Did you think this would be executed?")
\u0027\u0027\u0027 # end monospaced
Surprise!
'''
print("There are %d lines in x." % len(x.split(chr(10))))

With some carefully-crafted comments, a lot of human readers will ignore the magic tokens. It's not uncommon to put example code into triple-quoted strings, and it's also not all that surprising when simplified examples do things that you wouldn't normally want done (like monkeypatching other modules), since it's just an example, after all.

I don't have access to very many editors, but SciTE, VS Code, nano, and the GitHub gist display all syntax-highlighted this as if it were a single large string. Only Idle showed it as code in between, and that's because it actually decoded it using the declared character coding, so the magic lines showed up with actual apostrophes.

Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?

ChrisA
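In the meantime, a project that wants to defend against this particular trick today could refuse unusual coding cookies outright. A minimal sketch, assuming a project policy of "UTF-8 or ASCII only" (the allow-list and script are illustrative, not an existing tool):

    import sys
    import tokenize

    ALLOWED = {"utf-8", "ascii"}   # assumed project policy

    def check(path):
        with open(path, "rb") as f:
            # detect_encoding() honours the BOM and the "# coding: ..." cookie,
            # so a declaration like "# coding: unicode_escape" shows up here.
            encoding, _ = tokenize.detect_encoding(f.readline)
        if encoding.lower().replace("_", "-") not in ALLOWED:
            print(f"{path}: suspicious source encoding {encoding!r}")
            return False
        return True

    if __name__ == "__main__":
        results = [check(p) for p in sys.argv[1:]]
        sys.exit(0 if all(results) else 1)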
Chris Angelico writes:
But I was surprised to find that Python would let you use unicode_escape for source code.
I'm not surprised. Today it's probably not necessary, but I've exchanged a lot of code (not Python, though) with folks whose editors were limited to 8 bit codes or even just ASCII. It wasn't frequent that I needed to discuss non-ASCII code with them (that they needed to run) but it would have been painful to do without some form of codec that encoded Japanese using only ASCII bytes.
Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?
Not sure what you mean there. In the usual sense of ASCII-compatible (the ASCII bytes always mean the corresponding character in the ASCII encoding), I think there are at least two ASCII-incompatible encodings that would cause a lot of pain if they were prohibited, specifically Shift JIS and Big5. (In certain contexts in those encodings an ASCII byte frequently is a trailing byte in a multibyte character.)

I'm sure there is a ton of legacy Python code in those encodings in East Asia, some of which is still maintained in the original encoding. And of course UTF-16 is incompatible in that sense, although I don't know if anybody actually saves Python code in UTF-16.

It might make sense to prohibit unicode_escape nowadays -- I think almost all systems now can handle Unicode properly, but I don't think we can go farther than that.
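The trailing-byte problem is easy to demonstrate from Python itself. A small probe (a hypothetical helper written for illustration) that looks for multibyte sequences whose last byte falls in the ASCII range:

    def ascii_trailing_bytes(encoding, stop_after=5):
        """Find characters whose encoding ends in an ASCII byte (< 0x80)."""
        hits = []
        for cp in range(0x3000, 0x10000):      # a slice of the BMP is enough
            ch = chr(cp)
            try:
                data = ch.encode(encoding)
            except UnicodeEncodeError:
                continue
            if len(data) > 1 and data[-1] < 0x80:
                hits.append((ch, data))
                if len(hits) >= stop_after:
                    break
        return hits

    for ch, data in ascii_trailing_bytes("shift_jis"):
        print(ch, data)

    # One well-known case is U+8868 表, whose Shift JIS encoding ends in 0x5C,
    # the byte that is a backslash in ASCII.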
On Wed, Nov 3, 2021 at 8:01 PM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
But I was surprised to find that Python would let you use unicode_escape for source code.
I'm not surprised. Today it's probably not necessary, but I've exchanged a lot of code (not Python, though) with folks whose editors were limited to 8 bit codes or even just ASCII. It wasn't frequent that I needed to discuss non-ASCII code with them (that they needed to run) but it would have been painful to do without some form of codec that encoded Japanese using only ASCII bytes.
Bearing in mind that string literals can always have their own escapes, this feature is really only important to the source code tokens themselves.
Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?
Not sure what you mean there. In the usual sense of ASCII-compatible (the ASCII bytes always mean the corresponding character in the ASCII encoding), I think there are at least two ASCII-incompatible encodings that would cause a lot of pain if they were prohibited, specifically Shift JIS and Big5. (In certain contexts in those encodings an ASCII byte frequently is a trailing byte in a multibyte character.)
Ah, okay, so much for that, then. What about the weaker sense: Characters below 128 are always and only represented by those byte values? So if you find byte value 39, it might not actually be an apostrophe, but if you're looking for an apostrophe, you know for sure that it'll be represented by byte value 39?
It might make sense to prohibit unicode_escape nowadays -- I think almost all systems now can handle Unicode properly, but I don't think we can go farther than that.
Yes. I'm sure someone will come along and say "but I have to have an all-ASCII source file, directly runnable, with non-ASCII variable names", because XKCD 1172, but I don't have enough sympathy for that obscure situation to want the mess that unicode_escape can give. ChrisA
Chris Angelico writes:
Ah, okay, so much for that, then. What about the weaker sense: Characters below 128 are always and only represented by those byte values? So if you find byte value 39, it might not actually be an apostrophe, but if you're looking for an apostrophe, you know for sure that it'll be represented by byte value 39?
1. The apostrophe that Python considers a string delimiter is always represented by byte value 39 in the compiler input. So the only time that wouldn't be true is if escape sequences are allowed to represent characters. I believe unicode_escape is the only codec that does.

2. There's always eval which will accept a string containing escape sequences.
Yes. I'm sure someone will come along and say "but I have to have an all-ASCII source file, directly runnable, with non-ASCII variable names", because XKCD 1172, but I don't have enough sympathy for that obscure situation to want the mess that unicode_escape can give.
It's not an obscure situation to me. As I wrote earlier, been there, done that, made my own T-shirt. I don't *think* it matters today, but the number of DOS machines and Windows 98 machines left in Japan is not zero. Probably they can't run Python 3, but that's not something I can testify to.
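A tiny illustration of Stephen's second point above: even in a pure-ASCII file, eval() will happily turn escape sequences into the characters they name (purely illustrative):

    # The argument is eight ASCII characters; eval() parses it as a string
    # literal, and the \u0027 escape becomes an apostrophe.
    s = eval(r"'\u0027'")
    print(repr(s))    # "'"
    print(s == "'")   # True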
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?

-- 
Marc-Andre Lemburg
eGenix.com
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.

On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
There are quite a few other PEPs that have similar sorts of advice, like PEP 257 on docstrings, and several of the type hinting PEPs. IMO it's fine. ChrisA
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!

On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO. I went with an informational PEP because it's quicker to publish.
On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
03.11.21 12:36, Petr Viktorin wrote:
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO.
I went with an informational PEP because it's quicker to publish.
What is the supposed target audience of this document?

If it is core Python developers only, then a PEP is the right place to publish it. But I think that it rather describes potential issues in arbitrary Python projects, and as such, it will be more accessible as a part of the Python documentation (as a HOW-TO article perhaps).

AFAIK all other informational PEPs are about developing Python, not developing in Python (even if they are (mis)used (e.g. PEP 8) outside their scope).
On 03. 11. 21 12:33, Serhiy Storchaka wrote:
03.11.21 12:36, Petr Viktorin wrote:
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO.
I went with an informational PEP because it's quicker to publish.
What is the supposed target audience of this document?
Good question! At this point it looks like it's linter authors.
If it is core Python developers only, then a PEP is the right place to publish it. But I think that it rather describes potential issues in arbitrary Python projects, and as such, it will be more accessible as a part of the Python documentation (as a HOW-TO article perhaps). AFAIK all other informational PEPs are about developing Python, not developing in Python (even if they are (mis)used (e.g. PEP 8) outside their scope).
There's a bunch of packaging PEPs, or a PEP on what the /usr/bin/python command should be. I think PEP 672 is in good company for now.
On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
There is already a "Unicode HOWTO". We could add "Unicode problems and pitfalls".

-- Terry Jan Reedy
participants (13)
- Chris Angelico
- Chris Jerdonek
- David Mertz, Ph.D.
- Jim J. Jewett
- Kyle Stanley
- Marc-Andre Lemburg
- Paul Moore
- Petr Viktorin
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Toshio Kuratomi