pre-PEP: Unicode Security Considerations for Python
Hello,

Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.

I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.

Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
PEP: 9999
Title: Unicode Security Considerations for Python
Author: Petr Viktorin <encukou@gmail.com>
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 01-Nov-2021
Post-History:
Abstract
========

This document explains possible ways to misuse Unicode to write Python programs that appear to do something other than what they actually do.

This document does not give any recommendations or solutions.
Introduction
============
Python code is written in `Unicode`_ – a system for encoding and handling all kinds of written language. While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.
It is possible to misuse Python's Unicode-related features to write code that *appears* to do something other than what it actually does. Evildoers could take advantage of this to trick code reviewers into accepting malicious code.

The possible issues generally can't be solved in Python itself without excessive restrictions of the language. They should be solved in code editors and review tools (such as *diff* displays), by enforcing project-specific policies, and by raising the awareness of individual programmers.
This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind.
This document is specific to Python. For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
Acknowledgement
===============

Investigation for this document was prompted by [CVE-2021-42574]_, *Trojan Source Attacks*, reported by Nicholas Boucher and Ross Anderson, which focuses on bidirectional override characters in a variety of languages.
Confusing Features
==================
This section lists some Unicode-related features that can be surprising or misusable.
ASCII-only Considerations
-------------------------

ASCII is a subset of Unicode.
While issues with the ASCII character set are generally well understood, they're presented here to help in understanding the non-ASCII cases.
Confusables and Typos
'''''''''''''''''''''

Some characters look alike. Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l`` (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, the distinction between digits and letters is critical -- and most fonts designed for programmers make it easy to tell them apart.
Similarly, the uppercase “I” and lowercase “l” can look similar in fonts designed for human languages, but programmers' fonts make them noticeably different.
However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name ``accessibi1ity_options`` can still look indistinguishable from ``accessibility_options``, while they are distinct for the compiler.
The same can be said for plain typos: most humans will not notice the typo in ``responsbility_chain_delegate``.
Control Characters
''''''''''''''''''

Python generally considers ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF`` pairs (``\r\n``) as end-of-line characters. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or nothing at all), rather than ending the line, displaying this example::

   # Don't call this function:
   fire_the_missiles()
as a harmless comment like::
   # Don't call this function:⬛fire_the_missiles()
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
Some characters can be used to hide/overwrite other characters when source is listed in common terminals:
* BS (``\b``, Backspace) moves the cursor back, so the character after it will overwrite the character before.
* CR (``\r``, carriage return) moves the cursor to the start of line, so subsequent characters overwrite the start of the line (see the example below).
* DEL (``\x7F``) commonly initiates escape codes which allow arbitrary control of the terminal.
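For example, the effect of ``CR`` can be reproduced by printing it directly; in a common terminal, only the text after the ``\r`` stays visible, so the following line appears to output nothing but a harmless comment (the exact behaviour varies by terminal)::

   print("fire_the_missiles()\r# this is a harmless comment ")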
Confusable Characters in Identifiers
------------------------------------

Python allows characters from all scripts – from Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as variable names). See :pep:`3131` for details and rationale. Only “letters and numbers” are allowed (see `Identifiers and keywords`_ for details), so while ``γάτα`` is a valid Python identifier, ``🐱`` is not. Non-printing control characters are also not allowed.
However, within the allowed set there is a large number of “confusables”. For example, the uppercase versions of the Latin `b`, Greek `β` (Beta), and Cyrillic `в` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
This allows identifiers that look the same to humans, but not to Python. For example, all of the following are distinct identifiers:
* ``scope`` (Latin, ASCII-only)
* ``scоpe`` (with a Cyrillic `о`)
* ``scοpe`` (with a Greek `ο`)
* ``ѕсоре`` (all Cyrillic letters)
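A short demonstration, using the identifiers listed above (the assigned numbers are just arbitrary markers)::

   scope = 1   # Latin, ASCII-only
   scоpe = 2   # with a Cyrillic `о`
   scοpe = 3   # with a Greek `ο`
   ѕсоре = 4   # all Cyrillic letters

   print(scope, scоpe, scοpe, ѕсоре)   # prints: 1 2 3 4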
Additionally, some letters can look like non-letters:
* The letter for the Hawaiian *ʻokina* looks like an apostrophe; ``ʻHelloʻ`` is a Python identifier, not a string.
* The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement.
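Both of these can be tried directly; the following is a complete, runnable snippet (the assigned values are arbitrary)::

   ʻHelloʻ = "this is a variable, not a string"
   十 = 10
   print(ʻHelloʻ, 十)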
.. note::
   The converse also applies – some symbols look like letters – but since Python does not allow arbitrary symbols in identifiers, this is not an issue.
Confusable Digits
-----------------
Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such as ``.`` or ``e``).
However, when numbers are converted from strings, such as in the ``int`` and ``float`` constructors or by the ``str.format`` method, any decimal digit can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫`` (``TAMIL DIGIT FIVE``) work as the digit ``5``.
Some scripts include digits that look similar to ASCII ones, but have a different value. For example::
   >>> int('৪୨')
   42
   >>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
   'five'
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left. Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
* In the statement ``قيمة = ערך``, the variable ``قيمة`` is set to the value of ``ערך``.
* In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and then subtracted from ``قيمة``. The *opening* parenthesis is displayed as ``)``.
* In the following, the second line is the same as the first, except ``A`` is replaced by the Hebrew ``א``. Both assign a 100-character string to the variable ``s``. Note how the symbols and numbers between the Hebrew characters are shown in reverse order::
s = "A" * 100 # "A" is assigned
s = "א" * 100 # "א" is assigned
Bidirectional Marks, Embeddings, Overrides and Isolates
-------------------------------------------------------
The rules for determining the direction of text do not always yield the intended results, so Unicode provides several ways to alter it.
The most basic are **directional marks**, which are invisible but affect text as a left-to-right (or right-to-left) character would. Continuing the example above, in the next example the ``A``/``א`` is replaced by the Latin ``x`` followed or preceded by a right-to-left mark (``U+200F``). This assigns a 200-character string to ``s`` (100 copies of `x` interspersed with 100 invisible marks)::
s = "x" * 100 # "x" is assigned
The directional **embedding**, **override** and **isolate** characters are also invisible, but affect the ordering of all text after them until either ended by a dedicated character, or until the end of line. (Unicode specifies the effect to last until the end of a “paragraph” (see [tr9]_), but allows tools to interpret newline characters as paragraph ends (see [u5.8]_). Most code editors and terminals do so.)
These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python's comments always extend to the end of a line), but it doesn't render them harmless.
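Note that even when the displayed order changes, Python itself still stores and processes the characters in logical order; for example::

   >>> s = "a" + "\N{RIGHT-TO-LEFT OVERRIDE}" + "bc"
   >>> len(s)
   4
   >>> print(ascii(s))
   'a\u202ebc'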
Normalizing identifiers
-----------------------
Python strings are collections of *Unicode codepoints*, not “characters”.
For reasons like compatibility with earlier encodings, Unicode often has several ways to encode what is essentially a single “character”. For example, these are all different ways of writing ``Å`` as a Python string, each of which is unequal to the others:
* ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint) * ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints) * ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
For another example, the ligature ``ﬁ`` has a dedicated Unicode codepoint, even though it has the same meaning as the two letters ``fi``.
Also, common letters frequently have several distinct variations. Unicode provides them for contexts where the difference has some semantic meaning, like mathematics. For example, some variations of ``n`` are:
* ``n`` (LATIN SMALL LETTER N)
* ``𝐧`` (MATHEMATICAL BOLD SMALL N)
* ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
* ``ｎ`` (FULLWIDTH LATIN SMALL LETTER N)
* ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
Unicode includes algorithms to *normalize* variants like these to a single form, and Python identifiers are normalized. (There are several normal forms; Python uses ``NFKC``.)
For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
   >>> xⁿ = 8
   >>> xn
   8
… as are ``ﬁ`` and ``fi``, and as are the different ways to encode ``Å``.
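The same normalization can be applied to strings explicitly using the ``unicodedata`` module, which makes the relationship between the variants visible::

   >>> import unicodedata
   >>> a1 = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"
   >>> a2 = "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"
   >>> a3 = "\N{ANGSTROM SIGN}"
   >>> a1 == a2, a1 == a3
   (False, False)
   >>> {unicodedata.normalize("NFKC", s) for s in (a1, a2, a3)}
   {'Å'}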
This normalization applies *only* to identifiers, however. Functions that treat strings as identifiers, such as ``getattr``, do not perform normalization::
   >>> class Test:
   ...     def ﬁnalize(self):
   ...         print('OK')
   ...
   >>> Test().ﬁnalize()
   OK
   >>> Test().finalize()
   OK
   >>> getattr(Test(), 'ﬁnalize')
   Traceback (most recent call last):
     ...
   AttributeError: 'Test' object has no attribute 'ﬁnalize'
This also applies when importing:
* ``import ﬁnalization`` performs normalization, and looks for a file named ``finalization.py`` (and other ``finalization.*`` files).
* ``importlib.import_module("ﬁnalization")`` does not normalize, so it looks for a file named ``ﬁnalization.py``.
Some filesystems independently apply normalization and/or case folding. On some systems, ``ﬁnalization.py``, ``finalization.py`` and ``FINALIZATION.py`` are three distinct filenames; on others, some or all of these can name the same file.
Source Encoding
---------------
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
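For instance, the standard library's ``tokenize.detect_encoding`` reads the declaration in essentially the same way, which makes it easy to check what encoding Python will use for a given file (a quick illustration, not part of the mechanism itself)::

   >>> import io, tokenize
   >>> source = b"# coding: unicode_escape\nprint('hi')\n"
   >>> tokenize.detect_encoding(io.BytesIO(source).readline)[0]
   'unicode_escape'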
This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_). For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
   # For writing Japanese, you don't need an editor that supports
   # UTF-8 source encoding: unicode_escape sequences work just as well.

   import os

   message = '''
   This is "Hello World" in Japanese:
   \u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c

   This runs `echo WHOA` in your shell:
   \u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
   \u0073\u0079\u0073\u0074\u0065\u006d\u0028
   \u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
   \u0029\u0029\u002c\u0027\u0027\u0027
   '''
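Here, the ``unicode_escape`` codec turns each ``\uXXXX`` sequence into the corresponding character before the code is parsed; decoding a couple of the sequences directly shows what they become::

   >>> b"\\u0027".decode("unicode_escape")
   "'"
   >>> b"\\u002c\\u0028\\u0029".decode("unicode_escape")
   ',()'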
Open Issues
===========
We should probably write and publish:
* Recommendations for Text Editors and Code Tools
* Recommendations for Programmers and Teams
* Possible Improvements in Python
References
==========

.. _Unicode: https://home.unicode.org/
.. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarati...
.. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
.. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
.. [u5.8] http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
.. [tr9] http://www.unicode.org/reports/tr9/
.. [tr36] Unicode Technical Report #36: Unicode Security Considerations
   http://www.unicode.org/reports/tr36/
.. [tr39] Unicode® Technical Standard #39: Unicode Security Mechanisms
   http://www.unicode.org/reports/tr39/
.. [CVE-2021-42574] CVE-2021-42574
   https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
Thanks for writing this Petr! A few comments below. On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote:
ASCII-only Considerations -------------------------
ASCII is a subset of Unicode
While issues with the ASCII character set are generally well understood, they're presented here to help in understanding the non-ASCII cases.
You should mention that some very common typefaces (fonts) are more confusable than others. For instance, Arial (a common font on Windows systems) makes the two letter combination 'rn' virtually indistinguishable from the single letter 'm'.
Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O. https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n... The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961). http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit... Perhaps you should say "many older mechanical typewriters"?
Bidirectional Text ------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right. You might like to note that not all applications support bidirectional text.
Unicode includes alorithms to *normalize* variants like these to a single form, and Python identifiers are normalized.
Typo: "algorithms". This is a good and useful document, thank you again. -- Steve
This is excellent!

01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it. It only happens when you read the Python script from a file. If you import it as a module or run with runpy, the NUL character is an error.
Some characters can be used to hide/overwrite other characters when source is listed in common terminals:
* BS (``\b``, Backspace) moves the cursor back, so the character after it will overwrite the character before.
* CR (``\r``, carriage return) moves the cursor to the start of line, subsequent characters overwrite the start of the line.
* DEL (``\x7F``) commonly initiates escape codes which allow arbitrary control of the terminal.
ESC (``\x1B``) starts many control sequences. ``\x1A`` means the end of the text file on Windows. Some programs (for example "type") ignore the rest of the file.
On 01. 11. 21 18:32, Serhiy Storchaka wrote:
This is excellent!
01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it. It only happens when you read the Python script from a file. If you import it as a module or run with runpy, the NUL character is an error.
That brings us to possible changes in Python in this area, which is an interesting topic.

As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.

For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in Hebrew/Arabic? AFAIK, we should allow a few non-printing "joiner"/"non-joiner" characters to make it possible to use all Arabic words. But it would be great to consult with users/teachers of the languages. Should Python run the bidi algorithm when parsing and disallow reordered tokens? Maybe optionally?
02.11.21 16:16, Petr Viktorin wrote:
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them. There is a precedent of emitting warnings for some superfluous escapes in strings.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin. It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
Serhiy Storchaka wrote:
02.11.21 16:16, Petr Viktorin wrote:
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them.
If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin.
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _). Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.
It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell. -jJ
Jim J. Jewett writes:
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
This would ban "παν語", aka "pango". That's arguably a good idea (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.
Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.
Interesting. I maintained a couple of Emacs libraries (dictionaries and input methods) for Japanese in XEmacs, and while hyphen-separated mixtures of ASCII and Japanese are common, I don't recall ever seeing an identifier with ASCII and Japanese glommed together without a separator. It was almost always of the form "English verb - Japanese lexical component". Or do you consider that "relatively complicated"?
It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.
It would have to be easy to turn off, perhaps even provide instructions in the messages. I would guess that for code that uses it at all, it would be common. So the warnings would likely make those tools somewhere between really annoying and unusable.
Stephen J. Turnbull wrote:
Jim J. Jewett writes:
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
This would ban "παν語", aka "pango". That's arguably a good idea (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.
I am not quite motivated enough to search the archives, but I'm pretty sure the examples actually found were less prominent than that. There seemed to be at least one or two fora where it was something of a local idiom.
... I don't recall ever seeing an identifier with ASCII and Japanese glommed together without a separator. It was almost always of the form "English verb - Japanese lexical component".
The problem was that some were written without a "-" or "_" to separate the halves. It looked fine -- the script change was obvious to even someone who didn't speak the non-English language. But having to support that meant any remaining restriction on mixed scripts would be either too weak to be worthwhile, or too complicated to write into the python language specification. -jJ
02.11.21 18:49, Jim J. Jewett wrote:
If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.
If you mean backslash-escaped sequences like \uXXXX, there is no reason to ban them in comments. Unlike in Java, they do not have special meaning outside of string literals. But if you mean terminal control sequences (which change color or move the cursor), they should not be allowed in comments.
At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).
I implemented these restrictions in one of my projects. The character set was limited, and even this did not solve all issues with homoglyphs. I think that we should not introduce such arbitrary limitations at the parser level and should leave this to linters.
Serhiy Storchaka writes:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too.
+1
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic.
+1 (for the comment and for the implied -1 on SyntaxWarning, let's keep the Cyrillic repertoire in Python!)
It is a work for linters,
+1

Aside from the reasons Serhiy presents, I'd rather not tie this kind of rather ambiguous improvement in Unicode handling to the release cycle.

It might be worth having a pep9999 module/script in Python (perhaps more likely, PyPI but maintained by whoever does the work to make these improvements + Petr or somebody Petr trusts to do it), that lints scripts specifically for confusables and other issues.

Steve
We seem to agree that this is work for linters. That's reasonable; I'd generalize it to "tools and policies". But even so, discussing what we'd expect linters to do is on topic here. Perhaps we can even find ways for the language to support linters -- type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it finds a non-ASCII identifier? (See below for why ASCII is special.) Or for encoding declarations?

On 03. 11. 21 6:26, Stephen J. Turnbull wrote:
Serhiy Storchaka writes:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too.
+1
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
It would virtually ban Cyrillic.
+1 (for the comment and for the implied -1 on SyntaxWarning, let's keep the Cyrillic repertoire in Python!)
I don't think this would actually ban Cyrillic/Greek. (My suggestion is not vanilla confusables detection; it might require careful reading: "should there be a [linter] warning when an identifier looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual ASCII-like word in a language that uses Cyrillic. I didn't succeed; I think they might be very rare. Even if there were such a word -- or a one-letter abbreviation used as a variable name -- it would be confusing to use. Removing the possibility of confusion could *help* Cyrillic users. (I can't speak for them; this is just a brainstorming idea.)

Steven adds:
Let's not enshrine as a language "feature" that non Western European languages are dangerous second-class citizens.
That would be going too far, yes, but the fact is that non-English languages *are* second-class citizens. Code that uses Python keywords and stdlib must use English, and possibly another language. It is the mixing of languages that can be dangerous/confusing, not the languages themselves.
It is a work for linters,
+1
Aside from the reasons Serhiy presents, I'd rather not tie this kind of rather ambiguous improvement in Unicode handling to the release cycle.
It might be worth having a pep9999 module/script in Python (perhaps more likely, PyPI but maintained by whoever does the work to make these improvements + Petr or somebody Petr trusts to do it), that lints scripts specifically for confusables and other issues.
If I have any say in it, the name definitely won't include a PEP number ;)
03.11.21 14:31, Petr Viktorin wrote:
For example: should the parser emit a lightweight audit event if it finds a non-ASCII identifier? (See below for why ASCII is special.) Or for encoding declarations?
There are audit events for import and compile. You can also register import hooks if you want fancier preprocessing than just Unicode decoding. I do not think we need to add more specific audit events; they were not designed for this.

And I think it is too late to detect suspicious code at the time of its execution. It should be detected before adding that code to the code base (review tools, pre-commit hooks).
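For illustration, a hook on the existing "compile" audit event can already be used to flag non-ASCII source when it is compiled (a rough sketch; the policy of just printing a note is made up for this example):

    import sys

    def flag_non_ascii(event, args):
        # The "compile" audit event carries (source, filename).
        if event == "compile":
            source, filename = args[0], args[1]
            if isinstance(source, str) and not source.isascii():
                print("note: compiling non-ASCII source:", filename, file=sys.stderr)

    sys.addaudithook(flag_non_ascii)
    compile("π = 3.14159", "<example>", "exec")   # triggers the note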
I don't think this would actually ban Cyrillic/Greek. (My suggestion is not vanilla confusables detection; it might require careful reading: "should there be a [linter] warning when an identifier looks like ASCII but isn't?")
Yes, but it should be optional and configurable and not be the part of the Python compiler. This is not our business as Python core developers.
I am not a native speaker, but I did try a bit to find an actual ASCII-like word in a language that uses Cyrillic. I didn't succeed; I think they might be very rare.
With a simple script I have found 62 words common between English and Ukrainian: гасу/racy, горе/rope, рима/puma, міх/mix, etc. But there are many more English and Ukrainian words which contain only letters that can be confused with letters from the other script. And identifiers can contain abbreviations and shortenings; not all of them can be found in dictionaries.
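For the record, a rough sketch of that kind of script (the homoglyph table and word lists below are tiny hand-picked samples, not real dictionaries):

    # Map Cyrillic letters to Latin letters they can be confused with.
    CYRILLIC_TO_LATIN = str.maketrans({
        "а": "a", "с": "c", "е": "e", "і": "i", "о": "o",
        "р": "p", "у": "y", "г": "r", "м": "m", "х": "x",
    })

    english_words = {"racy", "rope", "mix"}      # stand-in for a real word list
    ukrainian_words = {"гасу", "горе", "міх"}    # stand-in for a real word list

    for word in ukrainian_words:
        lookalike = word.translate(CYRILLIC_TO_LATIN)
        if lookalike in english_words:
            print(f"{word} can be confused with {lookalike}")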
Even if there was such a word -- or a one-letter abbreviation used as a variable name -- it would be confusing to use. Removing the possibility of confusion could *help* Cyrillic users. (I can't speak for them; this is just a brainstorming idea.)
I never used non-Latin identifiers in Python, but I guess that where they are used (in schools?) there is a mix of English and non-English identifiers, and identifiers consisting of parts of English and non-English words without even an underscore between them. I know, because in other languages people just use inconsistent transliteration. Emitting any warning by default is a discrimination against non-English users. It would be better not to add support for non-ASCII identifiers in the first place.
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human-readable; there is no reason to include control characters in them. There is a precedent of emitting warnings for some superfluous escapes in strings.
Agreed. I don't think there is any good reason for including control characters (apart from whitespace) in comments. In strings, I would consider allowing VT (vertical tab) as well, since that is whitespace:

    >>> '\v'.isspace()
    True
But I don't have a strong opinion on that. [Petr]
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
Let's not enshrine as a language "feature" that non Western European languages are dangerous second-class citizens.
It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin.
Agreed.
It is a work for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
Linters and editors. I have no objection to people using editors that highlight non-ASCII characters in blinking red letters, so long as I can turn that option off :-) -- Steve
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin <encukou@gmail.com> wrote:
That brings us to possible changes in Python in this area, which is an interesting topic.
Is there a use case or need for allowing the comment-starting character “#” to occur when text is still in the right-to-left direction? Disallowing that would prevent Petr’s examples in which active code is displayed after the comment mark, which to me seems to be one of the more egregious examples. Or maybe this case is no worse than others and isn’t worth singling out. —Chris
As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.
For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?
For right-to-left text: does anyone actually name identifiers in Hebrew/Arabic? AFAIK, we should allow a few non-printing "joiner"/"non-joiner" characters to make it possible to use all Arabic words. But it would be great to consult with users/teachers of the languages. Should Python run the bidi algorithm when parsing and disallow reordered tokens? Maybe optionally?
Serhiy Storchaka writes:
This is excellent!
01.11.21 14:17, Petr Viktorin wrote:
CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
It is an implementation detail and we will get rid of it.
You can't, probably not for a decade, because people will be running versions of Python released before you change it. I hope this PEP will address Python as it is, as well as how it will be.
It only happens when you read the Python script from a file.
Which is one of the likely vectors for malware. It might be worth teaching virus checkers about this, for example.
This is an excellent enumeration of some of the concerns! One minor comment about the introductory material: On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin <encukou@gmail.com> wrote:
Introduction ============
Python code is written in `Unicode`_ – a system for encoding and handling all kinds of written language.
Unicode specifies the mapping of glyphs to code points. Then a second mapping from code points to sequences of bytes is what is actually recorded by the computer. The second mapping is what programmers using Python will commonly think of as the encoding, while the majority of what you're writing about has more to do with the first mapping.

I'd try to word this in a way that doesn't lead a reader to conflate those two mappings. Maybe something like this?

    `Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters.
While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.
-Toshio
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote:
Unicode specifies the mapping of glyphs to code points. Then a second mapping from code points to sequences of bytes is what is actually recorded by the computer. The second mapping is what programmers using Python will commonly think of as the encoding while the majority of what you're writing about has more to do with the first mapping.
I don't think that is correct. According to the Unicode consortium -- and I hope that they would know *wink* -- Unicode is the universal character encoding. In other words: "Unicode provides a unique number for every character" https://www.unicode.org/standard/WhatIsUnicode.html

Not glyphs. ("Character" in natural language is a bit of a fuzzy concept, so I think that Unicode here is referring to what their glossary calls an abstract character.) The usual meaning of glyph is for the graphical images used by fonts (typefaces) for display. Sense 2 in the Unicode glossary here: https://www.unicode.org/glossary/#glyph

I'm not really sure what they mean by sense 1, unless they mean a representative glyph, which is intended to stand in as an example of the entire range of glyphs. Unicode does not specify what the glyphs for code points are, although it does provide representative samples. See, for example, their comment on emoji: "The Unicode Consortium provides character code charts that show a representative glyph" http://www.unicode.org/faq/emoji_dingbats.html

Their code point charts likewise show representative glyphs for other letters and symbols, not authoritative. And of course, many abstract characters do not have glyphs at all, e.g. invisible joiners, control characters, variation selectors, noncharacters, etc.

The mapping from bytes to code points and abstract characters is also part of Unicode. The UTF encodings are part of Unicode: https://www.unicode.org/faq/utf_bom.html#gen2 The "U" in UTF literally stands for Unicode :-)

-- Steve
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement." Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own. -jJ
On 11/1/2021 8:17 AM, Petr Viktorin wrote:
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Very helpful.
Bidirectional Text ------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.] There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes. In extended handling, phrases ...
Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
In local handling, one sees <hebrew-rtl> = 23. In extended handling, one sees 23 = <hebrew-rtl>. (Notepad++ sees backticks as quotes.)
Source Encoding ---------------
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_).
Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something?
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports # UTF-8 source encoding: unicode_escape sequences work just as well. [etc]
-- Terry Jan Reedy
On 01. 11. 21 13:17, Petr Viktorin wrote:
Hello, Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (bidirectional text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language.
I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo.
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129. A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst

Toshio Kuratomi wrote:
`Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters.
Thanks! That's a nice summary; I condensed it a bit more and used it. (I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".) Jim J. Jewett wrote:
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."
Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.
I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)

Patrick Schultz wrote:
- The Unicode consortium has a list of confusables, in case useful
Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list. But thank you! Terry Reedy wrote:
Bidirectional Text ------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handing. Windows Notepad++ does local handling of unquoted code but extending handling of quoted text. Windows Notepad currently does extended handling even without quotes.
I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
# For writing Japanese, you don't need an editor that supports # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Steven D'Aprano wrote:
Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``
I'm not sure that "most" is justifed here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O.
https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-n...
The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best selling typewriter in history, the IBM Selectric (introduced in 1961).
http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewrit...
Perhaps you should say "many older mechanical typewriters"?
Ah, interesting! I only ever saw and read about ones that have a bunch of accented letters, leaving no space for dedicated 0/1 keys :) My typewriter looks like this: https://imgur.com/a/J34gqVZ
Bidirectional Text ------------------
The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right.
You might like to note that not all applications support bidirectional text.
It might be handled by your terminal rather than mutt. I made the text work even if the examples don't render the way I'd like.
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings? ChrisA
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding? Personally, I think that using obscure encodings as the source encoding is one of those "linters and code reviews should check it" issues. Besides, now that I've learned about this unicode_escape encoding, I think that's going to be *awesome* for winning obfuscated Python competitions! *wink* -- Steve
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding?
Only source code encodings. Obviously we still need to be able to cope with all manner of *data*, but Python source code shouldn't need to be in bizarre, weird encodings. (Honestly, I'd love to just require that Python source code be UTF-8, but that would probably cause problems, so mandating that it be one of a small set of encodings would be a safer option.)
Personally, I think that using obscure encodings as the source encoding is one of those "linters and code reviews should check it" issues.
Besides, now that I've learned about this unicode_escape encoding, I think that's going to be *awesome* for winning obfuscated Python competitions! *wink*
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood. But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance. ChrisA
Chris Angelico wrote:
I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
If it becomes important, the cheeseshop backend can run various validations (including a linter) on submissions, and include those results in the display template.
On 03.11.2021 01:21, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <encukou@gmail.com> wrote:
Let me know if it's clear in the newest version, with this note:
Here, ``encoding: unicode_escape`` in the initial comment is an encoding declaration. The ``unicode_escape`` encoding instructs Python to treat ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as a comma (punctuator), etc.
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding?
Only source code encodings. Obviously we still need to be able to cope with all manner of *data*, but Python source code shouldn't need to be in bizarre, weird encodings.
(Honestly, I'd love to just require that Python source code be UTF-8, but that would probably cause problems, so mandating that it be one of a small set of encodings would be a safer option.)
Most Python code will be written in UTF-8 going forward, but there's still a lot of code out there in other encodings. Limiting this to some reduced set doesn't really make sense, since it's not clear where to draw the line.

Coming back to the thread topic, many of the Unicode security considerations don't apply to non-Unicode encodings, since those usually don't support e.g. changing the bidi direction within a stream of text or other interesting features you have in Unicode such as combining code points, invisible (space) code points, font rendering hint code points, etc. So in a sense, those non-Unicode encodings are safer than using UTF-8 :-)

Please also note that most character lookalikes are not encoding issues, but instead font issues, which then result in the characters looking similar. There are fonts which are designed to avoid this and it's no surprise that source code fonts typically do make e.g. 0 and O, as well as 1 and l look sufficiently different to be able to notice the difference.

Things get a lot harder when dealing with combining characters, since it's not always easy to spot the added diacritics, e.g. try this:
    >>> print ('a\u0348bc')   # strong articulation
    a͈bc
    >>> print ('a\u034Fbc')   # combining grapheme joiner
    a͏bc
The latter is only "visible" in the unicode_escape encoding:
    >>> print ('a\u034Fbc'.encode('unicode_escape'))
    b'a\\u034fbc'
Projects wanting to limit code encoding settings, disallow using bidi markers and other special code points in source code, can easily do this via e.g. pre-commit hooks, special editor settings, code linters or security scanners.

I don't think limiting the source code encoding is the right approach to making code more secure. Instead, tooling has to be used to detect potentially malicious code points in code.
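As a very bare-bones sketch of such a check (e.g. the core of a pre-commit hook), one could scan files for the bidirectional control code points; the set below is a small illustrative sample rather than a complete or recommended policy:

    import sys
    import unicodedata

    BIDI_CONTROLS = {
        "\N{LEFT-TO-RIGHT EMBEDDING}", "\N{RIGHT-TO-LEFT EMBEDDING}",
        "\N{LEFT-TO-RIGHT OVERRIDE}", "\N{RIGHT-TO-LEFT OVERRIDE}",
        "\N{LEFT-TO-RIGHT ISOLATE}", "\N{RIGHT-TO-LEFT ISOLATE}",
        "\N{FIRST STRONG ISOLATE}",
        "\N{POP DIRECTIONAL FORMATTING}", "\N{POP DIRECTIONAL ISOLATE}",
    }

    def check_file(path):
        problems = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                for ch in line:
                    if ch in BIDI_CONTROLS:
                        problems.append(f"{path}:{lineno}: {unicodedata.name(ch)}")
        return problems

    if __name__ == "__main__":
        issues = [msg for path in sys.argv[1:] for msg in check_file(path)]
        for msg in issues:
            print(msg)
        sys.exit(1 if issues else 0)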
-- Marc-Andre Lemburg
On Wed, 3 Nov 2021 at 10:11, Marc-Andre Lemburg <mal@egenix.com> wrote:
I don't think limiting the source code encoding is the right approach to making code more secure. Instead, tooling has to be used to detect potentially malicious code points in code.
+1

Discussing "making code more secure" without being clear on what the threat model is, is always going to be inconclusive. In this case, I believe the threat model is "an untrusted 3rd party submitting a PR which potentially contains malicious code to a Python project". For that threat, I think the correct approach is for core Python to promote awareness (via this PEP and maybe something in the docs themselves) and for projects to implement appropriate code checks that are run against all PRs to flag this sort of issue.

What threat can't be addressed at a per-project level, but *can* be addressed in core Python (without triggering so many false positives that people are trained to ignore the warnings or work around the prohibitions, defeating the purpose of the change)?

Paul
On Wed, Nov 03, 2021 at 11:11:00AM +0100, Marc-Andre Lemburg wrote:
Coming back to the thread topic, many of the Unicode security considerations don't apply to non-Unicode encodings, since those usually don't support e.g. changing the bidi direction within a stream of text or other interesting features you have in Unicode such as combining code points, invisible (space) code points, font rendering hint code points, etc.
So in a sense, those non-Unicode encodings are safer than using UTF-8 :-)
Thank you MAL for that timely reminder that most encodings are not Unicode. I have to admit that I often forget that there is a whole universe of non-Unicode, non-ASCII encodings.
Please also note that most character lookalikes are not encoding issues, but instead font issues, which then result in the characters looking similar.
+1 -- Steve
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?

The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues. If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.

I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code. https://en.wikipedia.org/wiki/Underhanded_C_Contest Is it unfamiliarity? Worse? "Real programmers write identifiers in English."

And the ironic thing is, while it is very difficult indeed for automated checkers to detect underhanded code in ASCII, it is trivially easy for editors, linters and other tools to spot the sort of Unicode confusables we're talking about here. But we spend all our energy worrying about the minor issue, and almost none on the broader problem of malicious code in general.

I'm pretty sure I could upload a library to PyPI that included os.system('rm -rf .') and nobody would blink an eye, but if I write:

    A = 1
    А = 2
    Α = 3
    print(A, А, Α)

everyone goes insane. Let's keep the threat in perspective. Writing an informational PEP for the education of people is a great idea. Rushing into making wholesale changes to the interpreter, not so much.

-- Steve
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?
The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues.
If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.
I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code.
Right, which is why I was NOT talking about confusables. I don't consider them to be a particularly Unicode-related threat, although the larger range of available characters does make it more plausible than in ASCII.

But I do see a problem with code where most editors misrepresent the code, where abuse of a purely ASCII character encoding for purely ASCII code can cause all kinds of tooling issues. THAT is a more viable attack vector, since code reviewers will be likely to assume that their syntax highlighting is correct.

And yes, I'm aware that Python can't be expected to cope with poor tools, but when *many* well-known tools have the same problem, one must wonder who should be solving the issue.

ChrisA
On 03. 11. 21 12:37, Chris Angelico wrote:
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.
It's not like Unicode is the only way to write obfuscated code, malicious or otherwise.
But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
Do we require that PyPI prevents people from publishing code that causes confusion by its poorly written code and obfuscated and confusing identifiers?
The linter is to *flag the issue* during, say, code review or before running the code, like other code quality issues.
If you're just running random code you downloaded from the internet using pip, then Unicode confusables are the least of your worries.
I'm not really sure why people get so uptight about Unicode confusables, while being blasé about the opportunities to smuggle malicious code into pure ASCII code.
Right, which is why I was NOT talking about confusables. I don't consider them to be a particularly Unicode-related threat, although the larger range of available characters does make it more plausible than in ASCII.
But I do see a problem with code where most editors misrepresent the code, where abuse of a purely ASCII character encoding for purely ASCII code can cause all kinds of tooling issues. THAT is a more viable attack vector, since code reviewers will be likely to assume that their syntax highlighting is correct.
And yes, I'm aware that Python can't be expected to cope with poor tools, but when *many* well-known tools have the same problem, one must wonder who should be solving the issue.
This is a very good point. Let's not point fingers, but figure out how to make users' lives easier together :)

This was the first time I was "in" on an embargoed "issue", and let me tell you, I was surprised by the amount of time spent on polishing the messaging. Now, you can't reasonably twist all this into a "Python is insecure" or "Company X products are insecure" headline, which is good, but with that out of the way we can focus on *what* could be improved, rather than on *where* the improvement could be made and who should do it.
Chris Angelico writes:
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
I think that's pointless. With few exceptions (GB18030; Big5, which has a couple of code point pairs that encode the same very rare characters; and the ISO 2022 extensions) you're not going to run into the confusables problem, and AFAIK the only generic BIDI solution is Unicode (the ISO 8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so that the "common names" would denote the maximal repertoires in each family (gb denotes gb18030, sjis denotes shift_jisx0213, etc.) as in the WHATWG recommendations for web browsers. But that's probably too backward incompatible to fly.
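For reference, a quick way to see where those common aliases point today (the exact names come from CPython's encodings.aliases table, so treat the output in the comments as an assumption that may vary between versions):

    import codecs

    # Where do the "common names" currently resolve?
    for alias in ("sjis", "gb2312", "big5"):
        print(alias, "->", codecs.lookup(alias).name)

    # Typically prints something like:
    #   sjis -> shift_jis   (not shift_jisx0213)
    #   gb2312 -> gb2312    (not gb18030)
    #   big5 -> big5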
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?
I think that's pointless. With few exceptions (GB18030; Big5, which has a couple of code point pairs that encode the same very rare characters; and the ISO 2022 extensions) you're not going to run into the confusables problem, and AFAIK the only generic BIDI solution is Unicode (the ISO 8859 encodings of Hebrew and Arabic do not have direction markers).
What exactly are you thinking?
You'll never eliminate confusables (even ASCII has some, depending on font). But I was surprised to find that Python would let you use unicode_escape for source code.

# coding: unicode_escape
x = '''
Code example:
\u0027\u0027\u0027 # format in monospaced on the web site
print("Did you think this would be executed?")
\u0027\u0027\u0027 # end monospaced
Surprise!
'''
print("There are %d lines in x." % len(x.split(chr(10))))

With some carefully-crafted comments, a lot of human readers will ignore the magic tokens. It's not uncommon to put example code into triple-quoted strings, and it's also not all that surprising when simplified examples do things that you wouldn't normally want done (like monkeypatching other modules), since it's just an example, after all.

I don't have access to very many editors, but SciTE, VS Code, nano, and the GitHub gist display all syntax-highlighted this as if it were a single large string. Only Idle showed it as code in between, and that's because it actually decoded it using the declared character coding, so the magic lines showed up with actual apostrophes.

Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?

ChrisA
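In the meantime, a project that wants to defend against this particular trick today could refuse unusual coding cookies outright. A minimal sketch, assuming a project policy of "UTF-8 or ASCII only" (the allow-list and script are illustrative, not an existing tool):

    import sys
    import tokenize

    ALLOWED = {"utf-8", "ascii"}   # assumed project policy

    def check(path):
        with open(path, "rb") as f:
            # detect_encoding() honours the BOM and the "# coding: ..." cookie,
            # so a declaration like "# coding: unicode_escape" shows up here.
            encoding, _ = tokenize.detect_encoding(f.readline)
        if encoding.lower().replace("_", "-") not in ALLOWED:
            print(f"{path}: suspicious source encoding {encoding!r}")
            return False
        return True

    if __name__ == "__main__":
        results = [check(p) for p in sys.argv[1:]]
        sys.exit(0 if all(results) else 1)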
Chris Angelico writes:
But I was surprised to find that Python would let you use unicode_escape for source code.
I'm not surprised. Today it's probably not necessary, but I've exchanged a lot of code (not Python, though) with folks whose editors were limited to 8 bit codes or even just ASCII. It wasn't frequent that I needed to discuss non-ASCII code with them (that they needed to run) but it would have been painful to do without some form of codec that encoded Japanese using only ASCII bytes.
Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?
Not sure what you mean there. In the usual sense of ASCII-compatible (the ASCII bytes always mean the corresponding character in the ASCII encoding), I think there are at least two ASCII-incompatible encodings that would cause a lot of pain if they were prohibited, specifically Shift JIS and Big5. (In certain contexts in those encodings an ASCII byte frequently is a trailing byte in a multibyte character.)

I'm sure there is a ton of legacy Python code in those encodings in East Asia, some of which is still maintained in the original encoding. And of course UTF-16 is incompatible in that sense, although I don't know if anybody actually saves Python code in UTF-16.

It might make sense to prohibit unicode_escape nowadays -- I think almost all systems now can handle Unicode properly, but I don't think we can go farther than that.
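The trailing-byte problem is easy to demonstrate from Python itself. A small probe (a hypothetical helper written for illustration) that looks for multibyte sequences whose last byte falls in the ASCII range:

    def ascii_trailing_bytes(encoding, stop_after=5):
        """Find characters whose encoding ends in an ASCII byte (< 0x80)."""
        hits = []
        for cp in range(0x3000, 0x10000):      # a slice of the BMP is enough
            ch = chr(cp)
            try:
                data = ch.encode(encoding)
            except UnicodeEncodeError:
                continue
            if len(data) > 1 and data[-1] < 0x80:
                hits.append((ch, data))
                if len(hits) >= stop_after:
                    break
        return hits

    for ch, data in ascii_trailing_bytes("shift_jis"):
        print(ch, data)

    # One well-known case is U+8868 表, whose Shift JIS encoding ends in 0x5C,
    # the byte that is a backslash in ASCII.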
On Wed, Nov 3, 2021 at 8:01 PM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
But I was surprised to find that Python would let you use unicode_escape for source code.
I'm not surprised. Today it's probably not necessary, but I've exchanged a lot of code (not Python, though) with folks whose editors were limited to 8 bit codes or even just ASCII. It wasn't frequent that I needed to discuss non-ASCII code with them (that they needed to run) but it would have been painful to do without some form of codec that encoded Japanese using only ASCII bytes.
Bearing in mind that string literals can always have their own escapes, this feature is really only important to the source code tokens themselves.
Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?
Not sure what you mean there. In the usual sense of ASCII-compatible (the ASCII bytes always mean the corresponding character in the ASCII encoding), I think there are at least two ASCII-incompatible encodings that would cause a lot of pain if they were prohibited, specifically Shift JIS and Big5. (In certain contexts in those encodings an ASCII byte frequently is a trailing byte in a multibyte character.)
Ah, okay, so much for that, then. What about the weaker sense: Characters below 128 are always and only represented by those byte values? So if you find byte value 39, it might not actually be an apostrophe, but if you're looking for an apostrophe, you know for sure that it'll be represented by byte value 39?
It might make sense to prohibit unicode_escape nowadays -- I think almost all systems now can handle Unicode properly, but I don't think we can go farther than that.
Yes. I'm sure someone will come along and say "but I have to have an all-ASCII source file, directly runnable, with non-ASCII variable names", because XKCD 1172, but I don't have enough sympathy for that obscure situation to want the mess that unicode_escape can give. ChrisA
Chris Angelico writes:
Ah, okay, so much for that, then. What about the weaker sense: Characters below 128 are always and only represented by those byte values? So if you find byte value 39, it might not actually be an apostrophe, but if you're looking for an apostrophe, you know for sure that it'll be represented by byte value 39?
1. The apostrophe that Python considers a string delimiter is always represented by byte value 39 in the compiler input. So the only time that wouldn't be true is if escape sequences are allowed to represent characters. I believe unicode_escape is the only codec that does.

2. There's always eval which will accept a string containing escape sequences.
Yes. I'm sure someone will come along and say "but I have to have an all-ASCII source file, directly runnable, with non-ASCII variable names", because XKCD 1172, but I don't have enough sympathy for that obscure situation to want the mess that unicode_escape can give.
It's not an obscure situation to me. As I wrote earlier, been there, done that, made my own T-shirt. I don't *think* it matters today, but the number of DOS machines and Windows 98 machines left in Japan is not zero. Probably they can't run Python 3, but that's not something I can testify to.
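A tiny illustration of Stephen's second point above: even in a pure-ASCII file, eval() will happily turn escape sequences into the characters they name (purely illustrative):

    # The argument is eight ASCII characters; eval() parses it as a string
    # literal, and the \u0027 escape becomes an apostrophe.
    s = eval(r"'\u0027'")
    print(repr(s))    # "'"
    print(s == "'")   # True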
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?

-- 
Marc-Andre Lemburg
eGenix.com
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.

On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
There are quite a few other PEPs that have similar sorts of advice, like PEP 257 on docstrings, and several of the type hinting PEPs. IMO it's fine. ChrisA
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!

On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO. I went with an informational PEP because it's quicker to publish.
On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. <david.mertz@gmail.com> wrote:
This is an amazing document, Petr. Really great work!
I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <mal@egenix.com> wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
03.11.21 12:36, Petr Viktorin wrote:
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO.
I went with an informational PEP because it's quicker to publish.
What is the supposed target audience of this document?

If it is core Python developers only, then a PEP is the right place to publish it. But I think that it rather describes potential issues in arbitrary Python projects, and as such, it will be more accessible as a part of the Python documentation (as a HOW-TO article perhaps).

AFAIK all other informational PEPs are about developing Python, not developing in Python (even if they are (mis)used (e.g. PEP 8) outside their scope).
On 03. 11. 21 12:33, Serhiy Storchaka wrote:
03.11.21 12:36, Petr Viktorin wrote:
On 03. 11. 21 2:58, Kyle Stanley wrote:
I'd suggest both: briefer, easier to read write up for average user in docs, more details/semantics in informational PEP. Thanks for working on this, Petr!
Well, this is the brief write-up :) Maybe it would work better if the info was integrated into the relevant parts of the docs, rather than be a separate HOWTO.
I went with an informational PEP because it's quicker to publish.
What is the supposed target audience of this document?
Good question! At this point it looks like it's linter authors.
If it is core Python developers only, then a PEP is the right place to publish it. But I think that it rather describes potential issues in arbitrary Python projects, and as such, it will be more accessible as a part of the Python documentation (as a HOW-TO article perhaps). AFAIK all other informational PEPs are about developing Python, not developing in Python (even if they are (mis)used (e.g. PEP 8) outside their scope).
There's a bunch of packaging PEPs, or a PEP on what the /usr/bin/python command should be. I think PEP 672 is in good company for now.
On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:
On 01.11.2021 13:17, Petr Viktorin wrote:
PEP: 9999 Title: Unicode Security Considerations for Python Author: Petr Viktorin <encukou@gmail.com> Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History:
Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation ?
There is already a "Unicode HOWTO". We could add "Unicode problems and pitfalls".

-- Terry Jan Reedy
participants (13)
- Chris Angelico
- Chris Jerdonek
- David Mertz, Ph.D.
- Jim J. Jewett
- Kyle Stanley
- Marc-Andre Lemburg
- Paul Moore
- Petr Viktorin
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Toshio Kuratomi