[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2 Nov 2021

      On 11/1/2021 8:17 AM, Petr Viktorin wrote:
...
Nevertheless, I did do a bit of research about similar gotchas in 
Python, and I'd like to publish a summary as an informational PEP, 
pasted below.
Very helpful.
...
...
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]

There are at least three levels of handling right-to-left (r2l) chars:
none, local (contiguous sequences are properly reversed), and extended
(see below).  The handling depends on the display software and may
depend on the quoting.  Tk, and hence tkinter (and IDLE) text widgets,
do local handling.  Windows Notepad++ does local handling of unquoted
code but extended handling of quoted text.  Windows Notepad currently
does extended handling even without quotes.

In extended handling, phrases ...
...
...
Phrases in such scripts interact with nearby text in ways that can be
surprising to people who aren't familiar with these writing systems and
their computer representation.
The exact process is complicated, and explained in Unicode® Standard
Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the 
integer 23.
In local handling, one sees <hebrew-rtl> = 23.  In extended handling,
one sees 23 = <hebrew-rtl>.  (Notepad++ sees backticks as quotes.)
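Whatever the display shows, the stored (logical) order is what Python
parses.  A minimal sketch of checking that, assuming only the standard
library; the identifier is written with escapes so that this example
itself displays the same everywhere, and the names ``name``, ``source``
and ``namespace`` are just for illustration:

```python
import unicodedata

# The three Hebrew letters of the identifier, in logical (stored) order.
name = "\u05e2\u05e8\u05da"
source = f"{name} = 23"

# Bidirectional class of each character: 'R' is strong right-to-left,
# 'EN' is a European number, 'ON' and 'WS' are neutrals.
classes = [unicodedata.bidirectional(ch) for ch in source]
print(classes)

# Python executes the logical order: the name is bound to 23,
# regardless of how an editor chooses to render the line.
namespace = {}
exec(source, namespace)
print(namespace[name])
```

The strong 'R' letters followed by neutrals and digits are what let a
display engine with extended handling move the number to the other side.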
...
...
Source Encoding
---------------
The encoding of Python source files is given by a specific regex on the
first two lines of a file, as per `Encoding declarations`_.
This mechanism is very liberal in what it accepts, and thus easy to
obfuscate.
This can be misused in combination with Python-specific special-purpose
encodings (see `Text Encodings`_).
Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to 
something?
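To make "very liberal" concrete, here is a rough sketch.  The pattern is
the one quoted in the language reference for encoding declarations; the
sample comment lines are made up for illustration, and
``tokenize.detect_encoding`` is used only as a cross-check of what the
standard library itself reports for such a line:

```python
import io
import re
import tokenize

# Pattern quoted in the language reference for encoding declarations.
DECL = re.compile(r"coding[=:]\s*([-\w.]+)")

# Hypothetical first lines of a file; each is an ordinary-looking comment.
samples = [
    "# -*- coding: utf-8 -*-",
    "# vim: set fileencoding=latin-1 :",
    "# Note on decoding: unicode_escape is not needed here.",
]
found = [DECL.search(line).group(1) for line in samples]
print(found)

# Cross-check the last sample with the standard library's own detection.
data = samples[2].encode("ascii") + b"\nx = 1\n"
detected, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
print(detected)
```

The third sample reads like prose, yet it still selects an encoding,
because the pattern only needs ``coding:`` to appear somewhere in the
comment.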
...
...
For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::
I don't see the connection between the text above and the example that 
follows.
...
...
    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
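One possible way to show the connection between the encoding and hidden
quotes, as a sketch only: decode the same bytes once as a tool that
ignores the declaration would, and once with ``unicode_escape``, then
execute both.  The statement and the names ``value`` and ``injected``
are invented for this illustration:

```python
# Raw source bytes: the backslash escapes are literally present in the file.
raw = rb'value = "hello\x22; injected = \x22world"'

# A tool that ignores the declared encoding sees one long string literal.
naive = raw.decode("ascii")

# With unicode_escape, \x22 is turned into a real quote *before* parsing,
# so the single literal becomes two separate assignments.
decoded = raw.decode("unicode_escape")
print(decoded)

ns_naive, ns_decoded = {}, {}
exec(naive, ns_naive)
exec(decoded, ns_decoded)

print(ns_naive["value"])      # one string containing quotes and a semicolon
print(ns_decoded["value"], ns_decoded["injected"])
```

A highlighter or linter that follows the naive reading treats everything
between the outer quotes as string content, while the interpreter runs
the second assignment as code.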
-- 
Terry Jan Reedy
