On 11/1/2021 8:17 AM, Petr Viktorin wrote:
Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.
Very helpful.
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.] There are at least three levels of handling right-to-left (r2l) characters: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk, and hence tkinter (and IDLE) text widgets, do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes. In extended handling, phrases ...
Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
In local handling, one sees `<hebrew-rtl> = 23`. In extended handling, one sees `23 = <hebrew-rtl>`. (Notepad++ sees backticks as quotes.)
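Whatever a given editor displays, the characters are stored in logical order, with the name first and the number last. A minimal sketch to check this, assuming only the standard `unicodedata` module; the helper name `bidi_classes` is made up for illustration, and the Hebrew name is written with escapes so the listing does not depend on how it is rendered:

```python
import unicodedata

# The statement from the example above, spelled with escapes so that this
# listing is unaffected by the display behaviour under discussion.
source = "\u05e2\u05e8\u05da = 23"


def bidi_classes(text):
    """Return (code point, bidi class) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Hebrew letters report class "R"; the digits report "EN".
for code_point, bidi_class in bidi_classes(source):
    print(code_point, bidi_class)

# Executing the statement binds the Hebrew name, regardless of display order.
namespace = {}
exec(source, namespace)
print(namespace["\u05e2\u05e8\u05da"])
```

The bidi classes are the input to the reordering done by display software; Python itself only ever sees the logical order.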
Source Encoding
---------------
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_).
Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something?
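To illustrate how liberal the declaration mechanism is, here is a rough sketch using the pattern as I recall it from PEP 263. It is an approximation, not the tokenizer's actual code (for instance, it does not stop after a non-comment first line), and `declared_encoding` and the sample comments are made up for illustration:

```python
import re

# Pattern as given in PEP 263; an assumption about what is accepted,
# not the interpreter's real implementation.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(lines):
    """Return the encoding named in the first two lines, or None."""
    for line in lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


samples = {
    "emacs style": ["# -*- coding: latin-1 -*-"],
    "vim style": ["#!/usr/bin/env python", "# vim: set fileencoding=utf-8 :"],
    # Ordinary-looking prose that happens to contain "coding:".
    "prose comment": ["# Notes on decoding: unicode_escape is used below."],
    "no declaration": ["import sys", "print(sys.argv)"],
}

for label, lines in samples.items():
    print(f"{label}: {declared_encoding(lines)}")
```

The "prose comment" case is the worrying one: any comment containing `coding:` or `coding=` followed by a codec name qualifies, whether or not it looks like a declaration.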
For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example::
I don't see the connection between the text above and the example that follows.
    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.
    [etc]
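The connection appears to be that the second comment line contains the text "encoding: unicode_escape", which would be picked up as an encoding declaration. Here is a small sketch of what that does to a string literal, decoding by hand rather than through a real source file; the sample line and the helper `run` are invented for illustration:

```python
# Bytes as they might sit on disk; the rb prefix keeps the backslashes literal.
raw_line = rb'greeting = "hello\u0022; hidden = True; rest = \u0022world"'

# What most tools assume: UTF-8 text with escapes inside one long string.
as_utf8 = raw_line.decode("utf-8")

# What the parser would receive under a unicode_escape declaration:
# each \u0022 has already become a real double quote.
as_escape = raw_line.decode("unicode_escape")


def run(source):
    """Execute source in a fresh namespace and return that namespace."""
    namespace = {}
    exec(source, namespace)
    return namespace


print(as_utf8)
print(as_escape)
print("hidden" in run(as_utf8))    # a single assignment to greeting
print("hidden" in run(as_escape))  # three statements, one of them hidden
```

A highlighter or linter that reads the file as UTF-8 sees one string literal on that line, while the interpreter would see three separate statements.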
-- Terry Jan Reedy