On 15.11.2021 12:36, Steven D'Aprano wrote:
On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
I am, however, surprised and disappointed by the NFKC normalization.
For example, in writing math we often use different scripts to mean different things (e.g. TeX's Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.
Hmmm... would you really want these to all be different identifiers?
𝕭 𝓑 𝑩 𝐁 B
You're assuming the reader of the code has the right typeface to view them (rather than as mere boxes), and that their eyesight is good enough to distinguish the variations even if their editor applies bold or italic as part of syntax highlighting. That's very bold of you :-)
In any case, the question of NFKC versus NFC was certainly considered, but unfortunately PEP 3131 doesn't document why NFKC was chosen.
https://www.python.org/dev/peps/pep-3131/
Before we change the normalisation rules, it would probably be a good idea to trawl through the archives of the mailing list and work out why NFKC was chosen in the first place, or contact Martin von Löwis and see if he remembers.
This was raised in the discussion, but never conclusively answered:
https://mail.python.org/pipermail/python-3000/2007-May/007995.html
NFKC is the standard normalization form when you want to remove any typography-related variants/hints from the text before comparing strings. See http://www.unicode.org/reports/tr15/
I guess that's why Martin chose this form, since the point was to maintain readability, even if different variants of a character are used in the source code. A "B" in the source code should be interpreted as an ASCII B, even when written as 𝕭 𝓑 𝑩 or 𝐁.
This simplifies writing code and does away with many of the security issues you could otherwise run into (where e.g. the absence of an identifier causes the application flow to be different).
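To make the NFC versus NFKC difference concrete, here is a small sketch using the standard `unicodedata` module. The four code points are my own pick of Mathematical Alphanumeric variants of capital B, written as escapes so the example does not depend on fonts or mail encoding:

```python
import unicodedata

# Four Mathematical Alphanumeric variants of capital B (chosen for
# illustration), given as escapes rather than literal characters.
variants = [
    "\U0001D56D",  # MATHEMATICAL BOLD FRAKTUR CAPITAL B
    "\U0001D4D1",  # MATHEMATICAL BOLD SCRIPT CAPITAL B
    "\U0001D469",  # MATHEMATICAL BOLD ITALIC CAPITAL B
    "\U0001D401",  # MATHEMATICAL BOLD CAPITAL B
]

for ch in variants:
    nfc = unicodedata.normalize("NFC", ch)    # canonical composition only
    nfkc = unicodedata.normalize("NFKC", ch)  # also folds compatibility variants
    print(unicodedata.name(ch), "| NFC unchanged:", nfc == ch, "| NFKC:", repr(nfkc))
```

NFC leaves each variant as it is, while NFKC folds every one of them to the plain ASCII "B", which is the behaviour under discussion.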
Then there's the question of when this normalization happens (and when it doesn't).
It happens in the parser when reading a non-ASCII identifier (see Parser/pegen.c), so it only applies to source code, not to attributes you dynamically add to e.g. class or module namespaces.
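A rough way to see this from Python itself, as a sketch rather than anything authoritative about the parser internals: compile a snippet that assigns to a non-ASCII identifier, then compare with a name set via `setattr()`. The names `source`, `namespace` and `obj` are made up for the example.

```python
import types

BOLD_B = "\U0001D401"  # MATHEMATICAL BOLD CAPITAL B

# 1. Identifier in source code: the parser normalizes it to NFKC,
#    so the name that ends up in the namespace is plain "B".
source = BOLD_B + " = 42\n"
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
print("B" in namespace, BOLD_B in namespace)

# 2. Attribute added dynamically: setattr() takes the string as-is,
#    so the un-normalized name is stored and "B" does not exist.
obj = types.SimpleNamespace()
setattr(obj, BOLD_B, 1)
print(hasattr(obj, "B"), BOLD_B in vars(obj))

# 3. Accessing that attribute from source code goes through the parser
#    again, gets normalized to obj.B, and therefore fails.
try:
    eval("obj." + BOLD_B, {"obj": obj})
    lookup_failed = False
except AttributeError:
    lookup_failed = True
print(lookup_failed)
```

So an attribute stored under an un-normalized name is only reachable through `getattr()` or the namespace dict, not through ordinary attribute syntax.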
--
Marc-Andre Lemburg
eGenix.com