
On Wed, Dec 21, 2022 at 01:18:46AM -0500, David Mertz, Ph.D. wrote:
> I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?!
It has to be. Turkic languages like Turkish, Azerbaijani and Tatar distinguish dotted and non-dotted I's, leading to a slew of problems infamously known as "The Turkish I problem".

(Other languages use undotted i's but not in the same way, e.g. Irish roadsigns in Gaelic usually drop the dot to avoid confusion with í. And don't confuse the undotted i with the Latin iota ɩ, which is a completely different letter to the Greek iota ι. Alphabets are hard.)

In Turkic languages, we have:

    Letter:      ı    I    i    İ
    -----------  ---  ---  ---  ---
    Lowercase:   ı    ı    i    i
    Uppercase:   I    I    İ    İ

Swapping case can never add or remove a dot. (The technical name for the dot is "tittle".) Which is perfectly logical, of course.

But most other people with Latin-based alphabets mix the dotted and dotless letters together, leading to this lossy table:

    Letter:      ı    I    i    İ
    -----------  ---  ---  ---  ---
    Lowercase:   ı    i    i    i
    Uppercase:   I    I    I    İ

which is the official Unicode case conversion, which Python follows.
    >>> "ıIiİ".lower()
    'ıiii̇'
    >>> "ıIiİ".upper()
    'IIIİ'
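The last character of the lowercased result is easy to misread in a terminal, so here is a small sketch that spells out the code points. It only uses the standard `unicodedata` module; the variable names are mine, chosen for illustration.

```python
import unicodedata

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, lowercased with the
# default (locale-independent) rules that str.lower() applies.
lowered = "\u0130".lower()

# Print each code point of the result with its Unicode name.
for ch in lowered:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The dotless i does not survive a round trip through upper():
# it comes back as a plain dotted i.
round_trip = "\u0131".upper().lower()
print(round_trip == "\u0131")
```

So the lowercase of İ is not a single letter but a plain i followed by a combining dot above, which is why the string appears to grow by one character.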
Just to make the Turkish I problem even more exciting, you aren't supposed to use Turkish rules when changing the case of foreign proper nouns. So the popular children's book "Alice Harikalar Diyarında" (Alice in Wonderland) should use *both* sets of rules when uppercasing to give "ALICE HARİKALAR DİYARINDA".

Sometimes the dot can be very significant.

https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-3...
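For what it's worth, the mixed-rule uppercasing of that book title can be sketched in a few lines. This is only an illustration under stated assumptions: `turkic_upper`, `turkic_lower` and `upper_title` are hypothetical helpers, not anything in the standard library, the input is assumed to use precomposed letters (NFC), and the caller has to say which words are foreign, since nothing in the text itself tells you.

```python
# Hypothetical helpers, not part of the standard library.
# Map the four Turkic I letters explicitly before applying the default rules.
_TURKIC_UPPER = str.maketrans({"i": "İ", "ı": "I"})
_TURKIC_LOWER = str.maketrans({"İ": "i", "I": "ı"})


def turkic_upper(text):
    """Uppercase with Turkic rules: i -> İ and ı -> I."""
    return text.translate(_TURKIC_UPPER).upper()


def turkic_lower(text):
    """Lowercase with Turkic rules: İ -> i and I -> ı."""
    return text.translate(_TURKIC_LOWER).lower()


def upper_title(text, foreign=frozenset()):
    """Uppercase word by word; words listed in `foreign` use default rules."""
    return " ".join(
        word.upper() if word in foreign else turkic_upper(word)
        for word in text.split(" ")
    )


title = "Alice Harikalar Diyarında"
print(upper_title(title, foreign={"Alice"}))
```

Translating the I letters first and only then calling `upper()` or `lower()` means every other character still gets the normal Unicode mapping.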
> That feels like a BPO needed if true.
We do whatever the Unicode standard says to do. They say that localisation issues are out of scope for Unicode.

-- 
Steve