[issue13828] Further improve casefold documentation
New submission from Jim Jewett <jimjjewett@gmail.com>:
http://hg.python.org/cpython/rev/0b5ce36a7a24 changeset: 74515:0b5ce36a7a24
+ Casefolding is similar to lowercasing but more aggressive because it is
+ intended to remove all case distinctions in a string. For example, the German
+ lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already
+ lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold`
+ converts it to ``"ss"``.
Perhaps add the recommendation to canonicalize as well. A complete, but possibly too long, try is below:

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold` converts it to ``"ss"``.

Note that most case-insensitive matches should also match compatibility equivalent characters.

The casefolding algorithm is described in section 3.13 of the Unicode Standard. Per D146, a compatibility caseless match can be achieved by

    from unicodedata import normalize

    def caseless_compat(string):
        nfd_string = normalize("NFD", string)
        nfkd1_string = normalize("NFKD", nfd_string.casefold())
        return normalize("NFKD", nfkd1_string.casefold())

----------
assignee: docs@python
components: Documentation
messages: 151644
nosy: Jim.Jewett, benjamin.peterson, docs@python
priority: normal
severity: normal
status: open
title: Further improve casefold documentation
versions: Python 3.3

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13828>
_______________________________________
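To illustrate why the opening message recommends a compatibility step on top of casefold(), here is a small sketch. It reuses the caseless_compat function from the message unchanged; the fullwidth letter and the variable names fullwidth_a, plain_match and compat_match are choices made for this illustration, not part of the original proposal.

```python
from unicodedata import normalize


def caseless_compat(string):
    # Same steps as the function proposed in the opening message (D146).
    nfd_string = normalize("NFD", string)
    nfkd1_string = normalize("NFKD", nfd_string.casefold())
    return normalize("NFKD", nfkd1_string.casefold())


# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is compatibility-equivalent to "A".
fullwidth_a = "\uff21"

# casefold() alone only reaches the fullwidth small letter (U+FF41),
# so it does not compare equal to an ordinary "a".
plain_match = fullwidth_a.casefold() == "a".casefold()

# The compatibility caseless form maps both strings to the ordinary "a".
compat_match = caseless_compat(fullwidth_a) == caseless_compat("a")

print(plain_match, compat_match)
```

In this sketch the plain casefold() comparison fails while the compatibility form matches, which is the gap the proposed note is meant to point out.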
Jim Jewett <jimjjewett@gmail.com> added the comment:

Frankly, I do think that sample code is too long, but correctness matters ... perhaps a better solution would be to add either a method or a unicodedata function that does the work, then the extra note could just say

Note that most case-insensitive matches should also match compatibility equivalent characters; see unicodedata.compatibity_casefold

----------
Benjamin Peterson <benjamin@python.org> added the comment:

It's a bit unfriendly to launch into discussion of "compatibility caseless matching" when the new reader probably has no idea what "compatibility-equivalence" is.

----------
Mark Summerfield added the comment: I think the str.casefold() docs are fine as far as they go, rightly covering what it _does_ rather than _how_, yet providing a reference for the details. But what they lack is more complete information. For example I discovered this:
    >>> x = "ﬁles and shuﬄes"
    >>> x
    'ﬁles and shuﬄes'
    >>> x.casefold()
    'files and shuffles'
In view of this I would add one sentence:

In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

----------
nosy: +mark
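The ligature behaviour described above is easier to see when the ligatures are written as explicit escapes, since they are easily lost in copy and paste. This is only an illustrative sketch; the sample string and the variable names text, lowered and folded are assumptions made for the example.

```python
import unicodedata

# U+FB01 and U+FB04 are single code points that render like "fi" and "ffl".
text = "\ufb01les and shu\ufb04es"

for ch in ("\ufb01", "\ufb04"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# lower() leaves the ligatures alone: they are already lowercase.
lowered = text.lower()

# casefold() expands each ligature into its component letters.
folded = text.casefold()

print(lowered == text)
print(folded)
```

Because the folded string is longer than the input, code that relies on character offsets should not assume casefold() preserves length.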
Raymond Hettinger added the comment:
In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
+1 I would have found that sentence to be helpful.

----------
nosy: +rhettinger
Marc Richter <marc.richter.1982@googlemail.com> added the comment:

+1 as well. To be honest, I still do not understand in detail what this function does. Not too long ago (2017), an uppercase variant of the special letter from this function's example (ß) was added to the official German orthography [1]. Does this function's behavior need to change now, or does the current behavior remain the expected one? I'm still puzzled, and I think the whole function should get a clearer description.

[1]: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

----------
nosy: +Marc Richter
Cheryl Sabella <cheryl.sabella@gmail.com> added the comment:

Assigning to @Mariatta for the sprints.

----------
assignee: docs@python -> Mariatta
nosy: +Mariatta, cheryl.sabella
stage: -> needs patch
versions: +Python 3.7, Python 3.8 -Python 3.3
Thorsten <mrsupertash@gmail.com> added the comment:

The German example in the casefolding documentation is plain incorrect.

#Casefolding is similar to lowercasing but more aggressive because it is
#intended to remove all case distinctions in a string. For example, the
#German lowercase letter 'ß' is equivalent to "ss". Since it is already
#lowercase, lower() would do nothing to 'ß'; casefold() converts it to
#"ss".

It is not true that "ß" is equivalent to "ss", and it has not been since an orthography reform in 1996. The two are used in distinct cases: "ß" after a diphthong or a long/open vowel, "ss" after a short/closed vowel.

The documentation correctly describes (in this case) how Python handles .casefold() for this letter, although the behavior itself is incorrect. As mentioned before, in 2017 an official upper-case version of "ß" was introduced into German orthography: "ẞ".

The German example should be described in the documentation as current incorrect behavior.

+1 to adding the previously mentioned sentence:

In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

----------
nosy: +MrSupertash
Benjamin Peterson <benjamin@python.org> added the comment:

Correctness of casefolding is defined by the Unicode standard, which currently states that "ß" folds to "ss".

----------
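For readers following the ß discussion, the behaviour that the Unicode data prescribes can be checked directly. This is a minimal sketch; the variable names and the "Straße" sample word are assumptions made for the illustration, and it describes what Python's str methods return, not what German orthography requires.

```python
sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S (ß)
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S (ẞ)

# lower() leaves ß unchanged because it is already lowercase.
lower_result = sharp_s.lower()

# casefold() applies full case folding, which maps ß to "ss".
fold_result = sharp_s.casefold()

# The capital form introduced in 2017 folds to the same two letters.
capital_fold = capital_sharp_s.casefold()

# As a result, these two spellings compare equal after folding.
same_word = "Stra\u00dfe".casefold() == "STRASSE".casefold()

print(lower_result, fold_result, capital_fold, same_word)
```

Folding is intended for caseless comparison of strings, so the folded result is a comparison key rather than a spelling recommendation.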
Thorsten <mrsupertash@gmail.com> added the comment: I see. I found the documents. That's an issue. That usage is incorrect. It is still valid to upper case "ß" to SS since "ẞ" is fairly new as an official German character, but the other way around is not valid. As such the current sentence in documentation also just does not make sense.
"Since it is already lowercase, lower() would do nothing to 'ß'"
Exactly. Why would it? It is nonsensical to change an already lowercase character with a lowercase function.

Suggest to update to:

"For example, the Unicode standard for German lower case letter 'ß' prescribes full casefolding to 'ss'. Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to 'ss'. In addition to full lowercasing, this function also expands ligatures, for example, 'ﬁ' becomes 'fi'."

----------
Jim Jewett <jimjjewett@gmail.com> added the comment:

Unicode probably won't make the correction, because of backwards compatibility.

I do support the sentence suggested in Thorsten's most recent reply (sent Mon, Aug 24, 2020 at 11:02 AM). Is expanding ligatures the only other normalization it does?

Ideally, we should also mention that it shifts to the canonical case, which is usually (but not always) lowercase. I think Cherokee is one that folds to the upper case.

----------
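The Cherokee remark above can be checked with a short sketch. It assumes a Python build whose Unicode database includes the Cherokee small letters (Unicode 8.0 or later, which should be the case for Python 3.5 and newer); the variable names are chosen for this illustration only.

```python
import unicodedata

small_a = "\uab70"    # CHEROKEE SMALL LETTER A
capital_a = "\u13a0"  # CHEROKEE LETTER A

print(unicodedata.name(small_a), "/", unicodedata.name(capital_a))

# Case folding maps the small letter to the capital form,
# the reverse of the usual direction.
folded_small = small_a.casefold()

# The capital letter is already in folded form.
folded_capital = capital_a.casefold()

# lower() still goes the conventional way, from capital to small.
lowered_capital = capital_a.lower()

print(folded_small == capital_a, folded_capital == capital_a)
```

So casefold() is not simply a stronger lower(): for these letters the two methods move in opposite directions.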
participants (7)
- Benjamin Peterson
- Cheryl Sabella
- Jim Jewett
- Marc Richter
- Mark Summerfield
- Raymond Hettinger
- Thorsten