adding a casefold() method to str

Hi,

Casefolding (Unicode Standard, section 3.13) is a more aggressive version of lowercasing. Its purpose is to assist in the implementation of caseless matching. For example, under lowercasing "ß" -> "ß", but under casefolding "ß" -> "ss". I propose we add a casefold() method to str. Case-insensitive matching should then really be "one.casefold() == two.casefold()" rather than "one.lower() == two.lower()".

Regards,
Benjamin
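To make the difference concrete, here is a minimal sketch of the comparison described above. It assumes an interpreter where str.casefold() already exists (Python 3.3 or later); the sample words are only illustrations.

```python
# Assumption: str.casefold() is available (Python 3.3+).
one = "Straße"
two = "STRASSE"

# lower() leaves "ß" alone, so the two spellings do not compare equal.
lower_match = one.lower() == two.lower()

# casefold() maps "ß" to "ss", so the caseless comparison succeeds.
fold_match = one.casefold() == two.casefold()

print(lower_match, fold_match)
```

The lower() comparison fails only because of the "ß"; every other character in the two strings already matches after lowercasing.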

Benjamin Peterson wrote:
+1 in principle, but in practice case folding is more complicated than a single method might imply. The most obvious complication is the treatment of dotted and dotless I. See, for example:

http://unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.w3.org/International/wiki/Case_folding
http://en.wikipedia.org/wiki/Letter_case#Unicode_case_folding_and_script_ide...

So while having proper Unicode case folding is desirable, I don't know how simple it is to implement. Would it be appropriate for casefold() to take an optional argument as to which mappings to use? E.g. something like:

    str.casefold()  # defaults to simple folding
    str.casefold(string.SIMPLE & string.TURKIC)
    str.casefold(string.FULL)

or should str.casefold() only apply simple folding, with the other combinations relegated to a function in a module somewhere?

I count 4 possible functions:

    simple casefolding, without Turkic I
    full casefolding, without Turkic I
    simple casefolding, with Turkic I
    full casefolding, with Turkic I

--
Steven
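The four combinations can be sketched roughly as follows. This is an illustration only, not a proposed implementation: the function name casefold_variant and its full/turkic keywords are hypothetical, the tables are a tiny hand-copied excerpt of CaseFolding.txt, and the single-character str.lower() fallback merely approximates the common (C) mappings.

```python
# Hypothetical sketch; a real version would load all of CaseFolding.txt.
TURKIC = {"I": "\u0131", "\u0130": "i"}            # status T
FULL = {"\u00DF": "ss", "\u1E9E": "ss",            # status F
        "\u0130": "i\u0307"}
SIMPLE = {"\u1E9E": "\u00DF"}                      # status S


def casefold_variant(text, *, full=False, turkic=False):
    """Fold text using one of the four simple/full x Turkic combinations."""
    out = []
    for ch in text:
        if turkic and ch in TURKIC:
            out.append(TURKIC[ch])
        elif full and ch in FULL:
            out.append(FULL[ch])
        elif not full and ch in SIMPLE:
            out.append(SIMPLE[ch])
        else:
            # Approximation of the common mappings: accept the lowercase
            # form only when it is still a single character.
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


print(casefold_variant("Straße"))                # simple, no Turkic I
print(casefold_variant("Straße", full=True))     # full, no Turkic I
print(casefold_variant("DIYARBAKIR", turkic=True))
```

Simple folding never changes the length of the string, which is why "ß" survives it; only full folding may expand one character into several.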

Steven D'Aprano <steve@...> writes:
or should str.casefold() only apply simple folding, with the other combinations relegated to a function in a module somewhere?
Yes, I think so. str does not have any other features dependent on locale. Section 3.13 defines "Default casefolding", which is what the casefold() method should use.
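As a rough illustration of what locale-independent default folding means for the Turkic letters, the snippet below shows how str.casefold() behaves in Python 3.3 and later, where the method exists. The variable names are only for illustration.

```python
# Assumption: str.casefold() is available (Python 3.3+).
capital_i = "I"             # LATIN CAPITAL LETTER I
dotted_capital = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# Default folding ignores the Turkic (T) mappings: "I" always folds to "i".
folded_i = capital_i.casefold()

# The dotted capital folds to "i" followed by COMBINING DOT ABOVE (U+0307).
folded_dotted = dotted_capital.casefold()

# The dotless small letter has no default folding and is left unchanged.
folded_dotless = dotless_small.casefold()

print(ascii(folded_i), ascii(folded_dotted), ascii(folded_dotless))
```

Code that needs Turkish or Azeri matching rules would have to apply the Turkic mappings itself before comparing.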
