[Python-ideas] adding a casefold() method to str

Steven D'Aprano steve at pearwood.info
Sun Jan 8 17:58:17 CET 2012


Benjamin Peterson wrote:
> Hi,
> Casefolding (Unicode Standard 3.13) is a more aggressive version of lowercasing.
> Its purpose is to assist in the implementation of caseless matching. For example,
> under lowercasing "ß" -> "ß" but under casefolding "ß" -> "ss". I propose we add a
> casefold() method. So, case-insensitive matching should really be
> "one.casefold() == two.casefold()"
> rather than "one.lower() == two.lower()".

+1 in principle, but in practice case folding is more complicated than a 
single method might imply. The most obvious complication is the treatment 
of dotted and dotless I, which Turkic languages treat as distinct letters.
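
To make the dotted/dotless I problem concrete, here is a minimal sketch 
using only the existing str.lower() and str.upper(). It assumes Python 3.3 
or later, where these methods apply the full, language-neutral Unicode 
mappings. naive_equal() is a hypothetical helper name, and "IŞIK"/"ışık" 
is just a sample Turkish word.

```python
# Language-neutral case mappings built into str; they know nothing
# about Turkish or Azeri conventions.
DOTLESS_SMALL_I = "\u0131"   # ı LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Upper-casing collapses both small letters onto the same capital.
assert "i".upper() == "I"
assert DOTLESS_SMALL_I.upper() == "I"

# Lower-casing "I" always gives dotted "i"; Turkish expects "ı" here.
assert "I".lower() == "i"

# İ lower-cases to "i" followed by U+0307 COMBINING DOT ABOVE.
assert DOTTED_CAPITAL_I.lower() == "i\u0307"


def naive_equal(one, two):
    """Hypothetical helper: caseless comparison via lower() only."""
    return one.lower() == two.lower()


# A Turkish reader considers these the same word in different case,
# but the language-neutral mapping turns "I" into "i", not "ı".
print(naive_equal("IŞIK", "ışık"))
```

No single language-neutral mapping can satisfy both English and Turkish 
expectations for "I", which is why CaseFolding.txt lists the Turkic 
mappings separately.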

See, for example:

http://unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.w3.org/International/wiki/Case_folding
http://en.wikipedia.org/wiki/Letter_case#Unicode_case_folding_and_script_identification

So while having proper Unicode case-folding is desirable, I don't know how 
simple it is to implement.

Would it be appropriate for casefold() to take an optional argument 
specifying which mappings to use? E.g. something like:

str.casefold()  # defaults to simple folding
str.casefold(string.SIMPLE | string.TURKIC)
str.casefold(string.FULL)

or should str.casefold() only apply simple folding, with the other 
combinations relegated to a function in a module somewhere?

I count 4 possible functions:

simple casefolding, without Turkic I
full casefolding, without Turkic I
simple casefolding, with Turkic I
full casefolding, with Turkic I
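
The four variants above can be sketched as one table-driven helper. This is 
only an illustration, not a proposed implementation: build_table() and 
fold() are hypothetical names, and SAMPLE is a small hand-copied excerpt in 
the CaseFolding.txt line format rather than the full data file. It relies 
on the status codes that file defines: C (common), F (full), S (simple) and 
T (Turkic).

```python
# Hand-picked excerpt in the CaseFolding.txt line format:
#   <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_table(text, full=False, turkic=False):
    """Build a char -> folded-string dict from CaseFolding.txt-style text.

    Simple folding uses statuses C + S, full folding uses C + F.
    With turkic=True, the T mappings override the others.
    """
    wanted = {"C", "F" if full else "S"}
    table = {}
    turkic_overrides = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        source = chr(int(code, 16))
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        if status in wanted:
            table[source] = folded
        elif status == "T" and turkic:
            turkic_overrides[source] = folded
    table.update(turkic_overrides)  # applied last so T wins over C/F
    return table


def fold(s, table):
    """Fold s one code point at a time; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


# The four combinations counted above.
simple = build_table(SAMPLE)
full = build_table(SAMPLE, full=True)
simple_turkic = build_table(SAMPLE, turkic=True)
full_turkic = build_table(SAMPLE, full=True, turkic=True)
```

With these tables, fold("ß", simple) leaves "ß" alone while 
fold("ß", full) gives "ss", and fold("I", simple_turkic) gives dotless "ı" 
where fold("I", simple) gives "i".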




-- 
Steven


