Is unicode.lower() locale-independent?
sjmachin at lexicon.net
Sun Jan 13 02:12:51 CET 2008
On Jan 13, 10:31 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > The Unicode standard says that case mappings are language-dependent.
> I think you are misreading it.
Ummm well, it does say "normative" as opposed to Fredrik's
> 5.18 "Implementation Guides" says
> (talking about "most environments") "In such cases, the
> language-specific mappings *must not* be used." (emphasis also
> in the original spec).
Here is the paragraph from which you quote:
In most environments, such as in file systems, text is not and cannot
be tagged with language information. In such cases, the language-
specific mappings /must not/ be used. Otherwise, data structures such
as B-trees might be built based on one set of case foldings and used
based on a different set of case foldings. This discrepancy would
cause those data structures to become corrupt. For such environments,
a constant, language-independent, default case folding is required.
This is from the middle of a section titled "Caseless Matching"; this
Caseless matching is implemented using case folding, which is the
process of mapping strings to a canonical form where case differences
are erased. Case folding allows for fast
caseless matches in lookups because only binary comparison is
required. It is more than just conversion to lowercase. For example,
it correctly handles cases such as the Greek sigma, so that
<scrambled_in_transmission1> and <scrambled_in_transmission2> will
Python doesn't offer a foldedcase method, and the attitude of 99% of
users would be YAGNI; use this:
foldedcase = lambda x: x.lower()
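For readers on Python 3.3 or later, where str.casefold() was added after this post was written, here is a rough sketch of how folding differs from plain lowercasing. The name foldedcase is the one used above; the sample characters are illustrative choices, not taken from the spec text quoted here.

```python
# Sketch: lower() vs. casefold() in Python 3.3+ (str.casefold did not exist in 2008).

def foldedcase(s):
    """Fold case for caseless matching; the result is not meant for display."""
    return s.casefold()

final_sigma = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
plain_sigma = "\u03c3"   # GREEK SMALL LETTER SIGMA

# Both characters are already lowercase, so lower() keeps them distinct...
lower_match = final_sigma.lower() == plain_sigma.lower()

# ...whereas case folding maps final sigma onto plain sigma.
folded_match = foldedcase(final_sigma) == foldedcase(plain_sigma)

# Folding can also change the length of a string: sharp s folds to "ss".
sharp_s_folded = foldedcase("\u00df")
```

So the one-line lambda above is a reasonable YAGNI approximation, but it would treat the two sigmas as different keys.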
What the paragraph you quoted seems to be warning about is that people
who do implement a fully-principled foldedcase using the Unicode
CaseFolding.txt file should be careful about offering foldedcaseTurkic
and foldedcaseLithuanianDictionary -- both dangerous and YAGNI**2.
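To make the danger concrete, here is a toy sketch of an index built under one folding and queried under another. foldedcase_turkic is hypothetical and only approximates the two Turkic ("T") entries in CaseFolding.txt; it assumes Python 3.3+ and is not a complete or authoritative implementation.

```python
# Toy illustration only: a dict stands in for the B-tree in the quoted paragraph.

def foldedcase_default(s):
    # Language-independent default folding (Python 3.3+).
    return s.casefold()

def foldedcase_turkic(s):
    # Hypothetical helper: apply the Turkic mappings
    # (capital I -> dotless i, capital I with dot above -> i),
    # then fall back to the default folding for everything else.
    s = s.replace("\u0049", "\u0131").replace("\u0130", "\u0069")
    return s.casefold()

# The index is built with the Turkic folding...
index = {foldedcase_turkic("TITLE"): "record 1"}

# ...but queried with the default folding, so the key is not found.
hit_default = index.get(foldedcase_default("title"))

# Querying with the same folding that built the index still works.
hit_turkic = index.get(foldedcase_turkic("TITLE"))
```

The data itself is unchanged; only the folding differs between build time and lookup time, which is the mismatch the quoted paragraph warns about.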
This topic seems quite different from the question of whether the
result of unicode.lower() does/should depend on the locale.