Is unicode.lower() locale-independent?
robert.kern at gmail.com
Sat Jan 12 12:39:17 CET 2008
John Machin wrote:
> On Jan 12, 8:25 pm, Robert Kern <robert.k... at gmail.com> wrote:
>> The section on "String Methods" in the Python documentation states that for
>> the case conversion methods like str.lower(), "For 8-bit strings, this method is
>> locale-dependent." Is there a guarantee that unicode.lower() is locale-independent?
>> The section on "Case Conversion" in PEP 100 suggests this, but the code itself
>> looks like it may call the C function towlower() if it is available. On OS X
>> Leopard, the manpage for towlower(3) states that it "uses the current locale"
>> though it doesn't say exactly *how* it uses it.
>> This is the bug I'm trying to fix:
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
That doesn't determine the behavior of unicode.lower(), I don't think. That
specifies semantics for when one is dealing with a given language in the
abstract. That doesn't specify concrete behavior with respect to a given locale
setting on a real computer. For example, my strings 'VOID', 'INT', etc. are all
English, and I want English case behavior. The language of the data and the
transformations I want to apply to the data is English even though the user may
have set the locale to something else.
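A minimal sketch of that kind of English-only casing, assuming a current Python 3 interpreter where str is the Unicode type. The name ascii_lower is hypothetical; it is not part of numpy or the standard library. It maps only the ASCII letters A-Z and leaves every other character untouched, so the result cannot depend on the locale:

```python
import string

# Translation table covering only ASCII A-Z -> a-z.
# string.ascii_uppercase / ascii_lowercase are fixed constants,
# so nothing here consults the locale.
_ASCII_LOWER = {
    ord(upper): ord(lower)
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
}


def ascii_lower(text):
    """Lowercase ASCII letters only (hypothetical helper for illustration)."""
    return text.translate(_ASCII_LOWER)


if __name__ == "__main__":
    for name in ("VOID", "INT", "I\u0130"):
        print(repr(name), "->", repr(ascii_lower(name)))
```

Characters outside A-Z, such as U+0130, pass through unchanged, which is the intended trade-off for identifiers that are known to be English.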
> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
> The conversions for I and i are not correct for a Turkish locale.
> I don't know how to repeat the above in a Turkish locale.
If you have the correct locale data in your operating system, this should be
sufficient, I believe:
$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions. Do you still need to find out whether Python unicode
> casings are locale-dependent?
I would still like to know. There are other places where .lower() is used in
numpy, not to mention the rest of my code.
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco