[Python-Dev] logging module broken because of locale

M.-A. Lemburg mal at egenix.com
Tue Jul 18 23:03:54 CEST 2006


Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> The Unicode database OTOH *defines* the upper/lower case mapping in
>> a locale independent way, so the mappings are guaranteed
>> to always produce the same results on all platforms.
> 
> Actually, that isn't the full truth; see UAX#21, which is now official
> part of Unicode 4. It specifies two kinds of case conversion:
> simple case conversion, and full case conversion. Python only supports
> simple case conversion at the moment. Full case conversion is context
> (locale) dependent, and must take into account SpecialCasing.txt.

Right. In fact, some case mappings are not available in the Unicode
database at all, since it only contains one-to-one mappings, i.e. ones
which don't increase or decrease the length of the Unicode string. A
typical example is the German u'ß': u'ß'.upper() would have to give
u'SS', but instead returns u'ß'.
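
To illustrate the difference, here is a minimal sketch that contrasts a
one-to-one mapping with one that may change the string length. The
table contents and the names simple_upper, full_upper, SIMPLE_UPPER and
FULL_UPPER are made up for this example; they are not part of Python or
of the Unicode data files.

```python
# Hypothetical excerpt of a one-to-one ("simple") uppercase table.
SIMPLE_UPPER = {"s": "S", "t": "T", "r": "R", "a": "A", "e": "E"}

# Hypothetical excerpt of the length-changing mappings
# (the kind of entry that SpecialCasing.txt carries).
FULL_UPPER = {"\u00df": "SS"}  # LATIN SMALL LETTER SHARP S


def simple_upper(text):
    """One code point in, one code point out; unmapped characters pass through."""
    return "".join(SIMPLE_UPPER.get(ch, ch) for ch in text)


def full_upper(text):
    """Prefer a length-changing mapping if there is one, else fall back."""
    return "".join(FULL_UPPER.get(ch, SIMPLE_UPPER.get(ch, ch)) for ch in text)


word = "stra\u00dfe"
print(simple_upper(word))  # sharp s is left alone, length stays 6
print(full_upper(word))    # sharp s becomes "SS", length grows to 7
```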

However, the point I wanted to make was that these mappings don't depend
on the locale setting of the C lib - you have to explicitly
access the mapping in the context of a locale and/or text.

As an example, here's the definition for the dotted/dotless i's in
Turkish taken from that file
(http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt):

"""
# The entries in this file are in the following machine-readable format:
#
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
#

...

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above,
# which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a
# dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

# Note: the following case is already in the UnicodeData file.

# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
"""
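
The entry format quoted above is easy to read by machine. The following
is only a sketch of a line parser; the names SpecialCasing,
parse_special_casing_line and _to_text are invented for the example and
it does no validation beyond counting fields.

```python
from collections import namedtuple

# Hypothetical record type for one entry of the file.
SpecialCasing = namedtuple("SpecialCasing", "code lower title upper conditions")


def _to_text(field):
    """Turn a field such as '0053 0053' into the string it denotes ('' if empty)."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing_line(line):
    """Parse one line; return None for comments, blank lines and other non-entries."""
    data = line.split("#", 1)[0]            # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 5:                     # four mappings plus the final ';'
        return None
    code, lower, title, upper = fields[:4]
    # The condition list is optional; when present it is the fifth field.
    conditions = fields[4].split() if len(fields) > 5 else []
    return SpecialCasing(_to_text(code), _to_text(lower),
                         _to_text(title), _to_text(upper), conditions)


entry = parse_special_casing_line(
    "0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I")
print(entry.conditions)  # ['tr', 'Not_Before_Dot']
```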

Note how the context of the usage of the code points matters
when doing case-conversions.
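
As a rough sketch of what "explicitly accessing the mapping in the
context of a locale" could look like, the Turkish/Azeri rules quoted
above can be applied by a function that takes the language as an
argument instead of consulting the C lib locale. The function names are
hypothetical, and the After_I and Not_Before_Dot conditions are
simplified to look only at the directly adjacent code point, which is
less than the full conditions require.

```python
DOT_ABOVE = "\u0307"        # COMBINING DOT ABOVE
CAPITAL_I_DOT = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE
SMALL_DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I
TURKIC = ("tr", "az")


def lower_with_language(text, lang):
    """Lowercase text; the language is passed in explicitly."""
    if lang not in TURKIC:
        return text.lower()
    out = []
    for pos, ch in enumerate(text):
        before = text[pos - 1] if pos > 0 else ""
        after = text[pos + 1] if pos + 1 < len(text) else ""
        if ch == CAPITAL_I_DOT:
            out.append("i")                      # 0130 -> 0069
        elif ch == DOT_ABOVE and before == "I":
            continue                             # After_I: drop the dot
        elif ch == "I":
            # Not_Before_Dot: plain I becomes dotless i, I + dot becomes i
            out.append("i" if after == DOT_ABOVE else SMALL_DOTLESS_I)
        else:
            out.append(ch.lower())
    return "".join(out)


def upper_with_language(text, lang):
    """Uppercase text; in Turkish and Azeri, i becomes dotted capital I."""
    if lang not in TURKIC:
        return text.upper()
    return "".join(CAPITAL_I_DOT if ch == "i" else ch.upper() for ch in text)


print(lower_with_language("ISPARTA", "en"))  # isparta
print(lower_with_language("ISPARTA", "tr"))  # first letter is dotless i
print(upper_with_language("izmir", "tr"))    # both i's get a dot above
```

The same input gives different results depending only on the lang
argument; the process locale is never looked at.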

To make things even more complicated, there are so-called
language tags which can be embedded into the Unicode string,
so the language can also change within a Unicode string.

	http://www.unicode.org/reports/tr7/
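
A small sketch of what this means for processing: the tag characters
live in plane 14 (U+E0001 LANGUAGE TAG, U+E0020..U+E007E mirroring
ASCII, U+E007F CANCEL TAG), so a string has to be split into runs
before any language-dependent mapping can be applied. The function name
split_by_language and the handling of the default language are
assumptions made for this example, not an existing API.

```python
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag characters mirroring ASCII


def split_by_language(text, default=None):
    """Return a list of (language, run) pairs with the tag characters removed."""
    runs = []
    lang = default
    pending = None      # letters of a tag currently being spelled out
    buf = []

    def flush():
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            flush()
            pending = []
        elif TAG_FIRST <= cp <= TAG_LAST and pending is not None:
            pending.append(chr(cp - 0xE0000))    # map back to ASCII
            lang = "".join(pending)
        elif cp == CANCEL_TAG:
            flush()
            lang = default
            pending = None
        else:
            pending = None
            buf.append(ch)
    flush()
    return runs


# "I", then a "tr" language tag, "I", a cancel tag, and "I" again
tagged = "I\U000E0001\U000E0074\U000E0072I\U000E007FI"
print(split_by_language(tagged, default="en"))
```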

To get a feeling for what it takes to do locale-aware handling
of Unicode right, have a look at the Locale Data Markup
Language (LDML):

	http://www.unicode.org/reports/tr35/

(hey, perhaps Google could contribute support for this to Python ;-)

-- 
Marc-Andre Lemburg
eGenix.com
