Can upper() or lower() ever change the length of a string?
MRAB
python at mrabarnett.plus.com
Mon May 24 10:42:06 EDT 2010
Mark Dickinson wrote:
> On May 24, 1:13 pm, Steven D'Aprano <st... at REMOVE-THIS-
> cybersource.com.au> wrote:
>> Do unicode.lower() or unicode.upper() ever change the length of the
>> string?
>>
>> The Unicode standard allows for case conversions that change length, e.g.
>> sharp-S in German should convert to SS:
>>
>> http://unicode.org/faq/casemap_charprop.html#6
>>
>> but I see that Python doesn't do that:
>>
>> >>> s = "Paßstraße"
>> >>> s.upper()
>> 'PAßSTRAßE'
>>
>> The more I think about this, the more I think that upper/lower/title case
>> conversions should change length (at least sometimes) and if Python
>> doesn't do so, that's a bug. Any thoughts?
>
> Digging a bit deeper, it looks like these methods are using the
> Simple_{Upper,Lower,Title}case_Mapping functions described at
> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
> of the unicode data; you can see this in the source in Tools/unicode/
> makeunicodedata.py, which is the Python code that generates the
> database of unicode properties. It contains code like:
>
>     if record[12]:
>         upper = int(record[12], 16)
>     else:
>         upper = char
>     if record[13]:
>         lower = int(record[13], 16)
>     else:
>         lower = char
>     if record[14]:
>         title = int(record[14], 16)
>
> ... and so on.
>
> I agree that it might be desirable for these operations to produce the
> multicharacter equivalents. That idea looks like a tough sell,
> though: apart from backwards compatibility concerns (which could
> probably be worked around somehow), it looks as though it would
> require significant effort to implement.
>
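To make the multicharacter idea concrete, here is a rough sketch of how a
full-mapping upper() might look at the Python level. The name full_upper
and the two-entry SPECIAL_UPPER table are made up for illustration; the
real data lives in Unicode's SpecialCasing.txt, and this is not how the
interpreter implements upper():

```python
# Hypothetical, hand-picked subset of one-to-many uppercase mappings.
# A real implementation would generate this table from SpecialCasing.txt.
SPECIAL_UPPER = {
    "\u00df": "SS",  # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",  # LATIN SMALL LIGATURE FI
}


def full_upper(text):
    """Uppercase text, letting some characters expand to several."""
    # Check the special table first, then fall back to the built-in
    # per-character mapping for everything else.
    return "".join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)
```

With this, full_upper("Paßstraße") gives 'PASSSTRASSE', which is two
characters longer than the input.
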
If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').
For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "İ" (the uppercase version of lowercase dotted
i is uppercase dotted I).
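
Something along these lines is what I have in mind, as a sketch only. The
function upper_for_locale, its 'lang' parameter and the LOCALE_UPPER table
are all hypothetical; nothing like this exists on str or unicode today:

```python
# Hypothetical per-language overrides, consulted before the default mapping.
TURKIC_UPPER = {
    "i": "\u0130",  # dotted i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",  # dotless i -> plain capital I
}
LOCALE_UPPER = {"tr": TURKIC_UPPER, "az": TURKIC_UPPER}


def upper_for_locale(text, lang=None):
    """Uppercase text, applying language-specific overrides if any."""
    # Unknown or missing languages get an empty override table, so the
    # result is just the default per-character uppercase mapping.
    overrides = LOCALE_UPPER.get(lang, {})
    return "".join(overrides.get(ch, ch.upper()) for ch in text)
```

Called without a language it behaves like the plain mapping, so
upper_for_locale("i") is "I", while upper_for_locale("i", "tr") is "İ".
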