Python Unicode handling wins again -- mostly
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Nov 29 23:21:49 EST 2013
On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> In article <529934dc$0$29993$c3e8da3$5496439d at news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
>
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>>
>> Like most other languages, Python 3.2 fails:
>>
>> py> 'baffle'.upper()
>> 'BAfflE'
You edited my text to remove the ligature? That's... unfortunate.
>> but Python 3.3 passes:
>>
>> py> 'baffle'.upper()
>> 'BAFFLE'
>
> I disagree.
>
> The whole idea of ligatures like fi is purely typographic.
In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures. For example, would you consider that ampersand & to be a
ligature? These days, I would consider & to be a distinct character, but
originally it began as a ligature for "et" (Latin for "and").
But let's skip such corner cases, as they provide much heat but no
illumination, and I'll agree that when it comes to ligatures like fl, fi
and ffl, they are purely typographic.
> The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).
>
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do. Thus,
> I would argue that it's wrong to say that calling upper() on an ffl
> ligature should yield FFL.
Your conclusion doesn't follow from the argument you are making. Since
the ffl ligature ﬄ is purely a typographical feature, it should
uppercase to FFL (there being no typographic feature for an uppercase
FFL ligature).
Consider the examples shown above, where you or your software
unfortunately edited out the ligature and replaced it with ASCII "ffl".
Or perhaps I should say *fortunately*, since it demonstrates the problem.
Since we agree that the ffl ligature is merely a typographic artifact of
some type designer's whimsy, we can expect that the word "baffle" is
semantically exactly the same as the word "baﬄe". How foolish Python
would look if it did this:
py> 'baffle'.upper()
'BAfflE'
Replace the 'ffl' with the ligature, and the conclusion remains:
py> 'baﬄe'.upper()
'BAﬄE'
would be equally wrong.
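Since mail software has a habit of flattening the ligature, here is a minimal sketch that spells it as an escape so the check survives transport. It assumes Python 3.3 or later, and the variable names are only illustrative.

```python
# U+FB04 is LATIN SMALL LIGATURE FFL. Writing it as an escape keeps
# the example intact even if the message is converted to ASCII.
ligature_word = "ba\ufb04e"
ascii_word = "baffle"

# The ligature is a single code point, so the two strings differ in length.
print(len(ascii_word), len(ligature_word))

# Under the full case mappings used by Python 3.3+, both spellings
# uppercase to the same six ASCII letters.
print(ascii_word.upper())
print(ligature_word.upper())
print(ascii_word.upper() == ligature_word.upper())
```

On Python 3.2 and older the ligature passes through upper() unchanged, which is the failure described above.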
Now, I accept that this picture isn't entirely black and white. For
example, we might argue that if ﬄ is purely typographical in nature,
surely we would also want 'baffle' == 'baﬄe' too? Or maybe not. This
indicates that capturing *all* the rules for text across the many
languages, writing systems and conventions is impossible.
There are some circumstances where we would want 'baffle' and 'baﬄe' to
compare equal, and others where we would want them to compare unequal.
Python gives us both:
py> import unicodedata
py> "baffle" == "baﬄe"
False
py> "baffle" == unicodedata.normalize("NFKC", "baﬄe")
True
but frankly I'm baffled *wink* that you think there are any circumstances
where you would want the uppercase of ﬄ to be anything but FFL.
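The two kinds of comparison can be wrapped up side by side. This is only a sketch, assuming Python 3.3 or later (for str.casefold); the helper name loosely_equal is hypothetical, and normalising then case-folding is a simplification of the full caseless matching described in the Unicode standard.

```python
import unicodedata

def loosely_equal(a, b):
    """Hypothetical helper: compare two strings while ignoring
    compatibility differences (such as ligatures) and case."""
    def key(s):
        # NFKC replaces compatibility characters like U+FB04 with their
        # plain equivalents; casefold() then removes case distinctions.
        return unicodedata.normalize("NFKC", s).casefold()
    return key(a) == key(b)

plain = "baffle"
with_ligature = "ba\ufb04e"   # "ba" + LATIN SMALL LIGATURE FFL + "e"

print(plain == with_ligature)                 # strict, code point comparison
print(loosely_equal(plain, with_ligature))    # compatibility comparison
print(loosely_equal("BAFFLE", with_ligature))
```

The strict == stays available for cases where the exact code points matter, such as checking whether text has been altered in transit.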
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.
You would expect wrongly. You are over-generalising from English, and if
you include ligatures and other special cases, not even all of English.
See, for example:
http://www.unicode.org/faq/casemap_charprop.html#7a
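The claim is easy to test by brute force. The sketch below, assuming Python 3.3 or later, walks every code point and collects those for which x.lower() differs from x.upper().lower(); the names breaks_roundtrip and counterexamples are made up for the illustration.

```python
import sys

def breaks_roundtrip(ch):
    # True when the proposed identity x.lower() == x.upper().lower() fails.
    return ch.lower() != ch.upper().lower()

# Check every code point the interpreter supports.
counterexamples = [
    chr(cp) for cp in range(sys.maxunicode + 1) if breaks_roundtrip(chr(cp))
]

print(len(counterexamples), "code points break the identity")
print("\u00df" in counterexamples)   # LATIN SMALL LETTER SHARP S
print("\ufb04" in counterexamples)   # LATIN SMALL LIGATURE FFL
```

The exact count depends on the version of the Unicode database shipped with the interpreter, so it is not quoted here.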
Apart from ligatures, some examples of troublesome characters with regard
to case are:
* German Eszett (sharp-S) ß can be uppercased to SS, SZ or ẞ depending
on context, particularly when dealing with placenames and family names.
(That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
least the 1930s, although the official rules of German orthography
still insist on uppercasing ß to SS.)
* The English long-s ſ is uppercased to regular S.
* Turkish dotted and dotless I (İ and i, I and ı) use the same Latin
letters I and i but the case conversion rules are different.
* Both the Greek sigma σ and final sigma ς uppercase to Σ.
That last one is especially interesting: Python 3.3 gets it right, while
older Pythons do not. In Python 3.2:
py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'
while in 3.3 it roundtrips correctly:
py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'
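The characters listed above can be poked at directly. This is a minimal sketch assuming Python 3.3 or later; the variable names are illustrative, and the characters are written as escapes so they survive re-encoding. Note that Python's string methods are not locale-aware, so they apply the default Unicode mappings rather than the Turkish rules.

```python
eszett = "\u00df"            # LATIN SMALL LETTER SHARP S
capital_eszett = "\u1e9e"    # LATIN CAPITAL LETTER SHARP S
long_s = "\u017f"            # LATIN SMALL LETTER LONG S
sigma = "\u03c3"             # GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"       # GREEK SMALL LETTER FINAL SIGMA
capital_sigma = "\u03a3"     # GREEK CAPITAL LETTER SIGMA
dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_i = "\u0131"         # LATIN SMALL LETTER DOTLESS I

# The default mapping uppercases sharp S to two letters.
print(eszett.upper() == "SS")
# The capital sharp S still lowercases to the ordinary one.
print(capital_eszett.lower() == eszett)
# Long s uppercases to a regular S.
print(long_s.upper() == "S")
# Both sigmas share one capital form.
print(sigma.upper() == final_sigma.upper() == capital_sigma)
# Default (non-Turkish) rules: dotless i uppercases to plain I, and the
# dotted capital lowercases to "i" plus COMBINING DOT ABOVE.
print(dotless_i.upper() == "I")
print(dotted_capital_i.lower() == "i\u0307")

# Final sigma is restored when an uppercased word is lowercased again.
word = "\u03bf\u03b4\u03c5\u03c3\u03c3\u03b5\u03c5\u03c2"  # ends in final sigma
print(word.upper().lower() == word)
```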
So... case conversions are not as simple as they appear at first glance.
They aren't always reversible, nor do they always roundtrip. Titlecase is
not necessarily the same as "uppercase the first letter and lowercase the
rest". Case conversions can be context or locale sensitive.
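Two of those points can be shown in a few lines. The following is a sketch assuming Python 3.3 or later; naive_title is a hypothetical helper that implements the "uppercase the first letter and lowercase the rest" rule for a single word, and the sample word is just an arbitrary string starting with a digraph letter.

```python
def naive_title(word):
    """Hypothetical helper: uppercase the first character, lowercase the rest."""
    return word[:1].upper() + word[1:].lower()

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has three case forms:
# lowercase U+01C6, titlecase U+01C5 and uppercase U+01C4.
sample = "\u01c6arko"

print(sample.title() == naive_title(sample))   # the two rules disagree
print(ascii(sample.title()))                   # starts with the titlecase form
print(ascii(naive_title(sample)))              # starts with the uppercase form

# Case conversion is not always reversible: sharp S does not come back.
sharp_s = "\u00df"
print(ascii(sharp_s.upper().lower()))
```

For plain ASCII words the two rules agree, which is probably why the shortcut feels safe.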
Anyway... even if you disagree with everything I have said, it is a fact
that Python has committed to following the Unicode standard, and the
Unicode standard requires that certain ligatures, including FFL, FL and
FI, are decomposed when converted to uppercase.
--
Steven