[Python-Dev] PEP 393 Summer of Code Project

Stefan Behnel stefan_ml at behnel.de
Thu Sep 1 19:04:34 CEST 2011


Guido van Rossum, 01.09.2011 18:31:
> On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou wrote:
>> Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :
>>> This is definitely thought of as a separate
>>> mark added to the e; ë is not a new letter. I have a feeling it's the
>>> same way for the French and Germans, but I really don't know.
>>> (Antoine? Georg?)
>>
>> Indeed, they are not separate "letters" (they are considered the same in
>> lexicographic order, and the French alphabet has 26 letters).

So does the German alphabet, at least when you don't count "ß", which 
basically descended from a ligature of the old German way of writing "sz", 
where the "s" looked similar to an "f" and the "z" had a low-hanging tail.

IIRC, German Umlaut letters are lexicographically sorted according to their 
emergency replacement spelling ("ä" -> "ae"), which is also sometimes used 
in all upper case words ("Glück" -> "GLUECK"). I guess that's because 
Umlaut dots are harder to see on top of upper case letters. So, Latin-1 
byte value sorting always yields totally wrong results.
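
To make the difference concrete, here is a minimal Python 3 sketch, not a 
real collation implementation. The helper names (REPLACEMENTS, 
replacement_key, upper_with_replacements) and the word list are made up for 
illustration, the mapping only covers the three Umlaut vowels, and it 
assumes the strings are in precomposed (NFC) form. Proper locale-aware 
sorting would need a real collation library.

```python
# Hypothetical mapping from Umlaut letters to their replacement spellings.
# Assumes precomposed (NFC) input; decomposed "a" + U+0308 is not handled.
REPLACEMENTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def replacement_key(word):
    """Sort key: expand Umlaut letters to the replacement spelling, ignore case."""
    return word.translate(REPLACEMENTS).casefold()


def upper_with_replacements(word):
    """Upper-case a word using the replacement spelling instead of dots."""
    return word.translate(REPLACEMENTS).upper()


words = ["Zug", "Äpfel", "Apfel", "Ofen", "Öl"]

# Sorting by Latin-1 byte value pushes every Umlaut word behind "Z".
by_latin1 = sorted(words, key=lambda w: w.encode("latin-1"))

# Sorting by replacement spelling keeps "Äpfel" next to "Apfel".
by_replacement = sorted(words, key=replacement_key)

print(by_latin1)       # ['Apfel', 'Ofen', 'Zug', 'Äpfel', 'Öl']
print(by_replacement)  # ['Äpfel', 'Apfel', 'Öl', 'Ofen', 'Zug']
print(upper_with_replacements("Glück"))  # GLUECK
```

The key function only changes the comparison, so the sorted list still 
contains the original spellings.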

That aside, Umlaut letters are commonly considered separate letters, 
different from the undotted letters and also different from the replacement 
spellings. I, for one, always found the replacements rather weird and never 
got used to using them in upper case words. In any case, it's wrong to 
always use them, and it makes text harder to read.


>> But I'm not sure how it's relevant, because you can't remove an accent
>> without most likely making a spelling error, or at least changing the
>> meaning. Accents are very much part of the language (while ligatures
>> like "ff" are not, they are a rendering detail). So I would consider
>> "é", "ê", "ù", etc. atomic characters for the purpose of processing
>> French text. And I don't see how a decomposed form could help an
>> application.
>
> I recall long ago that when the French wrote words in all caps they
> would drop the accents, e.g. ECOLE. I even recall (through the mists
> of time) observing this in Paris on public signs. Is this still the
> convention?

Yes, and it's a huge problem when trying to pronounce last names. In 
French, you'd commonly write

LASTNAME, Firstname

and if LASTNAME happens to have accented letters, you'd miss them when 
reading that. I know a couple of French people who severely suffer from 
this, because without the accents their name is pronounced quite 
differently and takes on a totally different meaning.

Stefan


