[Tutor] ignoring diacritical signs

Mark Lawrence breamoreboy at yahoo.co.uk
Mon Dec 2 19:00:17 CET 2013


On 02/12/2013 15:53, Steven D'Aprano wrote:
> On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
>> Hi,
>>
>> I created the code below because I want to compare two fields while
>> ignoring the diacritical signs.
>
> Why would you want to do that? That's like comparing two fields while
> ignoring the difference between "e" and "i", or "s" and "z", or "c" and
> "k". Or indeed between "s", "z", "c" and "k".
>
> *only half joking*
>
>
> I think the right way to ignore diacritics and other combining marks is
> with a function like this:
>
> import unicodedata
>
> def strip_marks(s):
>      decomposed = unicodedata.normalize('NFD', s)
>      base_chars = [c for c in decomposed if not unicodedata.combining(c)]
>      return ''.join(base_chars)
>
>
> Example:
>
> py> strip_marks("I will coöperate with Müller's résumé mañana.")
> "I will cooperate with Muller's resume manana."
>
>
> Beware: stripping accents may completely change the meaning of the word
> in many languages! Even in English, stripping the accents from "résumé"
> makes the word ambiguous (do you mean a CV, or the verb to start
> something again?). In other languages, stripping accents may completely
> change the word, or even turn it into nonsense.
>
> For example, I understand that in Danish, å is not the letter a with a
> circle accent on it, but a distinct letter of the alphabet which should
> not be touched. And I haven't even considered non-Western European
> languages, like Greek, Polish, Russian, Arabic, Hebrew...

You've actually shown a perfect example above.  In Spanish, ñ is a 
letter in its own right, and stripping the tilde has turned it into the 
quite distinct letter n, so "mañana" has become "manana".  And let's 
not go here: 
http://spanish.about.com/b/2010/11/29/two-letters-dropped-from-spanish-alphabet.htm

We should just stick with English, as we all know that's easy, don't 
we? http://www.i18nguy.com/chaos.html :)
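
If some letters have to survive the stripping, one possible approach is 
to protect them explicitly.  What follows is only a rough sketch, not a 
recommendation: the name strip_marks_except and the PRESERVE set are my 
own inventions for illustration, and the set holds just the two letters 
mentioned in this thread (ñ and å).  Which letters really count as 
distinct depends on the language of your data, so treat that set as an 
assumption to be replaced.

```python
import unicodedata

# Hypothetical set of letters to leave alone; chosen for illustration only.
PRESERVE = frozenset("ñÑåÅ")

def strip_marks_except(s, preserve=PRESERVE):
    """Strip combining marks, except on characters listed in preserve."""
    out = []
    # Compose first, so that "n" followed by U+0303 is seen as the single
    # character "ñ" and can be matched against the preserve set.
    for ch in unicodedata.normalize('NFC', s):
        if ch in preserve:
            out.append(ch)
        else:
            # Same idea as strip_marks, applied one character at a time.
            decomposed = unicodedata.normalize('NFD', ch)
            out.append(''.join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return ''.join(out)

print(strip_marks_except("I will coöperate with Müller's résumé mañana."))
```

With the default set this should leave "mañana" untouched while still 
turning "résumé" into "resume".  Passing an empty preserve set gives the 
same behaviour as the original strip_marks.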

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


