[Tutor] Regex question

Sun Apr 3 16:20:12 CEST 2011

Hugo Arts wrote:

> 2011/4/3 "Andrés Chandía" <andres at chandia.net>:
>>
>>
>> I continue working with RegExp, but I have reached a point for which I
>> can't find documentation. Maybe there is no possible way to do it, but
>> anyway, here is the question:
>>
>> This is my code:
>>
>> contents = re.sub(r'Á', "A", contents)
>> contents = re.sub(r'á', "a", contents)
>> contents = re.sub(r'É', "E", contents)
>> contents = re.sub(r'é', "e", contents)
>> contents = re.sub(r'Í', "I", contents)
>> contents = re.sub(r'í', "i", contents)
>> contents = re.sub(r'Ó', "O", contents)
>> contents = re.sub(r'ó', "o", contents)
>> contents = re.sub(r'Ú', "U", contents)
>> contents = re.sub(r'ú', "u", contents)
>>
>> It is clear that I need to convert any accented vowel into the same
>> unaccented vowel. The question is: is there a way to say that whenever
>> you find an accented character, it has to change into the unaccented
>> character? But not every character: it must be only these vowels,
>> accented this way, because in the language I am working with there are
>> letters like ü and ñ that should remain the same.
>>
> 
> Okay, first thing, forget about regexes for this problem. They're too
> complicated and not suited to it.
> 
> Encoding issues make this a somewhat complicated problem. In Unicode,
> there are two ways to encode most accented characters. For example, the
> character "Ć" can be encoded either as U+0106, "LATIN CAPITAL LETTER C
> WITH ACUTE", or as a combination of U+0043 and U+0301, being simply 'C'
> and the 'COMBINING ACUTE ACCENT', respectively. You must handle both
> forms to be sure every accented character is gone from your string.
> 
> Using unicode.translate, you can craft a translation table to
> translate the accented characters to their non-accented counterparts.
> The combining characters can simply be removed by mapping them to
> None.
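
The translation-table idea quoted above could look roughly like the sketch below. It is written for Python 3, where str.translate plays the role of unicode.translate; the names ACCENTED, PLAIN, TABLE and strip_acute are made up for illustration. It assumes that only the ten acute-accented vowels from the original question need replacing, and that the combining acute accent (U+0301) can be dropped wherever it occurs.

```python
# Precomposed acute-accented vowels and their plain counterparts
# (illustrative names; only these ten characters are replaced).
ACCENTED = "ÁáÉéÍíÓóÚú"
PLAIN = "AaEeIiOoUu"

# str.maketrans(x, y, z): map each character of x to the character at the
# same position in y, and map every character of z to None (i.e. delete it).
# U+0301 is COMBINING ACUTE ACCENT, used by the decomposed form.
TABLE = str.maketrans(ACCENTED, PLAIN, "\u0301")

def strip_acute(text):
    """Remove acute accents from vowels; ü and ñ are not in the table."""
    return text.translate(TABLE)

print(strip_acute("canción, pingüino, año"))  # precomposed input
print(strip_acute("cancio\u0301n"))           # decomposed input: o + U+0301
```

Because ü and ñ are not in the table, and their combining marks (U+0308 and U+0303) are not deleted, they pass through unchanged in either encoding.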

If you go down that road you might be interested in Fredrik Lundh's article at

http://effbot.org/zone/unicode-convert.htm

The class presented there is a bit tricky, but for your purpose it might be 
sufficient to subclass it:

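>>> # unaccented_map is the dict subclass from the article linked above
>>> KEEP_CHARS = set(ord(c) for c in u"üñ")
>>> class Map(unaccented_map):
...     def __missing__(self, key):
...         # Map protected characters to themselves and cache the result
...         if key in KEEP_CHARS:
...             self[key] = key
...             return key
...         return unaccented_map.__missing__(self, key)
...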
>>> print u"äöü".translate(Map())
aoü