[Tutor] Regex question

Hugo Arts hugo.yoshi at gmail.com
Sun Apr 3 15:03:30 CEST 2011


2011/4/3 "Andrés Chandía" <andres at chandia.net>:
>
>
> I continue working with RegExp, but I have reached a point for which I can't find
> documentation. Maybe there is no way to do it, but I'll ask the question anyway:
>
> This is my code:
>
>     contents = re.sub(r'Á', "A", contents)
>     contents = re.sub(r'á', "a", contents)
>     contents = re.sub(r'É', "E", contents)
>     contents = re.sub(r'é', "e", contents)
>     contents = re.sub(r'Í', "I", contents)
>     contents = re.sub(r'í', "i", contents)
>     contents = re.sub(r'Ó', "O", contents)
>     contents = re.sub(r'ó', "o", contents)
>     contents = re.sub(r'Ú', "U", contents)
>     contents = re.sub(r'ú', "u", contents)
>
> It is clear that I need to convert any accented vowel into the same unaccented vowel.
> The question is: is there a way to say that whenever you find an accented character,
> it has to change into a non-accented character? Not every character, though: it must
> be only these vowels, accented this way, because in the language I am working with
> there are letters like ü and ñ that should remain the same.
>

Okay, first thing: forget about regexes for this problem. They're too
complicated and not suited to it.

Encoding issues make this a somewhat complicated problem. In Unicode,
there are two ways to encode most accented characters. For example, the
character "Ć" can be encoded either as the single code point U+0106,
"LATIN CAPITAL LETTER C WITH ACUTE", or as the sequence U+0043 U+0301,
being simply 'C' and the 'COMBINING ACUTE ACCENT', respectively. You
must handle both forms to be sure every accented character is gone
from your string.
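
To make the two forms concrete, here is a small sketch. It assumes
Python 3, where every str is Unicode, and uses the standard
unicodedata module to convert between the forms. The variable names
are only for illustration.

```python
import unicodedata

precomposed = "\u0106"   # LATIN CAPITAL LETTER C WITH ACUTE, one code point
decomposed = "C\u0301"   # 'C' followed by COMBINING ACUTE ACCENT, two code points

# Both render as the same glyph, but they are different strings.
forms_differ = precomposed != decomposed

# NFC composes base + combining mark; NFD splits the precomposed form apart.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

A comparison or replacement that only knows about one of the two
forms will silently miss the other.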

Using unicode.translate, you can craft a translation table that maps
the accented characters to their non-accented counterparts. The
combining characters can simply be removed by mapping them to None.
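
A minimal sketch of such a table, assuming Python 3 (where the method
is str.translate and str.maketrans builds the table; in Python 2 you
would build a dict of ordinals for unicode.translate instead). The
function name strip_acute is made up for this example. Only the ten
vowels from your code are listed, so ü and ñ pass through untouched.

```python
# Precomposed acute-accented vowels and their plain counterparts.
ACCENTED = "ÁáÉéÍíÓóÚú"
PLAIN = "AaEeIiOoUu"

# The third argument lists characters to delete (they are mapped to None).
# U+0301 is COMBINING ACUTE ACCENT, which covers the decomposed form.
TABLE = str.maketrans(ACCENTED, PLAIN, "\u0301")

def strip_acute(text):
    """Replace acute-accented vowels with plain vowels; leave the rest alone."""
    return text.translate(TABLE)

print(strip_acute("Andrés Chandía, pingüino, niño"))
```

Note that deleting U+0301 removes the combining acute accent after any
letter, not only after vowels. If that matters for your text, normalize
to the precomposed form first and leave the combining character out of
the table.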

HTH,
Hugo

