unicode issue

Thu Oct 1 12:03:38 EDT 2009

On 01.10.09 17:50, Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald
> <walter at livinglogic.de> wrote:
> 
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone <tibor.b... at hotmail.com> wrote:
>>>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though CPython is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map.  I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using unicode.translate, which takes a dict mapping code points to
>>> replacements (string.maketrans only builds tables for byte strings):
>>>
>>> def downcode(name):
>>>     table = dict(zip(map(ord, u'ÀÁÂÃÄÅ...'),
>>>                      u'AAAAAA...'))
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>    return unicodedata.normalize("NFD", name)\
>>           .encode("ascii", "ignore")\
>>           .decode("ascii")
>>
>> Servus,
>>    Walter
> 
> As I understand it, the "ignore" argument to str.encode *removes* the
> unencodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?

Yes, but any accented characters have already been split into the base
character and the combining accent by normalize(), so only the accent
gets removed. Of course, non-decomposable characters will be removed
completely. To keep them, it would be possible to replace

   .encode("ascii", "ignore").decode("ascii")

with a filter over the normalized string that drops only the combining
marks (the result is then no longer guaranteed to be pure ASCII),
something like this:

   name = unicodedata.normalize("NFD", name)
   u"".join(c for c in name if unicodedata.category(c) != "Mn")
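To make the difference concrete, here is a minimal sketch of both
variants side by side. The function names downcode_ascii and
downcode_keep are made up for this illustration, and the second one
assumes the intent is to drop the combining marks, i.e. to keep every
character whose category is not "Mn":

```python
import unicodedata

def downcode_ascii(name):
    # Hypothetical name. Decompose, then throw away everything that is
    # not ASCII: combining accents, but also characters that have no
    # decomposition at all.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def downcode_keep(name):
    # Hypothetical name. Decompose, then drop only the combining marks
    # (category "Mn"); non-decomposable characters pass through, so the
    # result may still contain non-ASCII characters.
    decomposed = unicodedata.normalize("NFD", name)
    return u"".join(c for c in decomposed
                    if unicodedata.category(c) != "Mn")
```

For u"Žabovitá zmiešaná kaša" both return u'Zabovita zmiesana kasa'.
They differ on a string like u"Łódź", because "Ł" has no decomposition:
downcode_ascii gives u'odz', while downcode_keep gives u'Łodz'.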

Servus,
   Walter