unicode issue

Thu Oct 1 12:01:11 EDT 2009

Rami Chowdhury wrote:

> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald <walter at livinglogic.de>
> wrote:
> 
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone <tibor.b... at hotmail.com> wrote:
>>>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though CPython is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map.  I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using unicode.translate:
>>>
>>> def downcode(name):
>>>     # unicode.translate expects a dict keyed by code point;
>>>     # string.maketrans only builds tables for byte strings
>>>     table = dict(zip(map(ord, u'ÀÁÂÃÄÅ...'), u'AAAAAA...'))
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>    return unicodedata.normalize("NFD", name)\
>>           .encode("ascii", "ignore")\
>>           .decode("ascii")
>>
>> Servus,
>>    Walter
> 
> As I understand it, the "ignore" argument to str.encode *removes* the
> unencodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?

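You didn't take the normalization step into consideration. NFD
decomposes each accented character into its base letter followed by
combining marks, so "ignore" drops only the marks and keeps the base
letter. Example: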

>>> import unicodedata
>>> s = u"Ä"
>>> unicodedata.normalize("NFD", s)
u'A\u0308'
>>> _.encode("ascii", "ignore")
'A'
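
For anyone reading this on Python 3, the same idea can be sketched as
below. This is only an illustration of the normalize-then-encode
approach from this thread, written with Python 3 string literals (no
u"" prefix); the intermediate variable name is my own.

```python
import unicodedata

def downcode(name):
    # NFD splits each accented letter into a base letter plus
    # one or more combining marks (e.g. "Ä" -> "A" + U+0308).
    decomposed = unicodedata.normalize("NFD", name)
    # "ignore" drops every code point that is not ASCII, which
    # here means the combining marks; the base letters survive.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(downcode("Žabovitá zmiešaná kaša"))  # Zabovita zmiesana kasa
```

Note that characters with no canonical decomposition, such as "ł" or
"ß", are dropped entirely rather than approximated, so an explicit
mapping like the _MAP in the original question would still be needed
for those.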