unicode issue
Peter Otten
__peter__ at web.de
Thu Oct 1 12:01:11 EDT 2009
Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald <walter at livinglogic.de>
> wrote:
>
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone <tibor.b... at hotmail.com> wrote:
>>>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though CPython is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map. I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using string.translate:
>>>
>>> import string
>>> def downcode(name):
>>>     table = string.maketrans(
>>>         'ÀÁÂÃÄÅ...',
>>>         'AAAAAA...')
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>     return unicodedata.normalize("NFD", name)\
>>         .encode("ascii", "ignore")\
>>         .decode("ascii")
>>
>> Servus,
>> Walter
>
> As I understand it, the "ignore" argument to str.encode *removes* the
> unencodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?
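You didn't take the normalization step into consideration. NFD splits an
accented character into its base letter plus a combining mark, so "ignore"
drops only the mark and keeps the letter. Example: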
>>> import unicodedata
>>> s = u"Ä"
>>> unicodedata.normalize("NFD", s)
u'A\u0308'
>>> _.encode("ascii", "ignore")
'A'
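
One caveat: a character with no canonical decomposition (for example "ø" or
"ß") survives NFD unchanged and is then removed entirely by "ignore". A
minimal sketch of one way to cover that case, written for Python 3 rather
than the 2.6 used in this thread. The FALLBACK table is hypothetical and its
entries are illustrative only, since the original _MAP was snipped:

```python
import unicodedata

# Hypothetical fallback table for characters that NFD cannot split.
# The thread's real _MAP was snipped, so these entries are illustrative only.
FALLBACK = {
    "ß": "ss",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ł": "l", "Ł": "L",
}

def downcode(name):
    # First substitute the characters that have no decomposition.
    name = "".join(FALLBACK.get(c, c) for c in name)
    # Then split accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", name)
    # "ignore" drops the non-ASCII combining marks, leaving the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(downcode("Žabovitá zmiešaná kaša"))
print(downcode("Straße"))
```

Anything that is neither decomposable nor listed in the fallback table is
still dropped silently, so the table has to be extended for the input you
actually expect.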