unicode issue
Walter Dörwald
walter at livinglogic.de
Thu Oct 1 12:03:38 EDT 2009
On 01.10.09 17:50, Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald
> <walter at livinglogic.de> wrote:
>
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone <tibor.b... at hotmail.com> wrote:
>>>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though CPython is pretty well optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map. I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using string.translate:
>>>
>>> import string
>>> def downcode(name):
>>>     table = string.maketrans(
>>>         'ÀÁÂÃÄÅ...',
>>>         'AAAAAA...')
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>     return unicodedata.normalize("NFD", name)\
>>         .encode("ascii", "ignore")\
>>         .decode("ascii")
>>
>> Servus,
>> Walter
>
> As I understand it, the "ignore" argument to str.encode *removes* the
> unencodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?
Yes, but normalize() has already split each accented character into its
base character and a combining accent, so only the accent gets removed.
Of course non-decomposable characters will be removed completely, but it
would be possible to replace
.encode("ascii", "ignore").decode("ascii")
with something like this:
u"".join(c for c in name if unicodedata.category(c) != "Mn")
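To make the difference between the two variants concrete, here is a minimal sketch. It is written for Python 3, where str is already Unicode (on Python 2.6 the literals would need a u prefix). The function names strip_by_encoding and strip_combining_marks are made up for this illustration and do not come from the thread.

```python
import unicodedata


def strip_by_encoding(name):
    # NFD splits e.g. "Ž" into "Z" plus a combining caron; encoding to ASCII
    # with "ignore" then drops every non-ASCII code point, including letters
    # that have no decomposition at all.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_combining_marks(name):
    # Same decomposition, but only nonspacing marks (category "Mn") are
    # dropped, so non-decomposable letters are kept unchanged.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


if __name__ == "__main__":
    for sample in ("Žabovitá zmiešaná kaša", "Łódź"):
        print(strip_by_encoding(sample), "|", strip_combining_marks(sample))
```

Both variants agree on the Slovak example. They differ on a letter like "Ł", which has no canonical decomposition: the encode/decode variant drops it entirely, while the category filter keeps it as a non-ASCII character.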
Servus,
Walter