Unicode: matching a word and unaccenting characters
rndblnch at gmail.com
Wed Nov 14 20:33:00 EST 2007
On Nov 15, 1:21 am, Jeremie Le Hen <jere... at le-hen.org> wrote:
> (Mail resent with the proper subject.)
>
> Hi list,
>
> (Please Cc: me when replying, as I'm not subscribed to this list.)
I don't know your e-mail address, so I hope you will come back and check
the list archive...
> I'm working with Unicode strings to handle accented characters but I'm
> experiencing a few problems.
[skipped first question]
> Secondly, I need to translate accented characters to their unaccented
> form. I've written this function (sorry if the code isn't as efficient
> as possible, I'm not a long-time Python programmer, feel free to correct
> me, I'll be glad to learn anything):
>
> % def unaccent(s):
> %     """
> %     """
> %
> %     if not isinstance(s, types.UnicodeType):
> %         return s
> %     singleletter_re = re.compile(r'(?:^|\s)([A-Z])(?:$|\s)')
> %     result = ''
> %     for l in s:
> %         desc = unicodedata.name(l)
> %         m = singleletter_re.search(desc)
> %         if m is None:
> %             result += str(l)
> %             continue
> %         result += m.group(1).lower()
> %     return result
> %
>
> But I don't feel comfortable with it. It strongly depends on the UCD
> file format, and names that don't contain a single letter obviously
> cannot all be converted to ASCII. How would you implement this
> function?
my 2 cents:
<unaccent.py>
# -*- coding: utf-8 -*-
import unicodedata

def unaccent(s):
    u"""
    >>> unaccent(u"Ça crée déjà l'évènement")
    "Ca cree deja l'evenement"
    """
    s = unicodedata.normalize('NFD', unicode(s.encode("utf-8"),
                                             encoding="utf-8"))
    return "".join(b for b in s.encode("utf-8") if ord(b) < 128)

def _test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    import sys
    sys.exit(_test())
</unaccent.py>
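for readers on Python 3, where unicode() is gone and iterating over bytes
yields integers, a rough equivalent might look like the sketch below. the
name unaccent_py3 is made up for this note. like the version above, it
silently drops every character that has no ASCII base letter after
decomposition (for example ß), so it is not a general transliteration.

```python
import unicodedata

def unaccent_py3(s):
    """Sketch: strip accents by NFD-decomposing and discarding non-ASCII."""
    # NFD splits each accented letter into a base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    # errors="ignore" drops anything that is not ASCII, including the marks
    # (and, as an accepted limitation, letters with no decomposition).
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(unaccent_py3("\u00c7a cr\u00e9e d\u00e9j\u00e0 l'\u00e9v\u00e8nement"))
```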
> Thank you for your help.
you are welcome.
(left to the reader:
- why does it work?
- why does doctest work?)
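hint for the first question, as a small sketch in Python 3 syntax
(describe_nfd is a hypothetical helper, not part of the script above):
NFD splits an accented letter into a base letter plus a combining mark,
and the combining mark takes more than one byte in UTF-8, so each of its
bytes is >= 128 and gets filtered out.

```python
import unicodedata

def describe_nfd(ch):
    """Return (character name, UTF-8 byte length) for each code point
    in the NFD form of ch."""
    return [(unicodedata.name(c), len(c.encode("utf-8")))
            for c in unicodedata.normalize("NFD", ch)]

# U+00E9 is the precomposed "e with acute accent".
for name, nbytes in describe_nfd("\u00e9"):
    print(name, nbytes)
# LATIN SMALL LETTER E 1
# COMBINING ACUTE ACCENT 2
```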
renaud
> Regards,
> --
> Jeremie Le Hen
> < jlehen at clesys dot fr >
>
> ----- End forwarded message -----
>
> --
> Jeremie Le Hen
> < jlehen at clesys dot fr >