Unicode: matching a word and unaccenting characters
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Wed Nov 14 20:27:25 EST 2007
On Wed, 14 Nov 2007 21:21:55 -0300, Jeremie Le Hen <jeremie at le-hen.org>
wrote:
> (Please Cc: me when replying, as I'm not subscribed to this list.)
Not a good thing. *I* may CC you now, but any further replies and comments
from other people may leave the CC out. You can always browse this
newsgroup at Google http://groups.google.com/group/comp.lang.python or
Gmane http://dir.gmane.org/gmane.comp.python.general
> The first one is with regular expressions: I want to match a word
> composed of letters only. One can easily use '[a-zA-Z]+' when
> working in ascii, but unfortunately there is no equivalent when working
> with unicode strings: the latter doesn't match accented characters. The
> only means the re package provides is '\w' along with the re.UNICODE
> flag, but unfortunately it also matches digits and underscore. It
> appears there is no suitable solution for this currently. Am I right?
I think you're right, unfortunately.
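That said, a workaround that is sometimes used is a double negative:
take "not (non-word character, digit or underscore)", which leaves
roughly the letters. The sketch below assumes Python 3 str patterns
(where re.UNICODE is already the default and is kept only for clarity);
the names LETTERS_ONLY and find_words are made up for illustration. It
is an approximation, not an exact "letters only" class: a few numeric
characters that are not decimal digits (for example superscript digits)
still count as word characters and will match.

```python
import re

# [^\W\d_] means: a word character that is neither a digit nor "_".
# This is an approximation of "any Unicode letter", not an exact class.
LETTERS_ONLY = re.compile(r"[^\W\d_]+", re.UNICODE)

def find_words(text):
    """Return the runs of letter-like characters found in text."""
    return LETTERS_ONLY.findall(text)

print(find_words("Jérémie a 2 chats_noirs"))
```

Digits and underscores act as separators here, so "chats_noirs" is
split into two words.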
> Secondly, I need to translate accented characters to their unaccented
> form. I've written this function (sorry if the code isn't as efficient
> as possible, I'm not a long-time Python programmer, feel free to correct
> me, I'll be glad to learn anything):
It's hard to do it right - this is another version:
http://www.effbot.org/zone/unicode-convert.htm
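For comparison, a common minimal approach (not the recipe at the link
above, and not the function from the original message) is to decompose
each character with unicodedata.normalize and then drop the combining
marks. This is only a sketch, assuming Python 3 strings; strip_accents
is a hypothetical name. It shows part of why the problem is hard:
characters that have no canonical decomposition, such as "ø" or "Ł",
pass through unchanged and would need an explicit replacement table.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Jérémie"))
print(strip_accents("smørrebrød"))  # "ø" has no decomposition and is kept
```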
--
Gabriel Genellina