Python and Cyrillic characters in regular expression
MRAB
google at mrabarnett.plus.com
Thu Sep 4 13:46:39 EDT 2008
On Sep 4, 3:42 pm, phasma <xpa... at gmail.com> wrote:
> Hi, I'm trying to extract all alphabetic characters from a string.
>
> reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
You don't need both (?u) and re.UNICODE: they mean the same thing.
Note that [\w\s]+ matches more than alphabetic characters: \w covers
letters, digits and the underscore, and \s adds whitespace.
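As a rough illustration of both points, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default; the sample string and the letters-only pattern [^\W\d_]+ are my own choices for the example, not something from the original post.

```python
import re

# Both spellings request the same thing, so the compiled flags are equal.
inline = re.compile(r'(?u)[\w\s]+')
by_flag = re.compile(r'[\w\s]+', re.UNICODE)
same_flags = inline.flags == by_flag.flags

sample = 'ya я 42_x'  # hypothetical test string

# \w also accepts digits and the underscore, so the whole sample matches.
everything = inline.match(sample).group(0)

# Letters only: word characters that are neither digits nor the underscore.
letters_only = re.findall(r'[^\W\d_]+', sample)

print(same_flags, everything, letters_only)
```

If only alphabetic characters are wanted, a negated class along these lines may be closer to the goal than [\w\s]+.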
> buf = re.match(string)
>
> But it doesn't work. If the string starts with a Cyrillic character,
> everything works fine. But if the string starts with a Latin character,
> match returns only the Latin characters.
>
I'm encoding the Unicode results as UTF-8 in order to print them, but
I'm not having a problem with it otherwise:
Program
=======
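# -*- coding: utf-8 -*-
import re
# (?u) makes \w and \s Unicode-aware; a raw string keeps the backslashes intact.
reg = re.compile(r'(?u)([\w\s]+)')
# Latin first, then Cyrillic.
found = reg.match(u"ya я")
print found.group(1).encode("utf-8")
# Cyrillic first, then Latin.
found = reg.match(u"я ya")
print found.group(1).encode("utf-8")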
Output
======
ya я
я ya
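
The program above is written for Python 2. For anyone trying it on Python 3, a roughly equivalent sketch follows. It assumes str input (which is already Unicode there, so no u"" prefix, (?u) flag or explicit encoding is needed), and the helper name leading_run is hypothetical.

```python
import re

# str patterns are Unicode-aware by default in Python 3.
reg = re.compile(r'([\w\s]+)')

def leading_run(text):
    """Return the leading run of word characters and whitespace, or None."""
    found = reg.match(text)
    return found.group(1) if found else None

print(leading_run('ya я'))  # Latin first
print(leading_run('я ya'))  # Cyrillic first
```

Both calls should return the full input string, whichever script comes first.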