Python and Cyrillic characters in regular expression

MRAB google at mrabarnett.plus.com
Thu Sep 4 13:46:39 EDT 2008


On Sep 4, 3:42 pm, phasma <xpa... at gmail.com> wrote:
> Hi, I'm trying to extract all alphabetic characters from a string.
>
> reg = re.compile('(?u)([\w\s]+)', re.UNICODE)

You don't need both (?u) and re.UNICODE: they mean the same thing.

Note that [\w\s]+ will actually match word characters (letters, digits
and the underscore) plus whitespace, not only alphabetic characters.
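
The two points above can be checked with a short sketch. It assumes
Python 3, where str patterns are Unicode-aware by default; the sample
string and the variable names are made up for illustration, and the
[^\W\d_] class is only one common way to approximate "letters only"
(it can still admit a few numeric symbols that are not decimal digits).

```python
import re

# Made-up sample text mixing Latin, digits, underscore and Cyrillic.
sample = "abc_123 привет мир"

# The inline flag (?u) and the re.UNICODE argument set the same flag.
inline = re.compile(r"(?u)([\w\s]+)")
explicit = re.compile(r"([\w\s]+)", re.UNICODE)
same_flags = inline.flags == explicit.flags

# \w also covers digits and the underscore, so the whole sample matches.
word_and_space = inline.match(sample).group(1)

# Removing \d and _ from \w leaves (approximately) the letters.
letters_only = re.findall(r"[^\W\d_]+", sample)

print(same_flags)
print(word_and_space)
print(letters_only)
```

If whitespace should be kept as well, the same idea can be written as
(?:[^\W\d_]|\s)+ instead.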

> buf = reg.match(string)
>
> But it doesn't work. If the string starts with a Cyrillic character,
> everything works fine. But if the string starts with a Latin character,
> match returns only the Latin characters.
>

I'm encoding the Unicode results as UTF-8 in order to print them, but
apart from that I'm not seeing any problem with either ordering:

Program
=======
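# -*- coding: utf-8 -*-
# Python 2 code: print statement and u"" literals.
import re
# (?u) makes \w and \s Unicode-aware; the raw string keeps the backslashes.
reg = re.compile(r'(?u)([\w\s]+)')

# Latin first, then Cyrillic.
found = reg.match(u"ya я")
print found.group(1).encode("utf-8")

# Cyrillic first, then Latin.
found = reg.match(u"я ya")
print found.group(1).encode("utf-8")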

Output
======
ya я
я ya
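
For readers on Python 3, the same check might look like the sketch
below. It assumes that str is Unicode and that the terminal can display
Cyrillic, so the u"" prefix, the (?u) flag and the .encode() call are
all dropped; the loop and the name text are illustrative only.

```python
import re

# Python 3: str patterns treat \w and \s as Unicode-aware by default.
reg = re.compile(r"([\w\s]+)")

# Try both orderings: Latin first, then Cyrillic first.
for text in ("ya я", "я ya"):
    found = reg.match(text)
    print(found.group(1))
```

Both orderings should match in full, as in the Python 2 output above.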


