Python and Cyrillic characters in regular expression

MRAB google at
Thu Sep 4 19:46:39 CEST 2008

On Sep 4, 3:42 pm, phasma <xpa... at> wrote:
> Hi, I'm trying to extract all alphabetic characters from a string.
> reg = re.compile('(?u)([\w\s]+)', re.UNICODE)

You don't need both (?u) and re.UNICODE: they mean the same thing.
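
As a rough check, here is a minimal sketch that compiles the pattern three ways and compares the results. It assumes Python 3, which is not what the code in this thread targets, and the variable names are only illustrative. In Python 3, str patterns are Unicode-aware by default, so the flag is redundant there in either spelling:

```python
import re

# Three spellings of the same pattern: inline flag, compile-time flag, both.
inline = re.compile(r'(?u)([\w\s]+)')
explicit = re.compile(r'([\w\s]+)', re.UNICODE)
both = re.compile(r'(?u)([\w\s]+)', re.UNICODE)

# All three end up with the same effective flags.
same_flags = inline.flags == explicit.flags == both.flags

# And they match the same text.
text = "я ya"
results = [p.match(text).group(1) for p in (inline, explicit, both)]
print(same_flags, results)
```

Giving the flag twice is harmless, just unnecessary.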

Note that this pattern matches more than letters: \w covers letters,
digits and the underscore, and \s adds whitespace.
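
To illustrate what the character class accepts, here's a small sketch. It assumes Python 3, and the sample string and variable names are made up for the example. If only letters are wanted, one common idiom (not something from the original question) is [^\W\d_], meaning "word characters that are neither digits nor the underscore":

```python
import re

# \w is letters, digits and underscore; \s is whitespace.
word_or_space = re.compile(r'[\w\s]+')

# Word characters minus digits and underscore, i.e. letters only.
letters_only = re.compile(r'[^\W\d_]+')

sample = "abc_123 я, ya"

# Matching starts at the beginning and stops at the comma.
prefix = word_or_space.match(sample).group(0)

# Every run of letters anywhere in the string.
letter_runs = letters_only.findall(sample)

print(repr(prefix))   # 'abc_123 я'
print(letter_runs)    # ['abc', 'я', 'ya']
```

The first pattern happily swallows the digits and the underscore; the second one skips them.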

> buf = reg.match(string)
> But it doesn't work. If the string starts with a Cyrillic character,
> everything works fine. But if the string starts with a Latin character,
> match returns only the Latin characters.

I'm encoding the Unicode results as UTF-8 in order to print them, but
I'm not having a problem with it otherwise:

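# -*- coding: utf-8 -*-
import re

# Python 2: (?u) makes \w and \s Unicode-aware.
reg = re.compile(r'(?u)([\w\s]+)')

# Latin first, then Cyrillic.
found = reg.match(u"ya я")
print found.group(1).encode("utf-8")

# Cyrillic first, then Latin.
found = reg.match(u"я ya")
print found.group(1).encode("utf-8")

Output:

ya я
я ya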
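
For comparison, a hedged sketch of the same test under Python 3 (my assumption; the thread itself is about Python 2). There str is already Unicode, so no u"" prefix or encoding step is needed. The re.ASCII variant is included only to show what matching looks like without Unicode awareness; I'm not claiming that is what went wrong in the original code:

```python
import re

# str patterns are Unicode-aware by default in Python 3.
unicode_aware = re.compile(r'([\w\s]+)')

# re.ASCII restricts \w to [a-zA-Z0-9_] and \s to ASCII whitespace.
ascii_only = re.compile(r'([\w\s]+)', re.ASCII)

# Both orderings match in full with the Unicode-aware pattern.
for text in ("ya я", "я ya"):
    found = unicode_aware.match(text)
    print(found.group(1))

# Without Unicode awareness the match stops at the first Cyrillic letter,
# or fails outright if the string starts with one.
latin_first = ascii_only.match("ya я")
cyrillic_first = ascii_only.match("я ya")
print(repr(latin_first.group(1)), cyrillic_first)
```

With re.ASCII the first string yields only the Latin prefix plus the space, and the second gives no match at all.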
