Regular expressions and non-standard character set

Tue Mar 20 13:02:36 EST 2001

Petri Mikael Kuittinen wrote:
> I want to match word boundaries using the special sequences \b and \B
> of regular expressions. They work OK when using the "standard"
> alphanumeric set [a-zA-Z0-9_]. But I would like them to work with
> a character set which also contains various "national characters"
> e.g. å, ä, ö, è, é, ü, ñ etc. and their uppercase equivalents.
>
> Locale doesn't seem to be the proper way to do it

are you sure?

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Swedish_Sweden.1252'
>>> import re
>>> re.findall(r"\b...\b", "spam, egg, bacon, and åäö")
['egg', 'and']
>>> re.findall(r"(?L)\b...\b", "spam, egg, bacon, and åäö")
['egg', 'and', 'åäö']

Cheers /F