[ python-Bugs-1611131 ] \b in unicode regex gives strange results

Thu Dec 14 01:30:19 CET 2006

Bugs item #1611131, was opened at 2006-12-07 23:44
Message generated for change (Comment added) made by akaihola
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1611131&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: Python 2.5
>Status: Deleted
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: akaihola (akaihola)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: \b in unicode regex gives strange results

Initial Comment:
The problem: This doesn't give a match:
>>> re.match(r'ä\b', 'ä ', re.UNICODE)

This works ok and gives a match:
>>> re.match(r'.\b', 'ä ', re.UNICODE)

Both of these work as well:
>>> re.match(r'a\b', 'a ', re.UNICODE)
>>> re.match(r'.\b', 'a ', re.UNICODE)

The docs say \b matches the empty string between \w and \W. These do match accordingly:
>>> re.match(r'\w', 'ä', re.UNICODE)
>>> re.match(r'\w', 'a', re.UNICODE)
>>> re.match(r'\W', ' ', re.UNICODE)

So something strange happens in my first example, and I can't help but assume it's a bug.

----------------------------------------------------------------------

>Comment By: akaihola (akaihola)
Date: 2006-12-14 02:30

Message:
Logged In: YES 
user_id=1432932
Originator: YES

OK, so this does work:
>>> re.match(ur'ä\b', u'ä ', re.UNICODE)

If I understand correctly, my examples used UTF-8 encoded byte strings (my
Ubuntu is UTF-8 by default), so 'ä' was two bytes rather than one
character, and \b and \w don't treat such a byte sequence as a single
character.
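One plausible reading of the original results, sketched here in Python 3
rather than the Python 2.5 used in this thread: with re.UNICODE on a byte
string, each byte is classified as if it were the code point with the same
value. Decoding the UTF-8 bytes as Latin-1 models that. The names
A_UMLAUT, per_byte_text and per_byte_pattern are illustrative, and the
Latin-1 modelling is an assumption, not something stated in the thread.

```python
import re

A_UMLAUT = "\u00e4"  # the character 'ä'

# UTF-8 encodes 'ä' as two bytes, so 'ä ' becomes three bytes.
utf8_bytes = (A_UMLAUT + " ").encode("utf-8")  # b'\xc3\xa4 '

# Assumption: model "Unicode rules applied per byte" by mapping every byte
# to the code point with the same value, which is what Latin-1 decoding does.
per_byte_text = utf8_bytes.decode("latin-1")                   # 'Ã¤ '
per_byte_pattern = A_UMLAUT.encode("utf-8").decode("latin-1")  # 'Ã¤'

# U+00C3 is a letter, so it counts as \w; U+00A4 is a currency symbol.
first_is_word = re.match(r"\w", per_byte_text[0]) is not None
second_is_word = re.match(r"\w", per_byte_text[1]) is not None

# The literal consumes both bytes; what follows is symbol + space, no boundary.
literal_then_boundary = re.match(per_byte_pattern + r"\b", per_byte_text)

# '.' consumes only the first byte; letter + symbol is a word boundary.
dot_then_boundary = re.match(r".\b", per_byte_text)

print(first_is_word, second_is_word)
print(literal_then_boundary, dot_then_boundary)
```

Under this model the first example fails because the last byte of 'ä' is
not a word character, while '.\b' succeeds by matching only the first
byte. That would also be consistent with the example working on a system
whose default encoding is Latin-1, where 'ä' is a single byte.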

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-12-08 22:51

Message:
Logged In: YES 
user_id=849994
Originator: NO

FWIW, the first example works fine for me with and without Unicode
strings.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-12-08 19:18

Message:
Logged In: YES 
user_id=21627
Originator: NO

Notice that the re.UNICODE flag is only meaningful if you are using
Unicode strings; in the examples you give, you are using byte strings.

Please re-test with Unicode strings both as the expression and as the
string to match.
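A minimal sketch of the suggested re-test, written for Python 3 (an
assumption; the thread concerns Python 2.5, where the equivalent literals
would be u'...' and ur'...'). In Python 3, str is the Unicode type and
bytes plays the role of the old byte string. The re.UNICODE flag is
omitted because it is the default for str patterns and is rejected for
bytes patterns. A_UMLAUT is an illustrative name.

```python
import re

A_UMLAUT = "\u00e4"  # the character 'ä'

text = A_UMLAUT + " "
pattern = A_UMLAUT + r"\b"

# Unicode pattern against Unicode text: 'ä' is a word character,
# so there is a boundary between it and the space.
text_match = re.match(pattern, text)

# The same pattern and text as UTF-8 bytes: bytes patterns use ASCII-only
# rules for \w, so neither byte of 'ä' is a word character and \b fails.
bytes_match = re.match(pattern.encode("utf-8"), text.encode("utf-8"))

print(text_match)
print(bytes_match)
```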

----------------------------------------------------------------------

Comment By: akaihola (akaihola)
Date: 2006-12-08 00:18

Message:
Logged In: YES 
user_id=1432932
Originator: YES

As a workaround I currently use a regex like r'ä(?=\W)', which seems to
work OK.

Also, the \b problem doesn't seem to exist in the \W\w case, i.e. at the
beginning of words.
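A short sketch (Python 3, with illustrative names) of how the lookahead
workaround differs from \b. The lookahead (?=\W) needs an actual non-word
character after 'ä', so unlike \b it does not match at the end of the
string; the negative lookahead (?!\w) is shown as one possible alternative
that does.

```python
import re

A_UMLAUT = "\u00e4"  # the character 'ä'

patterns = {
    "boundary": re.compile(A_UMLAUT + r"\b"),
    "lookahead": re.compile(A_UMLAUT + r"(?=\W)"),
    "negative_lookahead": re.compile(A_UMLAUT + r"(?!\w)"),
}

subjects = {
    "before_space": A_UMLAUT + " ",
    "at_end": A_UMLAUT,
    "before_letter": A_UMLAUT + "b",
}

# results[pattern_name][subject_name] is True when the pattern matches.
results = {
    p_name: {s_name: p.match(s) is not None for s_name, s in subjects.items()}
    for p_name, p in patterns.items()
}

# On UTF-8 bytes the lookahead still matches, because it only inspects the
# byte after 'ä' (a space) and never classifies the bytes of 'ä' itself.
bytes_lookahead = re.match(
    (A_UMLAUT + r"(?=\W)").encode("utf-8"), (A_UMLAUT + " ").encode("utf-8")
)

for name, row in results.items():
    print(name, row)
```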

----------------------------------------------------------------------
